
Opportunistically cache KVA for execve arguments
ClosedPublic

Authored by markj on Dec 27 2016, 3:04 AM.

Details

Summary

With some measures in place to reduce free page and page queue lock
contention, we see a lot of kernel map lock contention during
highly parallel builds. This turned out to be the result of parallel
calls to exec_alloc_args(), which allocates a large KVA range that
is released at the end of the execve call. We currently always
reserve 16 * 260KB for this purpose in a submap. This is excessive on
small systems and not enough on large systems. Moreover, exec_map is
used mainly for execve arguments, meaning that each allocation is the
same size, so the overhead of maintaining the vm_map splay tree is
mostly unnecessary.

This change modifies the exec_map sizing policy so that it depends on
the number of CPUs. It also allows an execve argument range to be cached
per CPU, letting execve avoid the exec_map sx lock in the common case.
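
As a rough sketch of the fast path this enables (all names below are illustrative, not necessarily the committed interface, and are reused by the later sketches in this thread), each CPU keeps a one-entry cache of argument KVA ranges that is claimed with an atomic swap, so a cache hit takes no lock at all; the slow path behind it is sketched further down the thread:

    /*
     * Illustrative sketch only: a one-entry per-CPU cache of exec
     * argument KVA ranges.  exec_args_kva_slow_alloc() is a
     * hypothetical helper that falls back to a global free list.
     */
    #include <sys/param.h>
    #include <sys/pcpu.h>
    #include <sys/queue.h>
    #include <machine/atomic.h>
    #include <vm/vm.h>

    struct exec_args_kva {
            vm_offset_t addr;
            SLIST_ENTRY(exec_args_kva) next;
    };

    static DPCPU_DEFINE(struct exec_args_kva *, exec_args_kva_cache);

    static vm_offset_t
    exec_alloc_args_kva(void **cookie)
    {
            struct exec_args_kva *argkva;

            /* Fast path: atomically steal this CPU's cached range. */
            argkva = (void *)atomic_readandclear_ptr(
                (uintptr_t *)DPCPU_PTR(exec_args_kva_cache));
            if (argkva == NULL)
                    argkva = exec_args_kva_slow_alloc();
            *(struct exec_args_kva **)cookie = argkva;
            return (argkva->addr);
    }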

Test Plan

On a system with 16 cores, I counted the number of cache misses and the
total number of kern_execve() calls during a -j N buildkernel.

N = 8: 66/11387
N = 16: 243/11400
N = 32: 722/11389
N = 64: 1059/11391
N = 128: 1212/11387

Diff Detail

Repository
rS FreeBSD src repository - subversion

Event Timeline

markj retitled this revision to Opportunistically cache KVA for execve arguments.
markj edited the test plan for this revision.
markj updated this object.
markj edited the test plan for this revision.

I think that caching the allocated KVA is fine, but I suspect that caching the swap pages behind the KVA is not needed. Perhaps you can free the backing object, or free only the pages from its queue. Or perhaps issue vm_object_madvise(MADV_FREE) on the object?

sys/vm/vm_init.c
95 ↗(On Diff #23289)

Why is this tunable removed?

275 ↗(On Diff #23289)

I believe that ncpu * 2 is too low a value for some workloads, in particular for runs where a lot of parallel shells are executing commands. I would think that, say, 8*ncpu or 16*ncpu is more reasonable auto-tuning, perhaps capped at some absolute number on 32-bit arches, which are starved for KVA.

In D8921#184646, @kib wrote:

I think that caching the allocated KVA is fine, but I suspect that caching the swap pages behind the KVA is not needed. Perhaps you can free the backing object, or free only the pages from its queue. Or perhaps issue vm_object_madvise(MADV_FREE) on the object?

I like the idea of using madvise(MADV_FREE) here, but I don't see how to do so without reintroducing a bottleneck, as there's no direct way to get the object for a given buffer without acquiring the map lock. In particular, vm_map_madvise(MADV_FREE) must acquire the map read lock to look up the entry. I believe each KVA range allocated from the exec_map will have a unique swap object allocated for it, so contention on the object lock in vm_object_madvise() is not a concern. The object pointer could perhaps be cached with each KVA range, using vm_map_lookup_entry(), but this is a hack. Do you have any suggestions here?

sys/vm/vm_init.c
95 ↗(On Diff #23289)

It didn't make sense to me to keep this parameter configurable now that we attempt to autotune. Also, misconfiguration can lead to deadlock: when a thread frees a KVA range to the per-CPU cache, it does not wake up threads waiting in kmap_alloc_wait(), so there need to be at least NCPU+1 ranges already allocated to ensure that at least one thread calls kmap_free_wakeup() in exec_free_args().

275 ↗(On Diff #23289)

exec_map usage is effectively bounded by NCPU since the KVA and backing pages are released before execution of the new image begins, so there's a risk of starvation only if the thread is preempted or blocks in copyin*() after calling exec_alloc_args(). 8*NCPU seems quite high to me given the stats I listed in the "testing" section. Is there some consideration I'm missing?

In D8921#184747, @markj wrote:
In D8921#184646, @kib wrote:

I think that caching the allocated KVA is fine, but I suspect that caching the swap pages behind the KVA is not needed. Perhaps you can free the backing object, or free only the pages from its queue. Or perhaps issue vm_object_madvise(MADV_FREE) on the object?

I like the idea of using madvise(MADV_FREE) here, but I don't see how to do so without reintroducing a bottleneck, as there's no direct way to get the object for a given buffer without acquiring the map lock. In particular, vm_map_madvise(MADV_FREE) must acquire the map read lock to look up the entry. I believe each KVA range allocated from the exec_map will have a unique swap object allocated for it, so contention on the object lock in vm_object_madvise() is not a concern. The object pointer could perhaps be cached with each KVA range, using vm_map_lookup_entry(), but this is a hack. Do you have any suggestions here?

I will try to reply to both the main comment and the inline comments in one text; hopefully this is digestible.

First, the execve(2) call in the kernel has many (too many) sleep points. There are a lot of places where the vm_map lock for the old/new vmspace is taken, both explicitly in the kern_exec.c and imgact_elf.c code and implicitly on page faults, as well as binary and ELF interpreter vnode locks, sleeps for buffer locks, and sf sleeps on 32-bit arches. As indirect proof of my statement, please see r282708, which was very easy to trigger.

In other words, I think you only see <= ncpu parallel execs because you most likely used make -j ncpu. Were the number of jobs increased, you would get more execs than CPUs, and many of those threads would appear to sleep inside the exec code, not at the outer boundary of the strings copyin or copyout KVA allocations. Moreover, I think that with freshly allocated strings memory, which was not taken in a warmed state from the per-CPU cache, you would get page faults on the object and page allocations for the exec map, reintroducing the exec map lock contention.

The above simultaneously explains my request not to limit the exec map size to 2*ncpu areas, and not to consider, e.g., taking the map read lock to find the entry and its backing object as reintroducing much contention. It could be, but only for make -j ncpu. I agree that it must be not less than 2*ncpu, but IMO it should be larger.

WRT caching the backing object together with the KVA, this might be a good solution. You could also store the object somewhere in struct image_params or image_args, filling the value near other strings-KVA accesses which are likely to fault. Then exec_free_args() could avoid taking the map lock.
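
A minimal sketch of that suggestion, assuming a hypothetical strings_obj field that does not exist in the real struct image_args (the helper name and buffer size are also illustrative; the kernel's exec argument buffer is PATH_MAX + ARG_MAX bytes):

    /*
     * Hypothetical sketch: cache the backing object alongside the
     * strings buffer so the free path can call vm_object_madvise()
     * without looking the object up under the map lock.  The
     * strings_obj field is NOT part of the real struct image_args.
     */
    struct image_args {
            char            *buf;           /* strings buffer KVA */
            vm_object_t      strings_obj;   /* cached backing object */
            /* ... the real structure has many more fields ... */
    };

    static void
    exec_free_args_obj(struct image_args *args)
    {
            /* Mark the buffer's pages clean; no map lock needed. */
            vm_object_madvise(args->strings_obj, 0,
                atop(PATH_MAX + ARG_MAX), MADV_FREE);
    }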

In D8921#184798, @kib wrote:

I will try to reply to both the main comment and the inline comments in one text; hopefully this is digestible.

First, the execve(2) call in the kernel has many (too many) sleep points. There are a lot of places where the vm_map lock for the old/new vmspace is taken, both explicitly in the kern_exec.c and imgact_elf.c code and implicitly on page faults, as well as binary and ELF interpreter vnode locks, sleeps for buffer locks, and sf sleeps on 32-bit arches. As indirect proof of my statement, please see r282708, which was very easy to trigger.

In other words, I think you only see <= ncpu parallel execs because you most likely used make -j ncpu. Were the number of jobs increased, you would get more execs than CPUs, and many of those threads would appear to sleep inside the exec code, not at the outer boundary of the strings copyin or copyout KVA allocations. Moreover, I think that with freshly allocated strings memory, which was not taken in a warmed state from the per-CPU cache, you would get page faults on the object and page allocations for the exec map, reintroducing the exec map lock contention.

Note that the testing I reported included, e.g., a -j 128 buildkernel on a 16-core system. In that somewhat extreme case, the per-CPU cache still had a decent hit rate (close to 90%). However, that was under favourable conditions.

The above simultaneously explains my request not to limit the exec map size to 2*ncpu areas, and not to consider, e.g., taking the map read lock to find the entry and its backing object as reintroducing much contention. It could be, but only for make -j ncpu. I agree that it must be not less than 2*ncpu, but IMO it should be larger.

Ok. I did some buildkernel testing with the sleep points in kern_execve() instrumented and found that it's indeed quite common to end up sleeping on vnode locks (usually in the name cache or the image activator). It varies quite a lot depending on the setup: with an NFS root, we sleep at some point in every execve, while with a local SSD, approximately 10% of execve calls sleep somewhere. So 2*ncpu now seems low to me as well.

make -j ncpu is the main target of this optimization, so I rather dislike the prospect of acquiring the map read lock in the fast path.

WRT caching the backing object together with the KVA, this might be a good solution. You could also store the object somewhere in struct image_params or image_args, filling the value near other strings-KVA accesses which are likely to fault. Then exec_free_args() could avoid taking the map lock.

I'll try to find a clean way to do this. Do you consider this aspect (applying MADV_FREE) to be a prerequisite for this change?

In D8921#185180, @markj wrote:
In D8921#184798, @kib wrote:

WRT caching the backing object together with the KVA, this might be a good solution. You could also store the object somewhere in struct image_params or image_args, filling the value near other strings-KVA accesses which are likely to fault. Then exec_free_args() could avoid taking the map lock.

I'll try to find a clean way to do this. Do you consider this aspect (applying MADV_FREE) to be a prerequisite for this change?

How about this: instead of adding extra complexity and shuffling to the fast path, add a lowmem handler that iterates over the cache entries and calls vm_map_madvise(MADV_FREE) on each one. Then they will be quickly reclaimed by the page daemon.
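
A sketch of what such a handler might look like, assuming a mutex-protected list of cached ranges like the one that appears later in this review (exec_args_kva_freelist, exec_args_kva_mtx, and exec_map_entry_size are hypothetical names; the vm_lowmem eventhandler is the real hook):

    /*
     * Sketch: on memory pressure, apply MADV_FREE to every cached
     * range so the page daemon can reclaim the backing pages.  The
     * list is drained first so the mutex is not held across
     * vm_map_madvise(), which takes the map read lock and may sleep.
     */
    static void
    exec_args_kva_lowmem(void *arg __unused, int flags __unused)
    {
            SLIST_HEAD(, exec_args_kva) head;
            struct exec_args_kva *argkva;

            SLIST_INIT(&head);
            mtx_lock(&exec_args_kva_mtx);
            SLIST_SWAP(&head, &exec_args_kva_freelist, exec_args_kva);
            mtx_unlock(&exec_args_kva_mtx);

            while ((argkva = SLIST_FIRST(&head)) != NULL) {
                    SLIST_REMOVE_HEAD(&head, next);
                    vm_map_madvise(exec_map, argkva->addr,
                        argkva->addr + exec_map_entry_size, MADV_FREE);
                    mtx_lock(&exec_args_kva_mtx);
                    SLIST_INSERT_HEAD(&exec_args_kva_freelist, argkva, next);
                    mtx_unlock(&exec_args_kva_mtx);
            }
    }
    EVENTHANDLER_DEFINE(vm_lowmem, exec_args_kva_lowmem, NULL,
        EVENTHANDLER_PRI_ANY);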

In D8921#185191, @markj wrote:
In D8921#185180, @markj wrote:
In D8921#184798, @kib wrote:

WRT caching the backing object together with the KVA, this might be a good solution. You could also store the object somewhere in struct image_params or image_args, filling the value near other strings-KVA accesses which are likely to fault. Then exec_free_args() could avoid taking the map lock.

I'll try to find a clean way to do this. Do you consider this aspect (applying MADV_FREE) to be a prerequisite for this change?

How about this: instead of adding extra complexity and shuffling to the fast path, add a lowmem handler that iterates over the cache entries and calls vm_map_madvise(MADV_FREE) on each one. Then they will be quickly reclaimed by the page daemon.

Consider a usual modern desktop with 8 hardware threads. There, with an args size of ~200KB, you get 1.5MB of garbage cached. For a larger server-class machine with e.g. 32 threads, the lost memory is 6MB. I tried to agree with this, but I do not like it.

Let's take a step back and look at the exec_map. It is relatively small, and allocations from it are made only at a fixed size, and only for a single purpose. I believe that a viable strategy for getting rid of the vm_map lock contention there, while also avoiding sinking memory into useless garbage, is to preallocate all the args buffer KVA and the backing objects in advance. Since these structures are never freed, and the cache elements are never freed either, it is very easy to implement a lock-less list of the free elements, with a fallback to a sleeping wait in case of cache depletion. Since the vm_map entries are stable, you can directly dereference map->object.vm_object for madvise.

Note that we do need exec_map to be a real vm_map, so that vm_fault() works.
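
A sketch of the boot-time preallocation kib describes, reusing the hypothetical names from the earlier sketch (exec_map_entries and exec_map_entry_size are assumed globals, and the M_TEMP malloc type and SYSINIT placement are illustrative):

    /*
     * Sketch: carve exec_map into fixed-size ranges once at boot so
     * the map lock never appears on the execve fast path again.
     */
    static SLIST_HEAD(, exec_args_kva) exec_args_kva_freelist;
    static struct mtx exec_args_kva_mtx;

    static void
    exec_prealloc_args_kva(void *arg __unused)
    {
            struct exec_args_kva *argkva;
            u_int i;

            SLIST_INIT(&exec_args_kva_freelist);
            mtx_init(&exec_args_kva_mtx, "exec args kva", NULL, MTX_DEF);
            for (i = 0; i < exec_map_entries; i++) {
                    argkva = malloc(sizeof(*argkva), M_TEMP, M_WAITOK);
                    argkva->addr = kmap_alloc_wait(exec_map,
                        exec_map_entry_size);
                    SLIST_INSERT_HEAD(&exec_args_kva_freelist, argkva, next);
            }
    }
    SYSINIT(exec_args_kva, SI_SUB_EXEC, SI_ORDER_ANY,
        exec_prealloc_args_kva, NULL);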

markj edited edge metadata.

Preallocate exec args KVA during boot

Cache the corresponding map entries in a list protected by a mutex.
When entries are freed, vm_object_madvise() ensures that the backing
pages are marked clean and can be reclaimed quickly. The mutex also
synchronizes sleep/wakeup when the cache is empty, and a single-entry
per-CPU cache often lets us avoid the global free list entirely.

  • Use pmap_advise() to clear MD reference bits
  • Fix casts
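
A sketch of the mutex-protected slow path behind the per-CPU cache described in this update, pairing with the exec_alloc_args_kva() sketch earlier in the thread (names remain hypothetical):

    /*
     * Sketch: the global free list, used when this CPU's one-entry
     * cache is empty.  The mutex doubles as the sleep/wakeup lock
     * when every preallocated range is in use.
     */
    static struct exec_args_kva *
    exec_args_kva_slow_alloc(void)
    {
            struct exec_args_kva *argkva;

            mtx_lock(&exec_args_kva_mtx);
            while ((argkva = SLIST_FIRST(&exec_args_kva_freelist)) == NULL)
                    (void)mtx_sleep(&exec_args_kva_freelist,
                        &exec_args_kva_mtx, 0, "execkva", 0);
            SLIST_REMOVE_HEAD(&exec_args_kva_freelist, next);
            mtx_unlock(&exec_args_kva_mtx);
            return (argkva);
    }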
kib edited edge metadata.

I am somewhat curious whether we still boot on very large modern machines with the i386 kernel, using either the normal or the PAE pmap. I know that the stock PAE kernel has edge issues with 16G configs. I do not suggest that you actually test this.

sys/kern/kern_exec.c
1346 ↗(On Diff #23504)

Why do you need to manually preallocate the object? A default object will be created automatically on the first page fault.

If you are trying to save vm_map lock upgrades in the page faults, IMO it is not worth the effort. There is quite a small number of exec_map entries, and they are instantiated quite rapidly.

This revision is now accepted and ready to land. Jan 1 2017, 10:53 AM
sys/kern/kern_exec.c
1346 ↗(On Diff #23504)

The object is actually otherwise created by the vm_map_lookup() call below. Here I'm trying to ensure that each KVA range gets its own object: vm_map_insert() will coalesce the ranges into a single entry if no object is specified, so all the exec map entries would belong to the same object, and thus the vm_object_madvise() calls would all acquire the same write lock.

I think I'm trying overly hard to avoid global locks in exec_free_args_kva(), and I'm not succeeding anyway: pmap_advise() acquires the kernel pmap mutex, and vm_page_advise() acquires the inactive queue lock to requeue each page. For now I think I'll remove the vm_object preallocation and just use vm_map_madvise() instead of pmap_advise()+vm_object_madvise(). vm_map_madvise(MADV_FREE) requires only the map read lock, and vm_object_madvise() can be optimized somewhat for my use case. For instance, I believe it only needs the write lock for OBJT_SWAP objects, and it can avoid a bunch of vm_page_lookup() calls on non-resident pages in the case that there's no backing object.
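
A sketch of the resulting free path, again with the hypothetical names used throughout this thread: MADV_FREE is applied through vm_map_madvise(), which takes only the map read lock, and the range is then parked in the per-CPU slot or returned to the global list with a wakeup:

    /*
     * Sketch: mark the backing pages clean, then try the one-entry
     * per-CPU cache; on collision, fall back to the global list and
     * wake one waiter sleeping in the slow path.
     */
    static void
    exec_free_args_kva(void *cookie)
    {
            struct exec_args_kva *argkva;
            vm_offset_t base;

            argkva = cookie;
            base = argkva->addr;

            /* Needs only the exec_map read lock. */
            vm_map_madvise(exec_map, base, base + exec_map_entry_size,
                MADV_FREE);
            if (!atomic_cmpset_ptr(
                (uintptr_t *)DPCPU_PTR(exec_args_kva_cache),
                (uintptr_t)NULL, (uintptr_t)argkva)) {
                    mtx_lock(&exec_args_kva_mtx);
                    SLIST_INSERT_HEAD(&exec_args_kva_freelist, argkva, next);
                    wakeup_one(&exec_args_kva_freelist);
                    mtx_unlock(&exec_args_kva_mtx);
            }
    }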

markj edited edge metadata.

Avoid keeping track of exec map entries and just use vm_map_madvise()

This revision now requires review to proceed. Jan 2 2017, 2:23 AM
kib edited edge metadata.
This revision is now accepted and ready to land. Jan 2 2017, 8:14 AM
sys/kern/kern_exec.c
1346 ↗(On Diff #23504)

You may avoid the coalescing by creating the vm_map entries top-down.

This revision was automatically updated to reflect the committed changes.