
Opportunistically cache KVA for execve arguments
ClosedPublic

Authored by markj on Dec 27 2016, 3:04 AM.

Details

Summary

With some measures in place to reduce free page and page queue lock
contention, we see a lot of kernel map lock contention during
highly parallel builds. This turned out to be the result of parallel
calls to exec_alloc_args(), which allocates a large KVA range that
is released at the end of the execve call. We currently always
reserve 16 * 260KB for this purpose in a submap. This is excessive on
small systems and not enough on large systems. Moreover, exec_map is
used mainly for execve arguments, meaning that each allocation is the
same size, so the overhead of maintaining the vm_map splay tree is
mostly unnecessary.

This change modifies the exec_map sizing policy so that it depends on
the number of CPUs. It also allows an execve argument range to be cached
per CPU, letting execve avoid the exec_map sx lock in the common case.
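
As a rough sketch of the fast path this enables (all names below are illustrative, not necessarily the committed interface, and are reused by the later sketches in this thread), each CPU keeps a one-entry cache of argument KVA ranges that is claimed with an atomic swap, so a cache hit takes no lock at all; the slow path behind it is sketched further down the thread:

    /*
     * Illustrative sketch only: a one-entry per-CPU cache of exec
     * argument KVA ranges.  exec_args_kva_slow_alloc() is a
     * hypothetical helper that falls back to a global free list.
     */
    #include <sys/param.h>
    #include <sys/pcpu.h>
    #include <sys/queue.h>
    #include <machine/atomic.h>
    #include <vm/vm.h>

    struct exec_args_kva {
            vm_offset_t addr;
            SLIST_ENTRY(exec_args_kva) next;
    };

    static DPCPU_DEFINE(struct exec_args_kva *, exec_args_kva_cache);

    static vm_offset_t
    exec_alloc_args_kva(void **cookie)
    {
            struct exec_args_kva *argkva;

            /* Fast path: atomically steal this CPU's cached range. */
            argkva = (void *)atomic_readandclear_ptr(
                (uintptr_t *)DPCPU_PTR(exec_args_kva_cache));
            if (argkva == NULL)
                    argkva = exec_args_kva_slow_alloc();
            *(struct exec_args_kva **)cookie = argkva;
            return (argkva->addr);
    }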

Test Plan

On a system with 16 cores, I counted the number of cache misses and the
total number of kern_execve() calls during a -j N buildkernel.

N = 8: 66/11387
N = 16: 243/11400
N = 32: 722/11389
N = 64: 1059/11391
N = 128: 1212/11387

Diff Detail

Repository
rS FreeBSD src repository - subversion

Event Timeline

markj retitled this revision to Opportunistically cache KVA for execve arguments.
markj edited the test plan for this revision.
markj updated this object.
markj edited the test plan for this revision.

I think that caching the allocated KVA is fine, but I suspect that caching the swap pages behind the KVA is not needed. Perhaps you can free the backing object, or free only the pages from its queue. Or perhaps issue vm_object_madvise(MADV_FREE) on the object?

sys/vm/vm_init.c
95 ↗(On Diff #23289)

Why is this tunable removed?

275 ↗(On Diff #23289)

I believe that ncpu * 2 is too low a value for some workloads, in particular for runs where a lot of parallel shells are executing commands. I would think that, say, 8*ncpu or 16*ncpu is more reasonable auto-tuning, perhaps capped at some absolute number on 32-bit arches, which are starved for KVA.

In D8921#184646, @kib wrote:

I think that caching the allocated KVA is fine, but I suspect that caching the swap pages behind the KVA is not needed. Perhaps you can free the backing object, or free only the pages from its queue. Or perhaps issue vm_object_madvise(MADV_FREE) on the object?

I like the idea of using madvise(MADV_FREE) here, but I don't see how to do so without reintroducing a bottleneck, as there's no direct way to get the object for a given buffer without acquiring the map lock. In particular, vm_map_madvise(MADV_FREE) must acquire the map read lock to look up the entry. I believe each KVA range allocated from the exec_map will have a unique swap object allocated for it, so contention on the object lock in vm_object_madvise() is not a concern. The object pointer could perhaps be cached with each KVA range, using vm_map_lookup_entry(), but this is a hack. Do you have any suggestions here?

sys/vm/vm_init.c
95 ↗(On Diff #23289)

It didn't make sense to me to keep this parameter configurable now that we attempt to autotune. Also, misconfiguration can lead to deadlock: when a thread frees a KVA range to the per-CPU cache, it does not wake up threads waiting in kmap_alloc_wait(), so there need to be at least NCPU+1 ranges already allocated to ensure that at least one thread calls kmap_free_wakeup() in exec_free_args().

275 ↗(On Diff #23289)

exec_map usage is effectively bounded by NCPU since the KVA and backing pages are released before execution of the new image begins, so there's a risk of starvation only if the thread is preempted or blocks in copyin*() after calling exec_alloc_args(). 8*NCPU seems quite high to me given the stats I listed in the "testing" section. Is there some consideration I'm missing?

In D8921#184747, @markj wrote:
In D8921#184646, @kib wrote:

I think that caching the allocated KVA is fine, but I suspect that caching the swap pages behind the KVA is not needed. Perhaps you can free the backing object, or free only the pages from its queue. Or perhaps issue vm_object_madvise(MADV_FREE) on the object?

I like the idea of using madvise(MADV_FREE) here, but I don't see how to do so without reintroducing a bottleneck, as there's no direct way to get the object for a given buffer without acquiring the map lock. In particular, vm_map_madvise(MADV_FREE) must acquire the map read lock to look up the entry. I believe each KVA range allocated from the exec_map will have a unique swap object allocated for it, so contention on the object lock in vm_object_madvise() is not a concern. The object pointer could perhaps be cached with each KVA range, using vm_map_lookup_entry(), but this is a hack. Do you have any suggestions here?

I will try to reply to both the main comment and the inline comments in one text; hopefully this is digestible.

First, the execve(2) call in the kernel has many (too many) sleep points. There are a lot of places where the vm_map lock for the old/new vmspace is taken, both explicitly in the kern_exec.c and imgact_elf.c code and implicitly on page faults, as well as binary and ELF interpreter vnode locks, sleeps for buffer locks, and sf sleeps on 32-bit arches. As indirect proof of my statement, please see r282708, which was very easy to trigger.

In other words, I think you only see <= ncpu parallel execs because you most likely used make -j ncpu. Were the number of jobs increased, you would get more execs than CPUs, and many of those threads would appear to sleep inside the exec code, not at the outer boundary of the strings copyin or copyout KVA allocations. Moreover, I think that with freshly allocated strings memory, which was not taken in a warmed state from the per-CPU cache, you would get page faults on the object and page allocations for the exec map, reintroducing the exec map lock contention.

The above simultaneously explains my request not to limit the exec map size to 2*ncpu areas, and not to consider, e.g., taking the map read lock to find the entry and its backing object as reintroducing much contention. It could be, but only for make -j ncpu. I agree that it must be not less than 2*ncpu, but IMO it should be larger.

WRT caching the backing object together with the KVA, this might be a good solution. You could also store the object somewhere in struct image_params or image_args, filling the value near other strings-KVA accesses which are likely to fault. Then exec_free_args() could avoid taking the map lock.
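
A minimal sketch of that suggestion, assuming a hypothetical strings_obj field that does not exist in the real struct image_args (the helper name and buffer size are also illustrative; the kernel's exec argument buffer is PATH_MAX + ARG_MAX bytes):

    /*
     * Hypothetical sketch: cache the backing object alongside the
     * strings buffer so the free path can call vm_object_madvise()
     * without looking the object up under the map lock.  The
     * strings_obj field is NOT part of the real struct image_args.
     */
    struct image_args {
            char            *buf;           /* strings buffer KVA */
            vm_object_t      strings_obj;   /* cached backing object */
            /* ... the real structure has many more fields ... */
    };

    static void
    exec_free_args_obj(struct image_args *args)
    {
            /* Mark the buffer's pages clean; no map lock needed. */
            vm_object_madvise(args->strings_obj, 0,
                atop(PATH_MAX + ARG_MAX), MADV_FREE);
    }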

In D8921#184798, @kib wrote:

I will try to reply to both the main comment and the inline comments in one text; hopefully this is digestible.

First, the execve(2) call in the kernel has many (too many) sleep points. There are a lot of places where the vm_map lock for the old/new vmspace is taken, both explicitly in the kern_exec.c and imgact_elf.c code and implicitly on page faults, as well as binary and ELF interpreter vnode locks, sleeps for buffer locks, and sf sleeps on 32-bit arches. As indirect proof of my statement, please see r282708, which was very easy to trigger.

In other words, I think you only see <= ncpu parallel execs because you most likely used make -j ncpu. Were the number of jobs increased, you would get more execs than CPUs, and many of those threads would appear to sleep inside the exec code, not at the outer boundary of the strings copyin or copyout KVA allocations. Moreover, I think that with freshly allocated strings memory, which was not taken in a warmed state from the per-CPU cache, you would get page faults on the object and page allocations for the exec map, reintroducing the exec map lock contention.

Note that the testing I reported included, e.g., a -j 128 buildkernel on a 16-core system. In that somewhat extreme case, the per-CPU cache still had a decent hit rate (close to 90%). However, that was under favourable conditions.

The above simultaneously explains my request not to limit the exec map size to 2*ncpu areas, and not to consider, e.g., taking the map read lock to find the entry and its backing object as reintroducing much contention. It could be, but only for make -j ncpu. I agree that it must be not less than 2*ncpu, but IMO it should be larger.

Ok. I did some buildkernel testing with the sleep points in kern_execve() instrumented and found that it's indeed quite common to end up sleeping on vnode locks (usually in the name cache or the image activator). It varies quite a lot depending on the setup: with an NFS root, we sleep at some point in every execve, while with a local SSD, approximately 10% of execve calls sleep somewhere. So 2*ncpu now seems low to me as well.

make -j ncpu is the main target of this optimization, so I rather dislike the prospect of acquiring the map read lock in the fast path.

WRT caching the backing object together with the KVA, this might be a good solution. You could also store the object somewhere in struct image_params or image_args, filling the value near other strings-KVA accesses which are likely to fault. Then exec_free_args() could avoid taking the map lock.

I'll try to find a clean way to do this. Do you consider this aspect (applying MADV_FREE) to be a prerequisite for this change?

In D8921#185180, @markj wrote:
In D8921#184798, @kib wrote:

WRT caching the backing object together with the KVA, this might be a good solution. You could also store the object somewhere in struct image_params or image_args, filling the value near other strings-KVA accesses which are likely to fault. Then exec_free_args() could avoid taking the map lock.

I'll try to find a clean way to do this. Do you consider this aspect (applying MADV_FREE) to be a prerequisite for this change?

How about this: instead of adding extra complexity and shuffling to the fast path, add a lowmem handler that iterates over the cache entries and calls vm_map_madvise(MADV_FREE) on each one. Then they will be quickly reclaimed by the page daemon.
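
A sketch of what such a handler might look like, assuming a mutex-protected list of cached ranges like the one that appears later in this review (exec_args_kva_freelist, exec_args_kva_mtx, and exec_map_entry_size are hypothetical names; the vm_lowmem eventhandler is the real hook):

    /*
     * Sketch: on memory pressure, apply MADV_FREE to every cached
     * range so the page daemon can reclaim the backing pages.  The
     * list is drained first so the mutex is not held across
     * vm_map_madvise(), which takes the map read lock and may sleep.
     */
    static void
    exec_args_kva_lowmem(void *arg __unused, int flags __unused)
    {
            SLIST_HEAD(, exec_args_kva) head;
            struct exec_args_kva *argkva;

            SLIST_INIT(&head);
            mtx_lock(&exec_args_kva_mtx);
            SLIST_SWAP(&head, &exec_args_kva_freelist, exec_args_kva);
            mtx_unlock(&exec_args_kva_mtx);

            while ((argkva = SLIST_FIRST(&head)) != NULL) {
                    SLIST_REMOVE_HEAD(&head, next);
                    vm_map_madvise(exec_map, argkva->addr,
                        argkva->addr + exec_map_entry_size, MADV_FREE);
                    mtx_lock(&exec_args_kva_mtx);
                    SLIST_INSERT_HEAD(&exec_args_kva_freelist, argkva, next);
                    mtx_unlock(&exec_args_kva_mtx);
            }
    }
    EVENTHANDLER_DEFINE(vm_lowmem, exec_args_kva_lowmem, NULL,
        EVENTHANDLER_PRI_ANY);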

In D8921#185191, @markj wrote:
In D8921#185180, @markj wrote:
In D8921#184798, @kib wrote:

WRT caching the backing object together with the KVA, this might be a good solution. You could also store the object somewhere in struct image_params or image_args, filling the value near other strings-KVA accesses which are likely to fault. Then exec_free_args() could avoid taking the map lock.

I'll try to find a clean way to do this. Do you consider this aspect (applying MADV_FREE) to be a prerequisite for this change?

How about this: instead of adding extra complexity and shuffling to the fast path, add a lowmem handler that iterates over the cache entries and calls vm_map_madvise(MADV_FREE) on each one. Then they will be quickly reclaimed by the page daemon.

Consider a usual modern desktop with 8 hardware threads. There, with an args size of ~200KB, you get 1.5MB of garbage cached. For a larger server-class machine with e.g. 32 threads, the lost memory is 6MB. I tried to agree with this, but I do not like it.

Let's take a step back and look at the exec_map. It is relatively small, and allocations from it are made only at a fixed size, and only for a single purpose. I believe that a viable strategy for getting rid of the vm_map lock contention there, while also avoiding sinking memory into useless garbage, is to preallocate all the args buffer KVA and the backing objects in advance. Since these structures are never freed, and the cache elements are never freed either, it is very easy to implement a lock-less list of the free elements, with a fallback to a sleeping wait in case of cache depletion. Since the vm_map entries are stable, you can directly dereference map->object.vm_object for madvise.

Note that we do need exec_map to be a real vm_map, so that vm_fault() works.
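
A sketch of the boot-time preallocation kib describes, reusing the hypothetical names from the earlier sketch (exec_map_entries and exec_map_entry_size are assumed globals, and the M_TEMP malloc type and SYSINIT placement are illustrative):

    /*
     * Sketch: carve exec_map into fixed-size ranges once at boot so
     * the map lock never appears on the execve fast path again.
     */
    static SLIST_HEAD(, exec_args_kva) exec_args_kva_freelist;
    static struct mtx exec_args_kva_mtx;

    static void
    exec_prealloc_args_kva(void *arg __unused)
    {
            struct exec_args_kva *argkva;
            u_int i;

            SLIST_INIT(&exec_args_kva_freelist);
            mtx_init(&exec_args_kva_mtx, "exec args kva", NULL, MTX_DEF);
            for (i = 0; i < exec_map_entries; i++) {
                    argkva = malloc(sizeof(*argkva), M_TEMP, M_WAITOK);
                    argkva->addr = kmap_alloc_wait(exec_map,
                        exec_map_entry_size);
                    SLIST_INSERT_HEAD(&exec_args_kva_freelist, argkva, next);
            }
    }
    SYSINIT(exec_args_kva, SI_SUB_EXEC, SI_ORDER_ANY,
        exec_prealloc_args_kva, NULL);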

markj edited edge metadata.

Preallocate exec args KVA during boot

Cache the corresponding map entries in a list protected by a mutex.
When entries are freed, vm_object_madvise() ensures that the backing
pages are marked clean and can be reclaimed quickly. The mutex also
synchronizes sleep/wakeup when the cache is empty, and a single-entry
per-CPU cache often lets us avoid the global free list entirely.

  • Use pmap_advise() to clear MD reference bits
  • Fix casts
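
A sketch of the mutex-protected slow path behind the per-CPU cache described in this update, pairing with the exec_alloc_args_kva() sketch earlier in the thread (names remain hypothetical):

    /*
     * Sketch: the global free list, used when this CPU's one-entry
     * cache is empty.  The mutex doubles as the sleep/wakeup lock
     * when every preallocated range is in use.
     */
    static struct exec_args_kva *
    exec_args_kva_slow_alloc(void)
    {
            struct exec_args_kva *argkva;

            mtx_lock(&exec_args_kva_mtx);
            while ((argkva = SLIST_FIRST(&exec_args_kva_freelist)) == NULL)
                    (void)mtx_sleep(&exec_args_kva_freelist,
                        &exec_args_kva_mtx, 0, "execkva", 0);
            SLIST_REMOVE_HEAD(&exec_args_kva_freelist, next);
            mtx_unlock(&exec_args_kva_mtx);
            return (argkva);
    }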
kib edited edge metadata.

I am somewhat curious whether we still boot on very large modern machines with the i386 kernel, using either the normal or the PAE pmap. I know that the stock PAE kernel has edge issues with 16G configs. I do not suggest that you actually test this.

sys/kern/kern_exec.c
1346 ↗(On Diff #23504)

Why do you need to manually preallocate the object? A default object will be created automatically on the first page fault.

If you are trying to save vm_map lock upgrades in the page faults, IMO it is not worth the effort. There is quite a small number of exec_map entries, and they are instantiated quite rapidly.

This revision is now accepted and ready to land. Jan 1 2017, 10:53 AM
sys/kern/kern_exec.c
1346 ↗(On Diff #23504)

The object is actually otherwise created by the vm_map_lookup() call below. Here I'm trying to ensure that each KVA range gets its own object: vm_map_insert() will coalesce the ranges into a single entry if no object is specified, so all the exec map entries would belong to the same object, and thus the vm_object_madvise() calls would all acquire the same write lock.

I think I'm trying overly hard to avoid global locks in exec_free_args_kva(), and I'm not succeeding anyway: pmap_advise() acquires the kernel pmap mutex, and vm_page_advise() acquires the inactive queue lock to requeue each page. For now I think I'll remove the vm_object preallocation and just use vm_map_madvise() instead of pmap_advise()+vm_object_madvise(). vm_map_madvise(MADV_FREE) requires only the map read lock, and vm_object_madvise() can be optimized somewhat for my use case. For instance, I believe it only needs the write lock for OBJT_SWAP objects, and it can avoid a bunch of vm_page_lookup() calls on non-resident pages in the case that there's no backing object.
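
A sketch of the resulting free path, again with the hypothetical names used throughout this thread: MADV_FREE is applied through vm_map_madvise(), which takes only the map read lock, and the range is then parked in the per-CPU slot or returned to the global list with a wakeup:

    /*
     * Sketch: mark the backing pages clean, then try the one-entry
     * per-CPU cache; on collision, fall back to the global list and
     * wake one waiter sleeping in the slow path.
     */
    static void
    exec_free_args_kva(void *cookie)
    {
            struct exec_args_kva *argkva;
            vm_offset_t base;

            argkva = cookie;
            base = argkva->addr;

            /* Needs only the exec_map read lock. */
            vm_map_madvise(exec_map, base, base + exec_map_entry_size,
                MADV_FREE);
            if (!atomic_cmpset_ptr(
                (uintptr_t *)DPCPU_PTR(exec_args_kva_cache),
                (uintptr_t)NULL, (uintptr_t)argkva)) {
                    mtx_lock(&exec_args_kva_mtx);
                    SLIST_INSERT_HEAD(&exec_args_kva_freelist, argkva, next);
                    wakeup_one(&exec_args_kva_freelist);
                    mtx_unlock(&exec_args_kva_mtx);
            }
    }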

markj edited edge metadata.

Avoid keeping track of exec map entries and just use vm_map_madvise()

This revision now requires review to proceed. Jan 2 2017, 2:23 AM
kib edited edge metadata.
This revision is now accepted and ready to land. Jan 2 2017, 8:14 AM
sys/kern/kern_exec.c
1346 ↗(On Diff #23504)

You may avoid the coalescing by creating the vm_map entries top-down.

This revision was automatically updated to reflect the committed changes.