One way that I evaluated this change was to look at the vm.stats.page counters, specifically the ratio of queue_nops to queue_ops. With this change the ratio is reduced, but it is still of the same order of magnitude. For example, after a series of buildworlds, I was seeing a ratio of 456 with a stock kernel and 363 with this patch applied.
Observe that I am not actually calling vm_page_dequeue() before calling vm_phys_free_pages() in various places. Instead, I am relying entirely on a call to vm_page_dequeue() from vm_freelist_add(). The argument for not unconditionally calling vm_page_dequeue() before calling vm_phys_free_pages() is that some fraction of the time, the dequeue can still be deferred because the page, or chunk of pages, that we are freeing will be the "right-hand" buddy to a page that is already in the buddy queues.
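For reference, a minimal sketch of where that dequeue happens under this approach (simplified from the patch; the free-list linking details are elided):

```c
static void
vm_freelist_add(struct vm_freelist *fl, vm_page_t m, int order, int pool,
    int tail)
{
	/*
	 * A page freed with a deferred dequeue still reports a queue
	 * other than PQ_NONE.  Complete the dequeue here, just before
	 * the page is linked into the buddy queues, rather than in
	 * every caller of vm_phys_free_pages().
	 */
	if (__predict_false(vm_page_astate_load(m).queue != PQ_NONE))
		vm_page_dequeue(m);
	m->order = order;
	m->pool = pool;
	/* ... insert m at the head or tail of fl[order] as before ... */
}
```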
sys/vm/vm_page.c:2553
The explanation for this call to vm_page_dequeue() is the same as that seen above in vm_page_alloc_noobj_domain().

sys/vm/vm_page.c:2553
Actually, that explanation only partially fits this case, because we are pulling pages from the default pool. Nonetheless, we are allocating a chunk of pages, and the second, third, etc. pages in that chunk may still have pending dequeues.
I speculate that the main source of the additional queue_nops is partially populated reservations: when a reservation has only a small number of populated pages, its popcount reaches zero and the reservation is returned to the buddy queues before the current batch of deferred dequeues, which includes the very first page of the reservation, reaches the threshold for batched processing. As a result, the vm_freelist_add() on that first page has to perform a vm_page_dequeue() itself.
To better understand the locking behavior, i.e., when a page queue lock is acquired while a free queues lock is held, I applied the following changes:
```diff
diff --git a/sys/vm/vm_phys.c b/sys/vm/vm_phys.c
index 9261e52705fe..2c8eb510dbf5 100644
--- a/sys/vm/vm_phys.c
+++ b/sys/vm/vm_phys.c
@@ -389,12 +389,26 @@ sysctl_vm_phys_locality(SYSCTL_HANDLER_ARGS)
 }
 #endif
 
+SYSCTL_NODE(_debug, OID_AUTO, counters, CTLFLAG_RD, 0, "");
+
+static COUNTER_U64_DEFINE_EARLY(dequeues);
+SYSCTL_COUNTER_U64(_debug_counters, OID_AUTO, dequeues, CTLFLAG_RD, &dequeues,
+    "");
+
+static COUNTER_U64_DEFINE_EARLY(not_queued);
+SYSCTL_COUNTER_U64(_debug_counters, OID_AUTO, not_queued, CTLFLAG_RD, &not_queued,
+    "");
+
 static void
 vm_freelist_add(struct vm_freelist *fl, vm_page_t m, int order, int pool,
     int tail)
 {
-	if (__predict_false(vm_page_astate_load(m).queue != PQ_NONE))
+	if (__predict_false(vm_page_astate_load(m).queue != PQ_NONE)) {
 		vm_page_dequeue(m);
+		counter_u64_add(dequeues, 1);
+	} else {
+		counter_u64_add(not_queued, 1);
+	}
 	m->order = order;
 	m->pool = pool;
 	if (tail)
@@ -1216,6 +1230,15 @@ vm_phys_fictitious_unreg_range(vm_paddr_t start, vm_paddr_t end)
 	free(seg, M_FICT_PAGES);
 }
 
+static COUNTER_U64_DEFINE_EARLY(calls);
+SYSCTL_COUNTER_U64(_debug_counters, OID_AUTO, calls, CTLFLAG_RD, &calls,
+    "");
+
+static COUNTER_U64_DEFINE_EARLY(pending);
+SYSCTL_COUNTER_U64(_debug_counters, OID_AUTO, pending, CTLFLAG_RD, &pending,
+    "");
+
+
 /*
  * Free a contiguous, power of two-sized set of physical pages.
  * The pool field in the first page determines the destination pool.
@@ -1256,8 +1279,11 @@ vm_phys_free_pages(vm_page_t m, int pool, int order)
 			m = vm_phys_seg_paddr_to_vm_page(seg, pa);
 		} while (order < VM_NFREEORDER - 1);
 	}
+	if (__predict_false(vm_page_astate_load(m).queue != PQ_NONE))
+		counter_u64_add(pending, 1);
 	fl = (*seg->free_queues)[pool];
 	vm_freelist_add(fl, m, order, pool, 1);
+	counter_u64_add(calls, 1);
 }
 
 #ifdef VM_FREEPOOL_LAZYINIT
```
Here are some results after hours of continuous buildworld activity:
debug.counters.pending: 7629877
debug.counters.calls: 95555240
debug.counters.not_queued: 89571797
debug.counters.dequeues: 7630034
Observations:
- Fewer than one in twelve calls to vm_freelist_add() has to perform a dequeue (7,630,034 dequeues versus 89,571,797 calls that did not need one).
- 99.998% of those dequeues (7,629,877 of 7,630,034) are performed by the call to vm_freelist_add() at the end of vm_phys_free_pages(). In other words, vm_phys_split_pages() almost never has to acquire a pagequeues lock to perform a dequeue.
- About one in 12.5 calls to vm_freelist_add() at the end of vm_phys_free_pages() (7,629,877 of 95,555,240) has to acquire a pagequeues lock to perform a dequeue. I think that this argues somewhat against unconditionally calling vm_page_dequeue() before calling vm_phys_free_pages(), as in D50333. (A stronger argument would be knowing how often the freed chunk is the "right-hand" buddy of a chunk that is already in the buddy queues, in which case we can let the delayed dequeue run its course.)
sys/vm/vm_page.h:219
This comment is once again correct, so it doesn't need to change. :-)
If I dramatically reduce the physical memory on the machine, so that reservations are rarely available, then even fewer calls to vm_freelist_add() have to perform a dequeue from a paging queue. The ratio is now about one in sixty:
debug.counters.pending: 15401029
debug.counters.calls: 918326383
debug.counters.not_queued: 914618225
debug.counters.dequeues: 15401153
I suspect that by the time UMA performs vm_page_zone_release(), any pending dequeues on the pages have completed.
Do you know what the bucket size is for the cache zones in that configuration? sysctl vm.uma.vm_pgcache.bucket_size will give the current value.
sys/vm/vm_page.c:2478
Now that the buddy allocator returns pages that are definitely not on the page queues, we can remove the vm_page_dequeue() calls in vm_page_alloc_noobj*.

sys/vm/vm_page.c:3980
Does it make sense to go even further and inline this check into callers, to avoid function call overhead in the common case? Oh, I see you did that below. I'd maybe add a helper function or macro and use it in vm_page_alloc* as well, in which case this branch hint might not be worthwhile.
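A minimal sketch of the kind of helper being suggested here (the name is hypothetical and not part of the patch):

```c
/*
 * Hypothetical helper: complete a deferred dequeue only when the page
 * still appears to be on a page queue, so that callers which already
 * know the page cannot be queued pay nothing beyond the inlined check.
 */
static __inline void
vm_page_dequeue_if_queued(vm_page_t m)
{
	if (vm_page_astate_load(m).queue != PQ_NONE)
		vm_page_dequeue(m);
}
```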
sys/vm/vm_page.h:219
Oops :)

sys/vm/vm_page.h:232
In the past I've brought up that I'd like to expand plinks by an extra 8 bytes, so that UMA can embed a struct uma_slab there on 64-bit systems when 1) the page is allocated via uma_small_alloc(), and 2) the number of items per slab isn't larger than 32. A number of zones would benefit from this, particularly the ABD zone used in the ZFS ARC to allocate page-sized file data buffers. Do you have any particular thoughts or objections there?
sys/vm/vm_page.h:232
My preference would be to use the object field to provide the extra space.
sys/vm/vm_page.h:232
I don't think that will work by itself. The page daemon's scan assumes that the field is either null or a valid pointer to a VM object. In particular, the use of vm_page_dequeue() in vm_page_alloc* doesn't ensure that the page daemon isn't concurrently processing the page and, specifically, trying to lock its VM object. In practice such a race should be very rare, since the page daemon typically will have observed that PGA_DEQUEUE is set and skipped the page before attempting to lock its object.
sys/vm/vm_page.h:232
Do similar objections apply to the pindex field?
sys/vm/vm_page.h:232
I think flipping the order of the pindex and object fields would solve this particular problem, and I can't think of any other reason overloading the pindex field would be a problem. I'll try writing a patch based on that observation.
sys/vm/vm_page.c:3980
The compiler is already inlining vm_page_dequeue() throughout vm_page.c, so I am less inclined to add a helper.

sys/vm/vm_page.h:232
@dougm should be suggesting that you make the pindex the first field in the struct. :-) The generic pctrie code expects that, and so the conversion of the vm_radix code to use the generic pctrie code required extra instructions to be generated because that is currently not the case.
Did you figure out why killpg4.sh is hanging? I tried it locally with your patch and the test never terminates, but that's because I defined testuser=root (which I shouldn't do), and the test contains the loop "while [ `ps -U$testuser | wc -l` -gt 1 ]; do sleep 2; done", whose condition never becomes false for root. I don't see any sign of a problem in the kernel.
sys/vm/vm_phys.c:406
I think this deserves a comment explaining how we try to preserve batching of dequeues.
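One possible shape for such a comment, restating the reasoning from earlier in this review (wording is illustrative only):

```c
	/*
	 * Pages freed with a pending (deferred) dequeue are not dequeued
	 * by the callers of vm_phys_free_pages().  The dequeue is instead
	 * completed here, when the page becomes the first page of a chunk
	 * entering the buddy queues.  A freed page that coalesces as the
	 * "right-hand" buddy of a chunk already in the buddy queues never
	 * takes this path, so its deferred dequeue can still be processed
	 * as part of a batch.
	 */
```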
It always hangs for me. It has always hung for me. I haven't tried to diagnose it. I'm only ever running the tests as root. If I'm supposed to run the tests some other way, I'm not aware of it.
Before running stress2 tests you have to run make in the test suite, and as part of that you are prompted to provide a value for the testuser variable. That gets saved in the file /tmp/stress2.d/$(hostname). If the saved value of testuser is "root", killpg4 will never terminate. You're supposed to ensure that testuser is a non-root user.
If that's your regular user id, then the result is presumably the same. The killpg4.sh test assumes that nothing else on the system is running as that user. So, the problem isn't limited to "root" as I claimed earlier.
It is, and by picking instead the name of a user who hasn't had access for at least a decade, I was able to run killpg4.sh to completion. Thanks.
sys/vm/vm_page.c:2478
So, should vm_phys_alloc_npages(), recognizing that it is doing pool switching, handle that, so that the dequeue can be dropped here?
sys/vm/vm_page.c:2478
Handling it in vm_phys_alloc_npages() would give less time for batching to take effect: when pages are allocated by vm_page_zone_import(), they do not need to be dequeued right away. That import function is just filling a per-CPU cache of free pages for later allocation. Handling it there would also increase the amount of work done under the vm_phys_free mutex, which we try to avoid. So far I don't see much of an advantage to that approach.

sys/vm/vm_page.h:232
It turns out that the idea is harder than I thought. With pindex moved to offset 0, we have 24 bytes to work with, which isn't enough: the size of a uma_slab is always larger than 24 bytes. If we instead used a bitmap with 4-byte limbs, then that'd be enough only for a non-INVARIANTS kernel. I'd really prefer that this new method of storing the slab work with both INVARIANTS and non-INVARIANTS kernels, though. I'm not yet sure how best to handle it. A few thoughts:
sys/vm/vm_page.h:232
I tried implementing this, and the patches below survive some light testing:
Yes please. Many of the inline accessors in vm_page.h, such as vm_page_all_valid(), will change after the listq removal, and various 3rd party kernel modules make use of them.
BTW, did you consider only marking the page for freeing if it is on a page queue, and doing the real free when processing the batch?
I think this could work as well, but it is more complex to implement than this approach. It could also have some undesirable side effects: for example, after vm_object_terminate_pages() frees all of the pages, a small number of them could be left in a batch queue. Then vm_object_terminate() breaks all reservations belonging to the object, so if those pages belonged to reservations, we can't free full 2MB chunks. This could contribute to physical memory fragmentation.
Yes. I came to a similar conclusion as @markj did about it, which is not to say that we shouldn't try it.