One way that I evaluated this change was to look at the vm.stats.page counters, specifically the ratio of queue_nops to queue_ops. With this change the ratio is reduced, but it is still of the same order of magnitude. For example, after a series of buildworlds, I was seeing a ratio of 456 with a stock kernel and 363 with this patch applied.
Observe that I am not actually calling vm_page_dequeue() before calling vm_phys_free_pages() in various places. Instead, I am relying entirely on a call to vm_page_dequeue() from vm_freelist_add(). The argument for not unconditionally calling vm_page_dequeue() before calling vm_phys_free_pages() is that some fraction of the time, the dequeue can still be deferred because the page, or chunk of pages, that we are freeing will be the "right-hand" buddy to a page that is already in the buddy queues.
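For reference, a minimal sketch of where that dequeue happens under this approach (simplified from the patch; the free-list linking details are elided):

```c
static void
vm_freelist_add(struct vm_freelist *fl, vm_page_t m, int order, int pool,
    int tail)
{
	/*
	 * A page freed with a deferred dequeue still reports a queue
	 * other than PQ_NONE.  Complete the dequeue here, just before
	 * the page is linked into the buddy queues, rather than in
	 * every caller of vm_phys_free_pages().
	 */
	if (__predict_false(vm_page_astate_load(m).queue != PQ_NONE))
		vm_page_dequeue(m);
	m->order = order;
	m->pool = pool;
	/* ... insert m at the head or tail of fl[order] as before ... */
}
```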
sys/vm/vm_page.c:2553
The explanation for this call to vm_page_dequeue() is the same as that seen above in vm_page_alloc_noobj_domain().

sys/vm/vm_page.c:2553
Actually, that explanation only partially fits this case, because we are pulling pages from the default pool. Nonetheless, we are allocating a chunk of pages, and the second, third, etc. pages in that chunk may still have pending dequeues.
I speculate that the main source of the additional queue_nops is partially populated reservations: when a reservation has only a small number of populated pages, its popcount reaches zero and the reservation is returned to the buddy queues before the current batch of deferred dequeues, which includes the very first page of the reservation, reaches the threshold for batched processing. As a result, the vm_freelist_add() on that first page has to perform a vm_page_dequeue() itself.
To better understand the locking behavior, i.e., when a page queue lock is acquired while a free queues lock is held, I applied the following changes:
```diff
diff --git a/sys/vm/vm_phys.c b/sys/vm/vm_phys.c
index 9261e52705fe..2c8eb510dbf5 100644
--- a/sys/vm/vm_phys.c
+++ b/sys/vm/vm_phys.c
@@ -389,12 +389,26 @@ sysctl_vm_phys_locality(SYSCTL_HANDLER_ARGS)
 }
 #endif
 
+SYSCTL_NODE(_debug, OID_AUTO, counters, CTLFLAG_RD, 0, "");
+
+static COUNTER_U64_DEFINE_EARLY(dequeues);
+SYSCTL_COUNTER_U64(_debug_counters, OID_AUTO, dequeues, CTLFLAG_RD, &dequeues,
+    "");
+
+static COUNTER_U64_DEFINE_EARLY(not_queued);
+SYSCTL_COUNTER_U64(_debug_counters, OID_AUTO, not_queued, CTLFLAG_RD, &not_queued,
+    "");
+
 static void
 vm_freelist_add(struct vm_freelist *fl, vm_page_t m, int order, int pool,
     int tail)
 {
-	if (__predict_false(vm_page_astate_load(m).queue != PQ_NONE))
+	if (__predict_false(vm_page_astate_load(m).queue != PQ_NONE)) {
 		vm_page_dequeue(m);
+		counter_u64_add(dequeues, 1);
+	} else {
+		counter_u64_add(not_queued, 1);
+	}
 	m->order = order;
 	m->pool = pool;
 	if (tail)
@@ -1216,6 +1230,15 @@ vm_phys_fictitious_unreg_range(vm_paddr_t start, vm_paddr_t end)
 	free(seg, M_FICT_PAGES);
 }
 
+static COUNTER_U64_DEFINE_EARLY(calls);
+SYSCTL_COUNTER_U64(_debug_counters, OID_AUTO, calls, CTLFLAG_RD, &calls,
+    "");
+
+static COUNTER_U64_DEFINE_EARLY(pending);
+SYSCTL_COUNTER_U64(_debug_counters, OID_AUTO, pending, CTLFLAG_RD, &pending,
+    "");
+
+
 /*
  * Free a contiguous, power of two-sized set of physical pages.
  * The pool field in the first page determines the destination pool.
@@ -1256,8 +1279,11 @@ vm_phys_free_pages(vm_page_t m, int pool, int order)
 			m = vm_phys_seg_paddr_to_vm_page(seg, pa);
 		} while (order < VM_NFREEORDER - 1);
 	}
+	if (__predict_false(vm_page_astate_load(m).queue != PQ_NONE))
+		counter_u64_add(pending, 1);
 	fl = (*seg->free_queues)[pool];
 	vm_freelist_add(fl, m, order, pool, 1);
+	counter_u64_add(calls, 1);
 }
 
 #ifdef VM_FREEPOOL_LAZYINIT
```
Here are some results after hours of continuous buildworld activity:
debug.counters.pending: 7629877
debug.counters.calls: 95555240
debug.counters.not_queued: 89571797
debug.counters.dequeues: 7630034
Observations:
- Fewer than one in twelve calls to vm_freelist_add() has to perform a dequeue (7,630,034 dequeues versus 89,571,797 calls that did not need one).
- 99.998% of those dequeues (7,629,877 of 7,630,034) are performed by the call to vm_freelist_add() at the end of vm_phys_free_pages(). In other words, vm_phys_split_pages() almost never has to acquire a pagequeues lock to perform a dequeue.
- About one in 12.5 calls to vm_freelist_add() at the end of vm_phys_free_pages() (7,629,877 of 95,555,240) has to acquire a pagequeues lock to perform a dequeue. I think that this argues somewhat against unconditionally calling vm_page_dequeue() before calling vm_phys_free_pages(), as in D50333. (A stronger argument would be knowing how often the freed chunk is the "right-hand" buddy of a chunk that is already in the buddy queues, in which case we can let the delayed dequeue run its course.)
sys/vm/vm_page.h:219
This comment is once again correct, so it doesn't need to change. :-)
If I dramatically reduce the physical memory on the machine, so that reservations are rarely available, then even fewer calls to vm_freelist_add() have to perform a dequeue from a paging queue. The ratio is now about one in sixty:
debug.counters.pending: 15401029
debug.counters.calls: 918326383
debug.counters.not_queued: 914618225
debug.counters.dequeues: 15401153
I suspect that by the time UMA performs vm_page_zone_release(), any pending dequeues on the pages have completed.
Do you know what the bucket size is for the cache zones in that configuration? sysctl vm.uma.vm_pgcache.bucket_size will give the current value.
sys/vm/vm_page.c:2478
Now that the buddy allocator returns pages that are definitely not on the page queues, we can remove the vm_page_dequeue() calls in vm_page_alloc_noobj*.

sys/vm/vm_page.c:3980
Does it make sense to go even further and inline this check into callers, to avoid function call overhead in the common case? Oh, I see you did that below. I'd maybe add a helper function or macro and use it in vm_page_alloc* as well, in which case this branch hint might not be worthwhile.
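A minimal sketch of the kind of helper being suggested here (the name is hypothetical and not part of the patch):

```c
/*
 * Hypothetical helper: complete a deferred dequeue only when the page
 * still appears to be on a page queue, so that callers which already
 * know the page cannot be queued pay nothing beyond the inlined check.
 */
static __inline void
vm_page_dequeue_if_queued(vm_page_t m)
{
	if (vm_page_astate_load(m).queue != PQ_NONE)
		vm_page_dequeue(m);
}
```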
sys/vm/vm_page.h:219
Oops :)

sys/vm/vm_page.h:232
In the past I've brought up that I'd like to expand plinks by an extra 8 bytes, so that UMA can embed a struct uma_slab there on 64-bit systems when 1) the page is allocated via uma_small_alloc(), and 2) the number of items per slab isn't larger than 32. A number of zones would benefit from this, particularly the ABD zone used in the ZFS ARC to allocate page-sized file data buffers. Do you have any particular thoughts or objections there?
sys/vm/vm_page.h:232
My preference would be to use the object field to provide the extra space.
sys/vm/vm_page.h:232
I don't think that will work by itself. The page daemon's scan assumes that the field is either null or a valid pointer to a VM object. In particular, the use of vm_page_dequeue() in vm_page_alloc* doesn't ensure that the page daemon isn't concurrently processing the page and, specifically, trying to lock its VM object. In practice such a race should be very rare, since the page daemon typically will have observed that PGA_DEQUEUE is set and skipped the page before attempting to lock its object.
sys/vm/vm_page.h:232
Do similar objections apply to the pindex field?
sys/vm/vm_page.h:232
I think flipping the order of the pindex and object fields would solve this particular problem, and I can't think of any other reason overloading the pindex field would be a problem. I'll try writing a patch based on that observation.
sys/vm/vm_page.c:3980
The compiler is already inlining vm_page_dequeue() throughout vm_page.c, so I am less inclined to add a helper.

sys/vm/vm_page.h:232
@dougm should be suggesting that you make the pindex the first field in the struct. :-) The generic pctrie code expects that, and so the conversion of the vm_radix code to use the generic pctrie code required extra instructions to be generated because that is currently not the case.
Did you figure out why killpg4.sh is hanging? I tried it locally with your patch and the test never terminates, but that's because I defined testuser=root (which I shouldn't do), and the test contains the loop "while [ `ps -U$testuser | wc -l` -gt 1 ]; do sleep 2; done", whose condition never becomes false for root. I don't see any sign of a problem in the kernel.
sys/vm/vm_phys.c:406
I think this deserves a comment explaining how we try to preserve batching of dequeues.
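One possible shape for such a comment, restating the reasoning from earlier in this review (wording is illustrative only):

```c
	/*
	 * Pages freed with a pending (deferred) dequeue are not dequeued
	 * by the callers of vm_phys_free_pages().  The dequeue is instead
	 * completed here, when the page becomes the first page of a chunk
	 * entering the buddy queues.  A freed page that coalesces as the
	 * "right-hand" buddy of a chunk already in the buddy queues never
	 * takes this path, so its deferred dequeue can still be processed
	 * as part of a batch.
	 */
```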
It always hangs for me. It has always hung for me. I haven't tried to diagnose it. I'm only ever running the tests as root. If I'm supposed to run the tests some other way, I'm not aware of it.
Before running stress2 tests you have to run make in the test suite, and as part of that you are prompted to provide a value for the testuser variable. That gets saved in the file /tmp/stress2.d/$(hostname). If the saved value of testuser is "root", killpg4 will never terminate. You're supposed to ensure that testuser is a non-root user.
If that's your regular user id, then the result is presumably the same. The killpg4.sh test assumes that nothing else on the system is running as that user. So, the problem isn't limited to "root" as I claimed earlier.
It is, and by picking instead the name of a user who hasn't had access for at least a decade, I was able to run killpg4.sh to completion. Thanks.
sys/vm/vm_page.c:2478
So, should vm_phys_alloc_npages(), recognizing that it is doing pool switching, handle that, so that the dequeue can be dropped here?
sys/vm/vm_page.c:2478
Handling it in vm_phys_alloc_npages() would give less time for batching to take effect: when pages are allocated by vm_page_zone_import(), they do not need to be dequeued right away. That import function is just filling a per-CPU cache of free pages for later allocation. Handling it there would also increase the amount of work done under the vm_phys_free mutex, which we try to avoid. So far I don't see much of an advantage to that approach.

sys/vm/vm_page.h:232
It turns out that the idea is harder than I thought. With pindex moved to offset 0, we have 24 bytes to work with, which isn't enough: the size of a uma_slab is always larger than 24 bytes. If we instead used a bitmap with 4-byte limbs, then that'd be enough only for a non-INVARIANTS kernel. I'd really prefer that this new method of storing the slab work with both INVARIANTS and non-INVARIANTS kernels, though. I'm not yet sure how best to handle it. A few thoughts:
sys/vm/vm_page.h:232
I tried implementing this, and the patches below survive some light testing:
Yes please. Many of the inline accessors in vm_page.h, such as vm_page_all_valid(), will change after the listq removal, and various 3rd party kernel modules make use of them.
BTW, did you consider only marking the page for freeing if it is on a page queue, and doing the real free when processing the batch?
I think this could work as well, but it is more complex to implement than this approach. It could also have some undesirable side effects: for example, after vm_object_terminate_pages() frees all of the pages, a small number of them could be left in a batch queue. Then vm_object_terminate() breaks all reservations belonging to the object, so if those pages belonged to reservations, we can't free full 2MB chunks. This could contribute to physical memory fragmentation.
Yes. I came to a similar conclusion as @markj did about it, which is not to say that we shouldn't try it.