vm_phys: Ensure that pages are removed from pagequeues before freeing
Needs Review · Public

Authored by markj on May 13 2025, 5:26 PM.

Details

Reviewers
alc
dougm
kib
Summary

The buddy allocator currently uses the listq linkage in struct vm_page,
but we'd like to use plinks.q instead. However, pages are lazily
dequeued and we currently rely on vm_page_alloc* to call
vm_page_dequeue() to complete the removal.

Make dequeuing slightly less lazy without losing the benefits of
batching dequeues in the common case: ensure that freed pages are
dequeued before they are entered into the physical memory allocator. In
the common case, pages are freed to the per-CPU caches or to a
reservation rather than directly to the buddy allocator, and those
mechanisms don't require overwriting plinks, so we still have some
opportunity to batch.

Add vm_page_dequeue() in the appropriate places, and remove some
existing calls in places where we know the page came from the buddy
allocator and thus isn't present on the page queues (note that
vm_page_alloc_check() verifies this). This requires a small
pessimization in vm_reserv_depopulate().
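
To make the intended ordering concrete, here is a minimal userspace model of the invariant the patch establishes. It is a sketch only: the toy_page, toy_pglist, and toy_* names are invented for illustration and are not the kernel's vm_page machinery. The point is simply that a page's (possibly deferred) dequeue is completed before its queue linkage is handed to the buddy allocator, while frees that land in a per-CPU cache or a reservation can remain lazy.

/* Minimal model: complete the dequeue before the linkage is reused. */
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/queue.h>

struct toy_page {
	TAILQ_ENTRY(toy_page) plinks;	/* pagequeue linkage, reused by the buddy freelist */
	bool	onqueue;		/* dequeue may still be deferred while this is set */
};

TAILQ_HEAD(toy_pglist, toy_page);

static struct toy_pglist pagequeue = TAILQ_HEAD_INITIALIZER(pagequeue);
static struct toy_pglist buddy_freelist = TAILQ_HEAD_INITIALIZER(buddy_freelist);

/* Complete a pending dequeue; a no-op if the page was never enqueued. */
static void
toy_dequeue(struct toy_page *m)
{
	if (m->onqueue) {
		TAILQ_REMOVE(&pagequeue, m, plinks);
		m->onqueue = false;
	}
}

/* Entering the buddy allocator overwrites plinks, so dequeue first. */
static void
toy_buddy_free(struct toy_page *m)
{
	toy_dequeue(m);
	TAILQ_INSERT_HEAD(&buddy_freelist, m, plinks);
}

int
main(void)
{
	struct toy_page m = { .onqueue = false };

	/* The page was mapped and placed on a paging queue at some point. */
	TAILQ_INSERT_TAIL(&pagequeue, &m, plinks);
	m.onqueue = true;

	/* Freeing it to the buddy allocator must not corrupt the pagequeue. */
	toy_buddy_free(&m);
	assert(TAILQ_EMPTY(&pagequeue));
	printf("dequeue completed before buddy free\n");
	return (0);
}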

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped
Build Status
Buildable 64394
Build 61278: arc lint + arc unit

Event Timeline

markj requested review of this revision. May 13 2025, 5:26 PM
This revision is now accepted and ready to land. May 13 2025, 11:38 PM

I think that there is another edge case that isn't being handled. Similar to my comment about the vm_page_dequeue after vm_reserv_alloc_page, suppose that a page is allocated from a reservation via vm_page_alloc(), validated, mapped, added to a paging queue, and then later freed back to the reservation. Now, suppose that the reservation is broken and the recently-freed page is passed to the buddy allocator. The lazy dequeue may not have completed yet. I think that reservation breaking will need to perform the vm_page_dequeue on each of the pages being passed to the buddy allocator.

sys/vm/vm_page.c
2112

I just want to write down why this dequeue is needed: in general, none of the pages within a reservation that was just allocated from the buddy allocator will still be in a paging queue. However, a page from the reservation may have been previously allocated via vm_page_alloc(), validated, mapped, added to a paging queue, and then later freed back to the reservation. Much later, that same page is reallocated from the reservation via vm_page_alloc(). At the time of reallocation, the lazy dequeue that was initiated when the page was previously freed back to the reservation might not have been completed yet, so we need to perform vm_page_dequeue() here.

Is that what you had in mind?

sys/vm/vm_reserv.c
484–485

I fear that this is going to be painful performance-wise. There are going to be a lot of data cache misses, because this is going to oftentimes touch page structures that were never actually allocated from the reservation.

sys/vm/vm_page.c
2112

Yes, exactly. I added a comment to this effect.

sys/vm/vm_reserv.c
484–485

Indeed, but I don't see much of an alternative. The immediate penalty might not be large because of prefetching, but it would still pollute the data cache.

As an alternative to this whole approach, I think it'd be worth trying to lift this batching up to the few functions that ultimately free most managed pages, that is, vm_object_page_remove() and vm_object_terminate(). In particular, with the per-CPU caches, we already rely on consumers to free large batches of pages at once; otherwise the batched dequeuing wouldn't be effective, as UMA per-CPU caches are LIFO.

  • Make sure to dequeue pages in vm_reserv_break() as well.
  • Add a comment explaining why we dequeue when allocating from a reservation.
This revision now requires review to proceed. Fri, May 23, 8:22 PM
sys/vm/vm_reserv.c
484–485

I've been thinking about this, and I do see an alternative. Observe something that you did in vm_page_zone_release(): specifically, you potentially acquire a page queues lock while holding a free queues lock. This is not something that we've done before, but doing so is not entirely out of the question, unless it were to dramatically increase free queues lock contention.

Now, instead of iterating over all of the pages of the reservation, performing vm_page_dequeue(), suppose that we only dequeue the very first page, since that is the only page whose plinks.q will be used immediately. Until, or unless, the just-freed reservation is subject to vm_phys_split_pages(), the rest of the pages can still have their dequeues deferred, and hopefully by the time of a future split, all of the pending dequeues will have been completed. So, in short, the simple solution appears to be adding a call to vm_page_dequeue() to vm_freelist_add(). Most of the time, it will be a no-op.

I seem to have a working version that passes all of the stress tests up to killpg4.sh, which never terminates when Doug tests it. That version also takes the liberty of removing the listq field from struct vm_page. I am going to post it shortly.
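
For illustration, here is a minimal self-contained model of the vm_freelist_add() suggestion above. It uses invented toy_* names rather than the real sys/vm/vm_phys.c code or its locking: the deferred dequeue is completed only for the page whose linkage is about to be used, and it is a no-op when the page was never enqueued or its deferred dequeue has already run.

/* Model: complete the dequeue only when a page is linked into a freelist. */
#include <stdbool.h>
#include <stdio.h>
#include <sys/queue.h>

struct toy_page {
	TAILQ_ENTRY(toy_page) plinks;	/* pagequeue linkage, reused by freelists */
	bool	onqueue;		/* deferred dequeue still pending */
};

TAILQ_HEAD(toy_pglist, toy_page);

static struct toy_pglist pagequeue = TAILQ_HEAD_INITIALIZER(pagequeue);
static struct toy_pglist freelist = TAILQ_HEAD_INITIALIZER(freelist);

/*
 * Only the page actually being linked needs its dequeue completed; for
 * pages that were never enqueued, or whose deferred dequeue already ran,
 * this is a no-op.
 */
static void
toy_freelist_add(struct toy_page *m)
{
	if (m->onqueue) {
		TAILQ_REMOVE(&pagequeue, m, plinks);
		m->onqueue = false;
	}
	TAILQ_INSERT_HEAD(&freelist, m, plinks);
}

int
main(void)
{
	struct toy_page resv[4];

	/* Pretend every page of a broken reservation is still enqueued. */
	for (int i = 0; i < 4; i++) {
		TAILQ_INSERT_TAIL(&pagequeue, &resv[i], plinks);
		resv[i].onqueue = true;
	}

	/*
	 * The buddy allocator links only the first page of the freed block;
	 * the other pages' dequeues stay deferred until a future split.
	 */
	toy_freelist_add(&resv[0]);
	printf("head dequeued: %d, others still deferred: %d %d %d\n",
	    !resv[0].onqueue, resv[1].onqueue, resv[2].onqueue, resv[3].onqueue);
	return (0);
}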

sys/vm/vm_reserv.c
484–485

I admit that I didn't think much about potentially increasing contention for the free page lock. The page queue locks are supposed to be leaf locks, so it's "ok" to acquire them with the free page lock held, but it's not necessarily a good idea.

That said, I think that so long as the UMA cache zone bucket size is larger than the page queue batch size, this vm_page_dequeue() call is unlikely to acquire the page queue locks very often. This is because we call vm_page_dequeue_deferred() before freeing a page to the cache, and in steady-state operation the per-CPU free bucket needs to fill up before we release it back to the buddy allocator.
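
A small model of that steady-state argument follows, with made-up sizes: TOY_BATCH and TOY_BUCKET are not the kernel's constants, and the real per-CPU batch and bucket machinery is more involved. Each free first submits its page to a deferred-dequeue batch that is flushed when it fills, and only a full bucket is released toward the buddy allocator. When the bucket size is a multiple of the batch size, every released page has already been dequeued; in general, at most one partial batch can still be pending at release time.

/* Toy model: deferred-dequeue batch size vs. per-CPU free bucket size. */
#include <stdbool.h>
#include <stdio.h>

#define	TOY_BATCH	4	/* made-up stand-in for the pagequeue batch size */
#define	TOY_BUCKET	8	/* made-up stand-in for the UMA bucket size */

struct toy_page {
	bool	onqueue;	/* dequeue not yet completed */
};

static struct toy_page *batch[TOY_BATCH];
static int batch_cnt;
static struct toy_page *bucket[TOY_BUCKET];
static int bucket_cnt;

/* Flush the deferred-dequeue batch: one pagequeue lock round trip. */
static void
toy_batch_flush(void)
{
	for (int i = 0; i < batch_cnt; i++)
		batch[i]->onqueue = false;
	batch_cnt = 0;
}

/* Analogue of a deferred dequeue: queue the page, flush when the batch fills. */
static void
toy_dequeue_deferred(struct toy_page *m)
{
	batch[batch_cnt++] = m;
	if (batch_cnt == TOY_BATCH)
		toy_batch_flush();
}

/* Analogue of releasing a full per-CPU bucket toward the buddy allocator. */
static void
toy_bucket_release(void)
{
	int pending = 0;

	for (int i = 0; i < bucket_cnt; i++)
		if (bucket[i]->onqueue)
			pending++;	/* these would need the pagequeue lock */
	printf("%d of %d released pages still had a pending dequeue\n",
	    pending, bucket_cnt);
	bucket_cnt = 0;
}

int
main(void)
{
	struct toy_page pages[TOY_BUCKET];

	for (int i = 0; i < TOY_BUCKET; i++) {
		pages[i].onqueue = true;	/* page sits on a paging queue */
		toy_dequeue_deferred(&pages[i]);
		bucket[bucket_cnt++] = &pages[i];
	}
	/* Only a full bucket makes it back toward the buddy allocator. */
	toy_bucket_release();
	return (0);
}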