
vm_page: Allow PG_NOFREE pages to be freed
ClosedPublic

Authored by markj on Mar 23 2025, 10:56 PM.
Details

Summary

There is at least one case where we need to support freeing such pages:
kmem_malloc() might need to allocate multiple pages to satisfy a NOFREE
allocation, which it implements by calling vm_page_alloc() in a loop. If it
fails part-way through, it needs to free the already-allocated pages, but
this was illegal.
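
The failure path has roughly this shape (a sketch, not the verbatim kernel source; the ma[] array and the KERN_NO_SPACE return are illustrative assumptions):

```
for (i = 0; i < npages; i++) {
	m = vm_page_alloc(object, pindex + i, pflags); /* pflags may carry VM_ALLOC_NOFREE */
	if (m == NULL) {
		/*
		 * Unwind: the pages obtained so far must be freed,
		 * which this revision makes legal for NOFREE pages.
		 */
		while (i-- > 0)
			vm_page_free(ma[i]);
		return (KERN_NO_SPACE);
	}
	ma[i] = m;
}
```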

Convert the bump allocator to a linked list; (ab)use the pindex field of
each page in the list to store the number of contiguous pages in the
block. (Originally I added a new plinks member for this purpose, but
it's not safe to use that until after vm_page_dequeue() is called due to
lazy page queue removal.) Then, modify vm_page_free() to support freeing
pages to this list.
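
As a self-contained illustration of the scheme (a userspace mock, not the kernel's vm_page code; in the kernel the queue is per-domain and lock-protected):

```
#include <stddef.h>
#include <stdio.h>

/* Mock of struct vm_page: only the fields the queue scheme needs. */
struct page {
	size_t pindex;		/* while queued: # of contiguous pages in block */
	struct page *next;	/* stand-in for the kernel's list linkage */
};

static struct page *nofree_head;

/* Queue a contiguous block of count pages, e.g. a freshly imported chunk. */
static void
nofree_queue_block(struct page *first, size_t count)
{
	first->pindex = count;
	first->next = nofree_head;
	nofree_head = first;
}

/* Pop one page by carving it off the front of the head block. */
static struct page *
nofree_alloc(void)
{
	struct page *m = nofree_head;

	if (m == NULL)
		return (NULL);	/* empty: the caller imports a new chunk */
	if (m->pindex > 1) {
		/*
		 * Physically contiguous pages have adjacent page structures,
		 * as in the kernel's vm_page array, so m + 1 is the next page.
		 */
		struct page *rest = m + 1;

		rest->pindex = m->pindex - 1;
		rest->next = m->next;
		nofree_head = rest;
	} else
		nofree_head = m->next;
	return (m);
}

/* Freeing a NOFREE page simply requeues it as a single-page block. */
static void
nofree_free(struct page *m)
{
	nofree_queue_block(m, 1);
}

int
main(void)
{
	struct page chunk[4];
	struct page *a, *b;

	nofree_queue_block(chunk, 4);
	a = nofree_alloc();	/* chunk[0] */
	b = nofree_alloc();	/* chunk[1] */
	nofree_free(a);		/* rollback path: a goes back on the list */
	printf("head block size: %zu\n", nofree_head->pindex);	/* prints 1 */
	(void)b;
	return (0);
}
```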

Reported by: syzbot+93bc9edd2d0f22ae426a@syzkaller.appspotmail.com
Fixes: a8693e89e3e4 ("vm: Introduce vm_page_alloc_nofree_domain")

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

sys/vm/vm_page.c
4148

With this change, everyone can free NOFREE memory. Is there some way to have a function specifically for freeing NOFREE memory so that this assertion can remain in place for freeing "normal" memory?

Do we want some counter to report how many non-freeable pages were freed? It is some kind of leak, I must admit.

The change looks reasonable, but I'm personally not that keen on introducing the ability to free NOFREE pages.
IMO the key issue here is that vm_page_alloc_nofree should not be used to allocate a batch of pages.

One alternative I can think of is to extend the existing NOFREE allocator with a batched allocation routine, similar to the one proposed in D47050.
This would allow us to easily roll the allocator state back if an allocation fails without having to introduce a way of freeing NOFREE pages.

sys/vm/vm_page.c
2651

Is the __noinline here on purpose or a leftover from a debugging session?

> The change looks reasonable, but I'm personally not that keen on introducing the ability to free NOFREE pages.
> IMO the key issue here is that vm_page_alloc_nofree should not be used to allocate a batch of pages.

I don't think that's necessarily true: suppose that we want to allocate a single NOFREE page and some other resource in that order, and the resource allocation fails.
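
For example (a hypothetical fragment; resource_alloc() stands in for whatever is allocated second):

```
m = vm_page_alloc(obj, pidx, VM_ALLOC_NOFREE);
if (m == NULL)
	return (ENOMEM);
res = resource_alloc();		/* hypothetical second resource */
if (res == NULL) {
	vm_page_free(m);	/* only legal once NOFREE pages are freeable */
	return (ENOMEM);
}
```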

> One alternative I can think of is to extend the existing NOFREE allocator with a batched allocation routine, similar to the one proposed in D47050.
> This would allow us to easily roll the allocator state back if an allocation fails without having to introduce a way of freeing NOFREE pages.

Indeed, I was wondering if the batch allocator interface would be useful here. One problem with that interface as implemented is that it can return partial allocations, but for NOFREE allocations in this particular case one really wants everything or nothing. Is that true for all NOFREE allocations which ask for multiple pages?

Note though that there are really two changes here: one to change the internals of the NOFREE queue such that pages can be put back into the queue, and one to enable freeing of NOFREE pages via vm_page_free(). The first one is desirable in any case, I think, as the batch allocator would need a way to undo a partial allocation. For instance, if the queue contains two free pages, and a caller asks for three, and the allocation of the third page fails, we'd need to put the first two back. In the current scheme that particular case is doable but a bit complicated; I think having an explicit queue is simpler.
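
In terms of the mock sketched under the summary, that two-of-three case becomes trivial to undo:

```
/* Using the mock above: the caller wants 3 pages, the queue holds 2. */
struct page *got[3];
int n;

for (n = 0; n < 3; n++)
	if ((got[n] = nofree_alloc()) == NULL)
		break;			/* a fresh chunk must be imported */
if (n < 3) {				/* suppose the import also fails */
	while (n-- > 0)
		nofree_free(got[n]);	/* explicit queue makes undo trivial */
}
```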

In D49480#1128459, @kib wrote:

> Do we want some counter to report how many non-freeable pages were freed? It is some kind of leak, I must admit.

Well, the pages are still available for future NOFREE allocations. Or do you think we should go as far as reconstructing the NOFREE import chunks and freeing them back to the buddy allocator when possible?

sys/vm/vm_page.c
2651

It's intentional but admittedly somewhat unrelated to the rest of the patch: I just wanted to ensure that it doesn't get inlined into its callers, as it's rarely executed but its callers are frequently executed. Same with vm_page_free_nofree().

4148

Yes, I think that's reasonable.

> Well, the pages are still available for future NOFREE allocations. Or do you think we should go as far as reconstructing the NOFREE import chunks and freeing them back to the buddy allocator when possible?

No, I mean that we should inform the user that some pages are kept in this queue and cannot be reused for anything but another nofree alloc. If a situation arises where the sum of all allocations does not equal the total in some significant way, the user should see where the pages were accounted for.

> I don't think that's necessarily true: suppose that we want to allocate a single NOFREE page and some other resource in that order, and the resource allocation fails.

Hm, I think we are talking about two separate problems here - allocating batches of nonfreeable pages vs. properly allocating co-dependent resources when one of them is nonfreeable. The first one should be "all or nothing", and a good rule of thumb for the latter would be to grab the nonfreeable resource last.

> Indeed, I was wondering if the batch allocator interface would be useful here. One problem with that interface as implemented is that it can return partial allocations, but for NOFREE allocations in this particular case one really wants everything or nothing. Is that true for all NOFREE allocations which ask for multiple pages?

Right, the queue of pages simplifies state rollback and is worth implementing.
As for the batching interface, the implementation is not set in stone and we could easily enforce an "all or nothing" policy for allocations with VM_ALLOC_NOFREE set. kmem_malloc appears to be the only place where more than one NOFREE page might get allocated, so I think it would be better to enforce the "all or nothing" policy and avoid allowing NOFREE pages to be freed.

sys/vm/vm_page.c
2651

Got it, thanks.

> Hm, I think we are talking about two separate problems here - allocating batches of nonfreeable pages vs. properly allocating co-dependent resources when one of them is nonfreeable. The first one should be "all or nothing", and a good rule of thumb for the latter would be to grab the nonfreeable resource last.

I just meant to respond to your comment regarding the "key issue". Specifically, the problem can arise even for a single page allocation. There is a small bug of this kind already: suppose UMA's keg_alloc_slab() allocates a NOFREE page. Then, it invokes the zone's uk_init routine on all items in the slab. This uk_init routine is allowed to fail (though I don't know of any actual examples of this), in which case we will free the slab, which would result in UMA freeing a nofree page. More generally, it's not always practical to order resource allocations such that the nonfreeable resource comes last.
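
A heavily simplified rendering of that path (alloc_slab_page(), item_at(), and free_slab() are invented names for illustration, though uk_init and uk_size mirror real uma_keg fields):

```
slab = alloc_slab_page(keg, flags);	/* backing page may be PG_NOFREE */
for (i = 0; i < nitems; i++) {
	if (keg->uk_init(item_at(slab, i), keg->uk_size, flags) != 0) {
		/*
		 * Item init failed: the whole slab is torn down, which
		 * frees its backing page even if that page is NOFREE.
		 */
		free_slab(keg, slab);
		return (NULL);
	}
}
```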

> Right, the queue of pages simplifies state rollback and is worth implementing.
> As for the batching interface, the implementation is not set in stone and we could easily enforce an "all or nothing" policy for allocations with VM_ALLOC_NOFREE set. kmem_malloc appears to be the only place where more than one NOFREE page might get allocated, so I think it would be better to enforce the "all or nothing" policy and avoid allowing NOFREE pages to be freed.

That would fix this particular bug, but the problem in UMA is still there. I tend to think that we'll generally need some way to free these pages, even though doing so feels wrong since we call them "nofree".

> That would fix this particular bug, but the problem in UMA is still there. I tend to think that we'll generally need some way to free these pages, even though doing so feels wrong since we call them "nofree".

The problem sounds to me like a usual transactional approach. We need to split the KPI into two parts: 1. allocate a potentially non-freeable page; 2. commit to the state where the allocated page is not freeable. I am not sure how helpful it would be, since nofree allocation is special.
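
A hypothetical shape for such a two-phase KPI (these functions do not exist in the tree; the names are invented for illustration):

```
/* Phase 1: allocate; the page is still freeable if later setup fails. */
vm_page_t	vm_page_alloc_nofree_prepare(vm_object_t obj,
		    vm_pindex_t pindex, int req);

/*
 * Phase 2: commit; marks the page PG_NOFREE, after which it may never
 * be freed.  Called only once every co-dependent resource has been
 * acquired successfully.
 */
void		vm_page_alloc_nofree_commit(vm_page_t m);
```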

In D49480#1130357, @kib wrote:

> The problem sounds to me like a usual transactional approach. We need to split the KPI into two parts: 1. allocate a potentially non-freeable page; 2. commit to the state where the allocated page is not freeable. I am not sure how helpful it would be, since nofree allocation is special.

IMO this approach is too complex. There is no actual harm in freeing nofree pages. The only purpose of the mechanism is to reduce fragmentation, and it is done on a best-effort basis, e.g., if a VM_NOFREE_IMPORT_ORDER import fails, we simply fall back to the regular buddy allocator (whereas we could try successively smaller imports).

> That would fix this particular bug, but the problem in UMA is still there. I tend to think that we'll generally need some way to free these pages, even though doing so feels wrong since we call them "nofree".

Right, restructuring UMA's allocation call chain would likely be painful.
We should certainly revisit this discussion if a second use case of this kind comes up, but I think your patch is perfectly fine until then.

sys/vm/vm_pagequeue.h
250

Minor nit, just so we're consistent with the naming scheme.

markj marked 4 inline comments as done.
  • Fix the vmd field name.
  • Count pages in the nofreeq that aren't yet allocated.
sys/vm/vm_page.c
4148

After the intervening discussion, I think it'd be better to simply let vm_page_free() handle NOFREE pages. I've changed the way the nofree counter works such that we track the number of pages in vmd_nofreeq, so we can see whether an abnormal number of pages is queued there.

kib added inline comments.
sys/vm/vm_page.c
2696

Maybe, instead of 'doing so', say something like 'put the pages aside for other nofree allocs'.

This revision is now accepted and ready to land. Apr 6 2025, 4:10 AM

Make a comment clearer.

This revision now requires review to proceed. Apr 7 2025, 12:04 PM
This revision is now accepted and ready to land. Apr 7 2025, 10:38 PM
This revision was automatically updated to reflect the committed changes.