
vm_page: Allow PG_NOFREE pages to be freed
ClosedPublic

Authored by markj on Mar 23 2025, 10:56 PM.
Details

Summary

There is at least one case where we need to support freeing such pages:
kmem_malloc() might need to allocate multiple pages to satisfy a NOFREE
allocation, which it implements by calling vm_page_alloc() in a loop. If it
fails part-way through, it needs to free the already-allocated pages, but
this was illegal.
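
The failure path has roughly this shape (a sketch, not the verbatim kernel source; the ma[] array and the KERN_NO_SPACE return are illustrative assumptions):

```
for (i = 0; i < npages; i++) {
	m = vm_page_alloc(object, pindex + i, pflags); /* pflags may carry VM_ALLOC_NOFREE */
	if (m == NULL) {
		/*
		 * Unwind: the pages obtained so far must be freed,
		 * which this revision makes legal for NOFREE pages.
		 */
		while (i-- > 0)
			vm_page_free(ma[i]);
		return (KERN_NO_SPACE);
	}
	ma[i] = m;
}
```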

Convert the bump allocator to a linked list; (ab)use the pindex field of
each page in the list to store the number of contiguous pages in the
block. (Originally I added a new plinks member for this purpose, but
it's not safe to use that until after vm_page_dequeue() is called due to
lazy page queue removal.) Then, modify vm_page_free() to support freeing
pages to this list.
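
As a self-contained illustration of the scheme (a userspace mock, not the kernel's vm_page code; in the kernel the queue is per-domain and lock-protected):

```
#include <stddef.h>
#include <stdio.h>

/* Mock of struct vm_page: only the fields the queue scheme needs. */
struct page {
	size_t pindex;		/* while queued: # of contiguous pages in block */
	struct page *next;	/* stand-in for the kernel's list linkage */
};

static struct page *nofree_head;

/* Queue a contiguous block of count pages, e.g. a freshly imported chunk. */
static void
nofree_queue_block(struct page *first, size_t count)
{
	first->pindex = count;
	first->next = nofree_head;
	nofree_head = first;
}

/* Pop one page by carving it off the front of the head block. */
static struct page *
nofree_alloc(void)
{
	struct page *m = nofree_head;

	if (m == NULL)
		return (NULL);	/* empty: the caller imports a new chunk */
	if (m->pindex > 1) {
		/*
		 * Physically contiguous pages have adjacent page structures,
		 * as in the kernel's vm_page array, so m + 1 is the next page.
		 */
		struct page *rest = m + 1;

		rest->pindex = m->pindex - 1;
		rest->next = m->next;
		nofree_head = rest;
	} else
		nofree_head = m->next;
	return (m);
}

/* Freeing a NOFREE page simply requeues it as a single-page block. */
static void
nofree_free(struct page *m)
{
	nofree_queue_block(m, 1);
}

int
main(void)
{
	struct page chunk[4];
	struct page *a, *b;

	nofree_queue_block(chunk, 4);
	a = nofree_alloc();	/* chunk[0] */
	b = nofree_alloc();	/* chunk[1] */
	nofree_free(a);		/* rollback path: a goes back on the list */
	printf("head block size: %zu\n", nofree_head->pindex);	/* prints 1 */
	(void)b;
	return (0);
}
```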

Reported by: syzbot+93bc9edd2d0f22ae426a@syzkaller.appspotmail.com
Fixes: a8693e89e3e4 ("vm: Introduce vm_page_alloc_nofree_domain")

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

sys/vm/vm_page.c
4148

With this change, everyone can free NOFREE memory. Is there some way to have a function specifically for freeing NOFREE memory so that this assertion can remain in place for freeing "normal" memory?

Do we want some counter to report how many non-freeable pages were freed? It is some kind of leak, I must admit.

The change looks reasonable, but I'm personally not that keen on introducing the ability to free NOFREE pages.
IMO the key issue here is that vm_page_alloc_nofree should not be used to allocate a batch of pages.

One alternative I can think of is to extend the existing NOFREE allocator with a batched allocation routine, similar to the one proposed in D47050.
This would allow us to easily roll the allocator state back if an allocation fails without having to introduce a way of freeing NOFREE pages.

sys/vm/vm_page.c
2651

Is the __noinline here on purpose or a leftover from a debugging session?

> The change looks reasonable, but I'm personally not that keen on introducing the ability to free NOFREE pages.
> IMO the key issue here is that vm_page_alloc_nofree should not be used to allocate a batch of pages.

I don't think that's necessarily true: suppose that we want to allocate a single NOFREE page and some other resource in that order, and the resource allocation fails.
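
For example (a hypothetical fragment; resource_alloc() stands in for whatever is allocated second):

```
m = vm_page_alloc(obj, pidx, VM_ALLOC_NOFREE);
if (m == NULL)
	return (ENOMEM);
res = resource_alloc();		/* hypothetical second resource */
if (res == NULL) {
	vm_page_free(m);	/* only legal once NOFREE pages are freeable */
	return (ENOMEM);
}
```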

> One alternative I can think of is to extend the existing NOFREE allocator with a batched allocation routine, similar to the one proposed in D47050.
> This would allow us to easily roll the allocator state back if an allocation fails without having to introduce a way of freeing NOFREE pages.

Indeed, I was wondering if the batch allocator interface would be useful here. One problem with that interface as implemented is that it can return partial allocations, but for NOFREE allocations in this particular case one really wants everything or nothing. Is that true for all NOFREE allocations which ask for multiple pages?

Note though that there are really two changes here: one to change the internals of the NOFREE queue such that pages can be put back into the queue, and one to enable freeing of NOFREE pages via vm_page_free(). The first one is desirable in any case, I think, as the batch allocator would need a way to undo a partial allocation. For instance, if the queue contains two free pages, and a caller asks for three, and the allocation of the third page fails, we'd need to put the first two back. In the current scheme that particular case is doable but a bit complicated; I think having an explicit queue is simpler.
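
In terms of the mock sketched under the summary, that two-of-three case becomes trivial to undo:

```
/* Using the mock above: the caller wants 3 pages, the queue holds 2. */
struct page *got[3];
int n;

for (n = 0; n < 3; n++)
	if ((got[n] = nofree_alloc()) == NULL)
		break;			/* a fresh chunk must be imported */
if (n < 3) {				/* suppose the import also fails */
	while (n-- > 0)
		nofree_free(got[n]);	/* explicit queue makes undo trivial */
}
```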

In D49480#1128459, @kib wrote:

> Do we want some counter to report how many non-freeable pages were freed? It is some kind of leak, I must admit.

Well, the pages are still available for future NOFREE allocations. Or do you think we should go as far as reconstructing the NOFREE import chunks and freeing them back to the buddy allocator when possible?

sys/vm/vm_page.c
2651

It's intentional but admittedly somewhat unrelated to the rest of the patch: I just wanted to ensure that it doesn't get inlined into its callers, as it's rarely executed but its callers are frequently executed. Same with vm_page_free_nofree().

4148

Yes, I think that's reasonable.

> Well, the pages are still available for future NOFREE allocations. Or do you think we should go as far as reconstructing the NOFREE import chunks and freeing them back to the buddy allocator when possible?

No, I mean that we should inform the user that some pages are kept in this queue and cannot be reused for anything but another nofree alloc. If a situation arises where the sum of all allocations does not equal the total in some significant way, the user should see where the pages were accounted for.

> I don't think that's necessarily true: suppose that we want to allocate a single NOFREE page and some other resource in that order, and the resource allocation fails.

Hm, I think we are talking about two separate problems here - allocating batches of nonfreeable pages vs. properly allocating co-dependent resources when one of them is nonfreeable. The first one should be "all or nothing", and a good rule of thumb for the latter would be to grab the nonfreeable resource last.

> Indeed, I was wondering if the batch allocator interface would be useful here. One problem with that interface as implemented is that it can return partial allocations, but for NOFREE allocations in this particular case one really wants everything or nothing. Is that true for all NOFREE allocations which ask for multiple pages?

Right, the queue of pages simplifies state rollback and is worth implementing.
As for the batching interface, the implementation is not set in stone and we could easily enforce an "all or nothing" policy for allocations with VM_ALLOC_NOFREE set. kmem_malloc appears to be the only place where more than one NOFREE page might get allocated, so I think it would be better to enforce the "all or nothing" policy and avoid allowing NOFREE pages to be freed.

sys/vm/vm_page.c
2651

Got it, thanks.

> Hm, I think we are talking about two separate problems here - allocating batches of nonfreeable pages vs. properly allocating co-dependent resources when one of them is nonfreeable. The first one should be "all or nothing", and a good rule of thumb for the latter would be to grab the nonfreeable resource last.

I just meant to respond to your comment regarding the "key issue". Specifically, the problem can arise even for a single page allocation. There is a small bug of this kind already: suppose UMA's keg_alloc_slab() allocates a NOFREE page. Then, it invokes the zone's uk_init routine on all items in the slab. This uk_init routine is allowed to fail (though I don't know of any actual examples of this), in which case we will free the slab, which would result in UMA freeing a nofree page. More generally, it's not always practical to order resource allocations such that the nonfreeable resource comes last.
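
A heavily simplified rendering of that path (alloc_slab_page(), item_at(), and free_slab() are invented names for illustration, though uk_init and uk_size mirror real uma_keg fields):

```
slab = alloc_slab_page(keg, flags);	/* backing page may be PG_NOFREE */
for (i = 0; i < nitems; i++) {
	if (keg->uk_init(item_at(slab, i), keg->uk_size, flags) != 0) {
		/*
		 * Item init failed: the whole slab is torn down, which
		 * frees its backing page even if that page is NOFREE.
		 */
		free_slab(keg, slab);
		return (NULL);
	}
}
```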

> Right, the queue of pages simplifies state rollback and is worth implementing.
> As for the batching interface, the implementation is not set in stone and we could easily enforce an "all or nothing" policy for allocations with VM_ALLOC_NOFREE set. kmem_malloc appears to be the only place where more than one NOFREE page might get allocated, so I think it would be better to enforce the "all or nothing" policy and avoid allowing NOFREE pages to be freed.

That would fix this particular bug, but the problem in UMA is still there. I tend to think that we'll generally need some way to free these pages, even though doing so feels wrong since we call them "nofree".

> That would fix this particular bug, but the problem in UMA is still there. I tend to think that we'll generally need some way to free these pages, even though doing so feels wrong since we call them "nofree".

The problem sounds to me like a usual transactional approach. We need to split the KPI into two parts: 1. allocate a potentially non-freeable page; 2. commit to the state where the allocated page is not freeable. I am not sure how helpful it would be, since nofree allocation is special.
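
A hypothetical shape for such a two-phase KPI (these functions do not exist in the tree; the names are invented for illustration):

```
/* Phase 1: allocate; the page is still freeable if later setup fails. */
vm_page_t	vm_page_alloc_nofree_prepare(vm_object_t obj,
		    vm_pindex_t pindex, int req);

/*
 * Phase 2: commit; marks the page PG_NOFREE, after which it may never
 * be freed.  Called only once every co-dependent resource has been
 * acquired successfully.
 */
void		vm_page_alloc_nofree_commit(vm_page_t m);
```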

In D49480#1130357, @kib wrote:

> The problem sounds to me like a usual transactional approach. We need to split the KPI into two parts: 1. allocate a potentially non-freeable page; 2. commit to the state where the allocated page is not freeable. I am not sure how helpful it would be, since nofree allocation is special.

IMO this approach is too complex. There is no actual harm in freeing nofree pages. The only purpose of the mechanism is to reduce fragmentation, and it is done on a best-effort basis, e.g., if a VM_NOFREE_IMPORT_ORDER import fails, we simply fall back to the regular buddy allocator (whereas we could try successively smaller imports).

> That would fix this particular bug, but the problem in UMA is still there. I tend to think that we'll generally need some way to free these pages, even though doing so feels wrong since we call them "nofree".

Right, restructuring UMA's allocation call chain would likely be painful.
We should certainly revisit this discussion if a second use case of this kind comes up, but I think your patch is perfectly fine until then.

sys/vm/vm_pagequeue.h
250

Minor nit, just so we're consistent with the naming scheme.

markj marked 4 inline comments as done.
  • Fix the vmd field name.
  • Count pages in the nofreeq that aren't yet allocated.
sys/vm/vm_page.c
4148

After the intervening discussion, I think it'd be better to simply let vm_page_free() handle NOFREE pages. I've changed the way the nofree counter works such that we track the number of pages in vmd_nofreeq, so we can see whether an abnormal number of pages is queued there.

kib added inline comments.
sys/vm/vm_page.c
2696

Maybe, instead of 'doing so', say something like 'put the pages aside for other nofree allocs'.

This revision is now accepted and ready to land. Apr 6 2025, 4:10 AM

Make a comment clearer.

This revision now requires review to proceed. Apr 7 2025, 12:04 PM
This revision is now accepted and ready to land. Apr 7 2025, 10:38 PM
This revision was automatically updated to reflect the committed changes.