
vm_reserv: add reservation-aware UMA small_alloc
Abandoned (Public)

Authored by bnovkov on May 1 2024, 1:54 PM.
Details

Reviewers
kib
markj
alc
Summary

This patch adds a reservation-aware replacement for uma_small_alloc.

The vm_reserv_uma_small_{alloc, free} routines use unmanaged reservations to allocate 0-order pages for UMA zones.
Those reservations are placed in dedicated, per-domain UMA small_alloc queues that keep track of partially populated reservations and reservations used for NOFREE allocations.

The allocator falls back to vm_page_alloc_noobj_domain in case of memory pressure.
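
For orientation, here is a minimal sketch of that allocation path. The queue structure and helpers (uma_small_queues, rvq_partpop, vm_reserv_alloc_unmanaged(), vm_reserv_pop_page(), vm_reserv_is_full()) are illustrative names, not the patch's actual symbols; only vm_page_alloc_noobj_domain() is the real fallback mentioned above.

```c
/* Hypothetical sketch of the reservation-aware small_alloc path. */
static vm_page_t
vm_reserv_uma_small_alloc(int domain, int req, bool nofree)
{
	struct rvq *rvq;
	vm_reserv_t rv;
	vm_page_t m;

	/* NOFREE zones draw from a separate queue so their pages stay together. */
	rvq = nofree ? &uma_small_queues[domain].nofree :
	    &uma_small_queues[domain].regular;

	mtx_lock(&rvq->rvq_lock);
	/* Prefer a partially populated unmanaged reservation. */
	rv = TAILQ_FIRST(&rvq->rvq_partpop);
	if (rv == NULL) {
		/* Grab a fresh 2MB chunk and track it in this queue. */
		rv = vm_reserv_alloc_unmanaged(domain);
		if (rv == NULL) {
			mtx_unlock(&rvq->rvq_lock);
			/* Memory pressure: fall back to the regular path. */
			return (vm_page_alloc_noobj_domain(domain, req));
		}
		TAILQ_INSERT_HEAD(&rvq->rvq_partpop, rv, rv_link);
	}
	/* Carve one 0-order page out of the reservation. */
	m = vm_reserv_pop_page(rv);
	if (vm_reserv_is_full(rv))
		TAILQ_REMOVE(&rvq->rvq_partpop, rv, rv_link);
	mtx_unlock(&rvq->rvq_lock);
	return (m);
}
```

The corresponding free routine would return the page to its reservation and requeue that reservation as partially populated.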

Test Plan

All changes in this patch series were tested on amd64 using a bhyve vm.
No errors or panics were encountered while running vm-related stress2 tests for several hours.

I can confirm that everything builds properly for all other architectures.
I'm currently running smoke tests on those architectures and will update the revision if anything pops up.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

Removed stray UMA code from diff.

sys/vm/vm_reserv.c
1762

vm_reserv_free_page_noobj() is used before it is declared (at least when D45043 is applied too).

Update vm_reserv_uma_small_alloc to retry if someone else filled up a partial reservation.

I spent some time evaluating this patch with four different metrics (thanks @markj for the first two):

  • Number of reservations with at least one NOFREE page
  • Number of reservations with at least one UMA slab, excluding NOFREE slabs
  • vm.pde.promotions
  • vm.pde.mappings

The first two metrics track how scattered these page types are in memory. I've gathered them by iterating through all UMA zones and their slab lists.
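
For reference, the counting conceptually boils down to mapping every slab page to its 2MB-aligned physical chunk and counting distinct chunks. A rough sketch, with hypothetical set helpers (chunk_seen(), chunk_mark()) standing in for the ad-hoc kernel code actually used:

```c
/* Hypothetical chunk sets and counters. */
static struct chunk_set nofree_set, slab_set;
static int nofree_chunks, slab_chunks;

static void
count_chunk(vm_paddr_t slab_pa, bool nofree)
{
	/* NBPDR is the 2MB superpage size on amd64. */
	vm_paddr_t chunk = slab_pa & ~(vm_paddr_t)(NBPDR - 1);

	if (nofree) {
		if (!chunk_seen(&nofree_set, chunk)) {
			chunk_mark(&nofree_set, chunk);
			nofree_chunks++;	/* metric 1: NOFREE chunks */
		}
	} else {
		if (!chunk_seen(&slab_set, chunk)) {
			chunk_mark(&slab_set, chunk);
			slab_chunks++;		/* metric 2: regular slab chunks */
		}
	}
}
```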

I ran buildkernel four times in a row on a 10GB bhyve VM, and sampled the metrics before each run.

Results without patches:

no. NOFREE | no. SLAB | vm.pde.promotions | vm.pde.mappings
        42 |      737 |                 2 |               0
       469 |     1442 |             13849 |           14089
       476 |     1491 |             27384 |           28317
       485 |     1530 |             40779 |           42534

Results with patches:

no. NOFREE | no. SLAB | vm.pde.promotions | vm.pde.mappings
        13 |      677 |                 1 |               1
        22 |     1441 |             14066 |           14147
        22 |     1463 |             27803 |           28456
        22 |     1468 |             41460 |           42752

The number of reservations "tainted" by NOFREE pages is drastically smaller with this patch. It also remains stable throughout the test, whereas the unpatched version grows steadily.
The same seems to hold for the second column, which grows a lot less in the patched version, indicating that we are packing slab pages more efficiently.
The values for vm.pde.promotions and vm.pde.mappings are also slightly larger when compared to the unpatched version, although I'll re-evaluate this a couple more times to make sure.


Addendum - @netchild has been running -CURRENT with this patch series for some time with no issues encountered so far.
@netchild please correct me if I've misunderstood something, but the machine in question is a laptop with 8GB of RAM and it is primarily used to run various jailed services (~30 jails) and build packages.

I'd really love to get some feedback about the patch since it went through a couple of rounds of testing and appears to be both stable and effective at containing NOFREE and regular slabs.

> Addendum - @netchild has been running -CURRENT with this patch series for some time with no issues encountered so far.
> @netchild please correct me if I've misunderstood something, but the machine in question is a laptop with 8GB of RAM and it is primarily used to run various jailed services (~30 jails) and build packages.

Dual socket, 24 CPU (incl. HT), (old) Xeon server. 72 GB RAM. About 30 jails (not counting the poudriere jails). Various services, mysql, postgresql, redis, dns, ldap, various instances of php and nginx, various java services, ...

I haven't yet updated to the most recent version of this patch.

I've seen a lot of instabilities in other places (fixed based upon my reports), but the current BE I have is rock solid with this change (plus D45043 and D45045).

sys/vm/vm_reserv.c
223
225

Maybe move the sentence about the 'two reservation queues' before listing them.

1459

We try to keep lock names under 6 symbols (for top).

1637

The rv initialization seems to be unneeded.

1643

Why not just 'return (rv);'? The found label is not used for anything else.

1723

Does it make sense to move this code (starting from the 'Initialize page' comment) into a helper, placed and also used in vm_page.c?

bnovkov added inline comments.
sys/vm/vm_reserv.c
1643

You're right, that makes more sense.

1723

Well, I think it does, and I had contemplated doing so but ultimately gave up because I didn't want to drag out the patches too much.
But I agree that this is worth doing; I'll try to land it in a separate revision.
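
For reference, the shared helper being discussed could look roughly like this. The name vm_page_alloc_noobj_init() is hypothetical, and the body only approximates the "Initialize page" block of vm_page_alloc_noobj_domain() from memory, so field details may be slightly off:

```c
/*
 * Hypothetical helper hoisted into vm_page.c so that both the regular and
 * the reservation-backed noobj paths can share the page initialization.
 */
void
vm_page_alloc_noobj_init(vm_page_t m, int req)
{
	int flags;

	flags = (req & VM_ALLOC_NODUMP) != 0 ? PG_NODUMP : 0;
	/* Consumers should not rely on a useful default pindex value. */
	m->pindex = 0xdeadc0dedeadc0de;
	m->flags = (m->flags & PG_ZERO) | flags;
	m->a.flags = 0;
	m->oflags = VPO_UNMANAGED;
	m->busy_lock = VPB_UNBUSIED;
	if ((req & VM_ALLOC_WIRED) != 0) {
		vm_wire_add(1);
		m->ref_count = 1;
	}
	if ((req & VM_ALLOC_ZERO) != 0 && (m->flags & PG_ZERO) == 0)
		pmap_zero_page(m);
}
```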

kib added inline comments.
sys/vm/vm_reserv.c
1643
This revision is now accepted and ready to land. Jun 16 2024, 8:39 PM

Reading this and the related patches, I am a bit uncertain as to why this code needs to live in the vm_reserv layer. From my reading, UMA's small page allocator is effectively using this code to maintain a list of 2MB allocations from which we draw slab allocations; a separate list is used to segregate slabs for UMA_ZONE_NOFREE zones.

What problems does this vm_reserv extension solve? I can see that this will segregate NOFREE slabs, which is a good thing, but we don't need to modify the vm_reserv layer for that; UMA could easily maintain a linked list of 2MB chunks from which NOFREE slabs are allocated. Once a chunk is exhausted, try to allocate a new one and put it at the head of the list, otherwise fall back to regular 4KB slab allocations. The NOFREE case is simple and doesn't need a lot of machinery.
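
A rough sketch of the UMA-only scheme described here; all names (uma_nofree_alloc(), nofree_head, nofree_chunk_alloc(), ...) are illustrative rather than existing code:

```c
/* Hypothetical per-domain state for NOFREE chunks. */
struct nofree_chunk {
	LIST_ENTRY(nofree_chunk) nc_link;
	void		*nc_base;	/* direct-map KVA of the 2MB chunk */
	u_int		 nc_used;	/* pages handed out so far */
};
static LIST_HEAD(, nofree_chunk) nofree_head[MAXMEMDOM];
static struct mtx nofree_lock[MAXMEMDOM];

static void *
uma_nofree_alloc(int domain, int req)
{
	struct nofree_chunk *nc;
	void *p;

	mtx_lock(&nofree_lock[domain]);
	nc = LIST_FIRST(&nofree_head[domain]);
	if (nc == NULL || nc->nc_used == NBPDR / PAGE_SIZE) {
		/*
		 * The current chunk is exhausted; try to allocate a fresh
		 * 2MB chunk (e.g. via vm_page_alloc_noobj_contig()) and put
		 * it at the head of the list.
		 */
		nc = nofree_chunk_alloc(domain, req);
		if (nc == NULL) {
			mtx_unlock(&nofree_lock[domain]);
			return (NULL);	/* caller falls back to a 4KB slab */
		}
		LIST_INSERT_HEAD(&nofree_head[domain], nc, nc_link);
	}
	/* NOFREE slabs are never freed, so we only ever march forward. */
	p = (char *)nc->nc_base + ptoa(nc->nc_used);
	nc->nc_used++;
	mtx_unlock(&nofree_lock[domain]);
	return (p);
}
```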

The change also segregates the remaining UMA slab allocations, but this is also provided by another mechanism: VM_FREEPOOL_DIRECT. UMA is the main consumer of pages from that freepool; I'd guess that page table page allocations are a distant second. So, we already have a soft mechanism to ensure that UMA's 4KB slabs don't fragment physical memory too much. This change introduces a stronger segregation. However,

  1. I don't see any mechanism to break unmanaged reservations if a 4KB page allocation fails. That is, the queue of partially populated unmanaged reservations might become quite large over time - how do we reclaim pages from it? Managed reservations and the freepool mechanism both allow unused 4KB pages to be repurposed, and I believe that's a necessary property. And, if you implement it for unmanaged reservations (maybe you already have and I missed it), then why is this mechanism more effective than the existing VM_FREEPOOL_DIRECT?
  2. Do we have any data suggesting that using reservations improves contiguity of non-NOFREE slabs? The fact that VM_FREEPOOL_DIRECT includes e.g., page table pages suggests that there's room for improvement, but I'm a bit skeptical that it makes a significant difference.

Apologies in advance if I missed anything while reading the diffs, but my feeling now is that we'd get most of the benefit of this patch series by just segregating NOFREE slabs as a special case, and handling that entirely within UMA.

> Reading this and the related patches, I am a bit uncertain as to why this code needs to live in the vm_reserv layer. From my reading, UMA's small page allocator is effectively using this code to maintain a list of 2MB allocations from which we draw slab allocations; a separate list is used to segregate slabs for UMA_ZONE_NOFREE zones.

> What problems does this vm_reserv extension solve? I can see that this will segregate NOFREE slabs, which is a good thing, but we don't need to modify the vm_reserv layer for that; UMA could easily maintain a linked list of 2MB chunks from which NOFREE slabs are allocated. Once a chunk is exhausted, try to allocate a new one and put it at the head of the list, otherwise fall back to regular 4KB slab allocations. The NOFREE case is simple and doesn't need a lot of machinery.

Placing these changes in vm_reserv was done mostly to reuse the existing code and abstractions. But after reading your comments I agree: in its current state, this could be moved to UMA entirely.
I'd originally meant for this to be plugged into vm_page_alloc_noobj (I've whipped up another patchset for this after reading your comments), but I guess I was too focused on the UMA case to see it all the way through.

> The change also segregates the remaining UMA slab allocations, but this is also provided by another mechanism: VM_FREEPOOL_DIRECT. UMA is the main consumer of pages from that freepool; I'd guess that page table page allocations are a distant second. So, we already have a soft mechanism to ensure that UMA's 4KB slabs don't fragment physical memory too much. This change introduces a stronger segregation. However,

> 1. I don't see any mechanism to break unmanaged reservations if a 4KB page allocation fails. That is, the queue of partially populated unmanaged reservations might become quite large over time - how do we reclaim pages from it?

We don't - thanks for catching this! I overlooked that mechanism, but adding it shouldn't be an issue.

> Managed reservations and the freepool mechanism both allow unused 4KB pages to be repurposed, and I believe that's a necessary property. And, if you implement it for unmanaged reservations (maybe you already have and I missed it), then why is this mechanism more effective than the existing VM_FREEPOOL_DIRECT?

This mechanism differs from VM_FREEPOOL_DIRECT in a few ways, but I think that the stronger segregation guarantees and the way pages are allocated are the most important differences.

Please correct me if I'm wrong, but from what I understand "regular" page allocations (i.e. not vm_page_alloc_noobj) might "steal" pages from the DIRECT freepool if the default one is empty. This cannot happen with these changes, making it less likely that the "regular" and noobj allocations mix. You'd have to explicitly break a noobj reservation and release it back to the freelists to mix the two page allocation types. This can be a good or bad thing depending on what we are aiming for, since it does provide stronger segregation for the two types but also makes repurposing the unused pages a bit more complicated.

However, I think that the biggest difference here is that these changes will prioritize filling up partially populated reservations, which should pack things more tightly than VM_FREEPOOL_DIRECT. Allocating pages using VM_FREEPOOL_DIRECT will dequeue pages from the 0-order freelist, and there's no guarantee that the queued 0-order pages come from the same reservation.

I also think that there's a minor but noteworthy performance difference with this approach - we reduce lock contention for the vm_phys freelists.

> 2. Do we have any data suggesting that using reservations improves contiguity of non-NOFREE slabs? The fact that VM_FREEPOOL_DIRECT includes e.g., page table pages suggests that there's room for improvement, but I'm a bit skeptical that it makes a significant difference.

I did some light testing and posted the results in a comment here; apologies if you've seen it already. As expected, there's a drastic change for NOFREE slabs, but the non-NOFREE slab numbers are in the same ballpark.
I've also done a few rounds of testing with a modified vm_page_alloc_noobj that first tries to allocate a page using these changes before falling back to VM_FREEPOOL_DIRECT. The numbers were more or less the same, with no notable change. However, the benchmark I'm using (buildkernel x 4) may not be a good choice for testing this thoroughly, since ARC ends up using most of the noobj pages.

> Apologies in advance if I missed anything while reading the diffs, but my feeling now is that we'd get most of the benefit of this patch series by just segregating NOFREE slabs as a special case, and handling that entirely within UMA.

Right, the way I see it we have a couple of options:

  1. Move everything into UMA and segregate NOFREE pages
  2. Same as 1. but with an additional zone flag to segregate long-lived pages (e.g. ARC)
  3. Make vm_page_alloc_noobj use these changes instead of UMA

Both VM_FREEPOOL_DIRECT and the proposed approach have their pros and cons, but I think applying the proposed approach to a few select cases could prove useful in the long run.
This is why I'm currently inclined to go with the second option, as I think it's important to segregate both NOFREE and long-lived page allocations (with ARC being a prime target for this), but I'd like to hear your thoughts before committing to anything.

>> The change also segregates the remaining UMA slab allocations, but this is also provided by another mechanism: VM_FREEPOOL_DIRECT. UMA is the main consumer of pages from that freepool; I'd guess that page table page allocations are a distant second. So, we already have a soft mechanism to ensure that UMA's 4KB slabs don't fragment physical memory too much. This change introduces a stronger segregation. However,

>> 1. I don't see any mechanism to break unmanaged reservations if a 4KB page allocation fails. That is, the queue of partially populated unmanaged reservations might become quite large over time - how do we reclaim pages from it?

> We don't - thanks for catching this! I overlooked that mechanism, but adding it shouldn't be an issue.

>> Managed reservations and the freepool mechanism both allow unused 4KB pages to be repurposed, and I believe that's a necessary property. And, if you implement it for unmanaged reservations (maybe you already have and I missed it), then why is this mechanism more effective than the existing VM_FREEPOOL_DIRECT?

> This mechanism differs from VM_FREEPOOL_DIRECT in a few ways, but I think that the stronger segregation guarantees and the way pages are allocated are the most important differences.

> Please correct me if I'm wrong, but from what I understand "regular" page allocations (i.e. not vm_page_alloc_noobj) might "steal" pages from the DIRECT freepool if the default one is empty. This cannot happen with these changes, making it less likely that the "regular" and noobj allocations mix. You'd have to explicitly break a noobj reservation and release it back to the freelists to mix the two page allocation types. This can be a good or bad thing depending on what we are aiming for, since it does provide stronger segregation for the two types but also makes repurposing the unused pages a bit more complicated.

A running system will over time need to reclaim unused memory from unmanaged, partially populated reservations. Otherwise, UMA's consumers can cause internal fragmentation within the 2MB chunks such that a significant amount of RAM may be "free" but unusable by the rest of the system. This has to be addressed somehow. We can fix that by capping the total amount of RAM used for unmanaged reservations, or by having a facility to reclaim unused memory from partially populated unmanaged reservations. The former is difficult to do in a general way since UMA's memory consumption is completely workload-dependent, and the latter means we lose the "strong" segregation that you're talking about above.

> However, I think that the biggest difference here is that these changes will prioritize filling up partially populated reservations, which should pack things more tightly than VM_FREEPOOL_DIRECT. Allocating pages using VM_FREEPOOL_DIRECT will dequeue pages from the 0-order freelist, and there's no guarantee that the queued 0-order pages come from the same reservation.

If a page is in a 0-order freelist, then its buddy is already allocated. (And that page has been in the freelist longer than all of the other free 0-order pages, so is not likely to see its buddy returned to the free lists in the near future.) vm_phys tries to import the largest possible chunk of contiguous memory into a pool when needed, so unless RAM is already very fragmented, the buddy will be allocated to another consumer of the same free pool, which in this case is likely to be UMA. It's not really obvious to me why this is objectively worse than the reservation-based scheme.
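
As a side note, the buddy relation behind this observation is just an address XOR (generic buddy-allocator math, not a quote of vm_phys, which tracks free pages by segment-relative index):

```c
/*
 * The order-0 buddy of a page differs only in the PAGE_SHIFT bit of its
 * physical address.  A free order-0 page whose buddy were also free would
 * have been coalesced into an order-1 block, so the buddy of any page that
 * sits in the order-0 freelist must be allocated.
 */
static inline vm_paddr_t
buddy_paddr(vm_paddr_t pa, int order)
{
	return (pa ^ ((vm_paddr_t)PAGE_SIZE << order));
}

/*
 * Example: buddy_paddr(0x203000, 0) == 0x202000.  If the page at 0x203000 is
 * on the order-0 freelist, then the page at 0x202000 is in use.
 */
```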

> I also think that there's a minor but noteworthy performance difference with this approach - we reduce lock contention for the vm_phys freelists.

Hmm, I don't really follow why. What kind of workload is triggering sustained contention for vm_phys freelists? Keep in mind that:

  1. vm_page_alloc_noobj() tries to allocate 0-order pages from a per-CPU cache (actually I'm skeptical that this is of much use anymore).
  2. UMA does a lot of caching specifically to avoid going to the slab allocator. During initial rampup of a workload, we might indeed see many parallel slab allocations which hit the vm_phys freelists, but after that UMA should absorb the vast majority of its allocation requests (and if it doesn't handle that well, it's a bug in UMA).
>> 2. Do we have any data suggesting that using reservations improves contiguity of non-NOFREE slabs? The fact that VM_FREEPOOL_DIRECT includes e.g., page table pages suggests that there's room for improvement, but I'm a bit skeptical that it makes a significant difference.

> I did some light testing and posted the results in a comment here; apologies if you've seen it already. As expected, there's a drastic change for NOFREE slabs, but the non-NOFREE slab numbers are in the same ballpark.
> I've also done a few rounds of testing with a modified vm_page_alloc_noobj that first tries to allocate a page using these changes before falling back to VM_FREEPOOL_DIRECT. The numbers were more or less the same, with no notable change. However, the benchmark I'm using (buildkernel x 4) may not be a good choice for testing this thoroughly, since ARC ends up using most of the noobj pages.

>> Apologies in advance if I missed anything while reading the diffs, but my feeling now is that we'd get most of the benefit of this patch series by just segregating NOFREE slabs as a special case, and handling that entirely within UMA.

> Right, the way I see it we have a couple of options:

> 1. Move everything into UMA and segregate NOFREE pages
> 2. Same as 1. but with an additional zone flag to segregate long-lived pages (e.g. ARC)
> 3. Make vm_page_alloc_noobj use these changes instead of UMA

> Both VM_FREEPOOL_DIRECT and the proposed approach have their pros and cons, but I think applying the proposed approach to a few select cases could prove useful in the long run.
> This is why I'm currently inclined to go with the second option, as I think it's important to segregate both NOFREE and long-lived page allocations (with ARC being a prime target for this), but I'd like to hear your thoughts before committing to anything.

My suggestion is to handle the easy, high-impact case first, i.e., try to fix NOFREE slab fragmentation within UMA. We already know that that alone benefits the system, and it can be done without modifying anything other than UMA.

I would be quite cautious about making assumptions about memory access patterns of something as general as the ZFS ARC. It supports many diverse workloads. ZFS has much more information about the history and future accesses of a given ARC buffer than UMA (or the page allocator) does, so to make improvements there we should leverage that information more. I think you are specifically looking at the ABD UMA zone (larger buffers are allocated via the kmem_* interface, which already makes use of superpage reservations); the ABD allocator just allocates one 4KB page at a time from UMA (see abd_alloc_chunks) - are there opportunities for it to convey more information that might help us make smarter allocation choices?

Thanks, your answers clear things up! I had an incomplete overview of the issue; I see now that it's best to keep VM_FREEPOOL_DIRECT.

> My suggestion is to handle the easy, high-impact case first, i.e., try to fix NOFREE slab fragmentation within UMA. We already know that that alone benefits the system, and it can be done without modifying anything other than UMA.

> I would be quite cautious about making assumptions about memory access patterns of something as general as the ZFS ARC. It supports many diverse workloads. ZFS has much more information about the history and future accesses of a given ARC buffer than UMA (or the page allocator) does, so to make improvements there we should leverage that information more. I think you are specifically looking at the ABD UMA zone (larger buffers are allocated via the kmem_* interface, which already makes use of superpage reservations); the ABD allocator just allocates one 4KB page at a time from UMA (see abd_alloc_chunks) - are there opportunities for it to convey more information that might help us make smarter allocation choices?

Agreed. I'm abandoning the unmanaged reservation revisions and adding a UMA-only NOFREE one. I'll look into the ABD allocator separately.

>> However, I think that the biggest difference here is that these changes will prioritize filling up partially populated reservations, which should pack things more tightly than VM_FREEPOOL_DIRECT. Allocating pages using VM_FREEPOOL_DIRECT will dequeue pages from the 0-order freelist, and there's no guarantee that the queued 0-order pages come from the same reservation.

> If a page is in a 0-order freelist, then its buddy is already allocated. (And that page has been in the freelist longer than all of the other free 0-order pages, so is not likely to see its buddy returned to the free lists in the near future.) vm_phys tries to import the largest possible chunk of contiguous memory into a pool when needed, so unless RAM is already very fragmented, the buddy will be allocated to another consumer of the same free pool, which in this case is likely to be UMA. It's not really obvious to me why this is objectively worse than the reservation-based scheme.

@markj What you say about the 0-order freelist is true. However, consider a scenario in which we have three 2MB chunks, call them A, B, and C, that are mostly allocated. Let's assume that they all have a single 4KB, 8KB, and 16KB chunk free, and that A is the oldest of the allocations and C is the youngest. The "problem" with a buddy allocator is that its next three 4KB allocations will come from the three different chunks, A, B, and C. More often than not, I do think that we would be better off if we sought to fill A first and give C more time for additional chunks to be deallocated.
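
To make the scenario concrete, here is a toy model (plain C, not vm_phys code): the order-0 freelist holds one 4KB block from each of A, B, and C, and the allocator only splits a larger block when that list is empty, so three 4KB requests land in three different chunks:

```c
#include <stdio.h>

/*
 * Toy model of the A/B/C scenario.  Each chunk has one free block at orders
 * 0 (4KB), 1 (8KB), and 2 (16KB).  A buddy allocator serves an order-0
 * request from the order-0 freelist whenever it is non-empty, and only
 * splits a larger block when it is not.
 */
struct block { char chunk; int order; };

/* Order-0 freelist, oldest entries first: A's, B's, and C's free 4KB block. */
static struct block order0[] = { {'A', 0}, {'B', 0}, {'C', 0} };
static int n0 = 3;

int
main(void)
{
	for (int i = 0; i < 3; i++) {
		/* The order-0 freelist is non-empty, so no 8KB block is split. */
		struct block b = order0[0];
		for (int j = 1; j < n0; j++)
			order0[j - 1] = order0[j];
		n0--;
		printf("4KB allocation %d comes from chunk %c\n", i + 1, b.chunk);
	}
	/*
	 * Prints A, B, C: the three allocations touch all three chunks.  A
	 * policy that filled A first would instead take A's 4KB block and
	 * then split A's 8KB block, leaving B and C untouched.
	 */
	return (0);
}
```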