
vm: Round up npages and alignment for contig reclamation
ClosedPublic

Authored by markj on Feb 25 2021, 6:05 PM.

Details

Summary

When searching for runs to reclaim, we need to ensure that the entire
run will be added to the buddy allocator as a single unit. Otherwise,
it will not be visible to vm_phys_alloc_contig().

Test Plan

Tried contigmalloc tests from stress2.

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

markj requested review of this revision. Feb 25 2021, 6:05 PM
markj added reviewers: kib, alc, dougm.
markj added a subscriber: mav.

The rounding up should be capped at the largest supported buddy list order. (For allocation requests that are larger than the largest supported order, we do the following: For each block in the largest order list, we look at its successors in the vm_page array to see if a sufficient number of them are free to satisfy the request.)
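A rough userland model of the parenthetical above, for readers unfamiliar with the large-request path. It is only a sketch: the real code walks the largest-order free list and the vm_page array, whereas this model uses a hypothetical per-block boolean array, and every `sketch_`/`SKETCH_` name and the order value are assumptions.

```c
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_MAX_ORDER   12   /* assumed largest buddy order */
#define SKETCH_BLOCK_PAGES (1UL << SKETCH_MAX_ORDER)

/*
 * block_free[i] is true when the i-th max-order block of a segment is
 * entirely free, i.e. it sits on the largest-order free list.
 * Returns the index of the first block that starts a run of free blocks
 * covering at least npages pages, or -1 if there is none.
 */
static long
sketch_find_run(const bool *block_free, size_t nblocks, unsigned long npages)
{
	/* Number of max-order blocks needed to cover the request. */
	size_t needed = (npages + SKETCH_BLOCK_PAGES - 1) / SKETCH_BLOCK_PAGES;

	if (needed == 0 || needed > nblocks)
		return (-1);
	for (size_t first = 0; first + needed <= nblocks; first++) {
		size_t i;

		/* Look at the successors of the candidate block. */
		for (i = 0; i < needed; i++) {
			if (!block_free[first + i])
				break;
		}
		if (i == needed)
			return ((long)first);
	}
	return (-1);
}
```

Because such a request is satisfied from whole max-order blocks, rounding the request beyond the largest order would not make a reclaimed run any easier to find, which is why the cap is suggested.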

Round up to at most a multiple of the maximum vm_phys chunk size.
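For illustration, the adjustment could look roughly like the following standalone sketch. It is not the committed code: the function and macro names are hypothetical, and the 4KB page size and 13 buddy orders are assumptions chosen for the example.

```c
#include <assert.h>

#define SKETCH_PAGE_SHIFT 12   /* assumed 4KB pages */
#define SKETCH_NFREEORDER 13   /* assumed number of buddy orders */

/* 1-based index of the highest set bit; 0 when x == 0. */
static int
sketch_fls(unsigned long x)
{
	int n = 0;

	while (x != 0) {
		n++;
		x >>= 1;
	}
	return (n);
}

/*
 * Round the request up to the next power of two, capped at the largest
 * buddy block, and raise the alignment (in bytes) to match, so that a
 * reclaimed run lands in the buddy allocator as whole blocks.
 */
static void
sketch_round_request(unsigned long *npages, unsigned long *alignment)
{
	unsigned long minalign;
	int order;

	assert(*npages > 0);
	order = sketch_fls(*npages - 1);
	if (order > SKETCH_NFREEORDER - 1)
		order = SKETCH_NFREEORDER - 1;
	minalign = 1UL << order;

	/* Round npages up to a multiple of minalign (a power of two). */
	*npages = (*npages + minalign - 1) & ~(minalign - 1);
	if (*alignment < (minalign << SKETCH_PAGE_SHIFT))
		*alignment = minalign << SKETCH_PAGE_SHIFT;
}
```

Under these assumptions a 3-page request with 4KB alignment becomes a 4-page request with 16KB alignment, while a 5000-page request is rounded to 8192 pages, a multiple of the 4096-page maximum block.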

sys/vm/vm_page.c
2980

I think that the rounding should be performed here, right after the KASSERT()s and before the sanity check on the number of free pages, because in the worst case vm_page_reclaim_run() will perform the rounded-up number of page allocations before it frees any of the pages from the contiguous run.

3014–3015

Rather than "To ensure that runs are not fragmented, ...", I would say, "Due to limitations of vm_phys_alloc_contig(), ..."

markj marked 2 inline comments as done.
  • Update comment per feedback, fix a typo.
  • Move the adjustment to before the check of the free page count.
This revision is now accepted and ready to land. Feb 28 2021, 9:48 PM

mav@, could you please give your hack/test case to markj@? There should be a significant reduction in the amount of scanning with this patch.

This is the patch I've used: https://people.freebsd.org/~mav/memfrag.patch . After boot, run it by setting sysctl vm.fragment, depending on how often wired pages should go. I've set it to 4. After that, try to receive something with a Chelsio NIC with MTU 9000, so that it tries to refill its receive queue, and see how the receive interrupt threads explode. I've used iSCSI target writes, which additionally may keep buffers for some time while write commands are processed.

In D28924#649076, @mav wrote:

This is the patch I've used: https://people.freebsd.org/~mav/memfrag.patch . After boot, run it by setting sysctl vm.fragment, depending on how often wired pages should go. I've set it to 4. After that, try to receive something with a Chelsio NIC with MTU 9000, so that it tries to refill its receive queue, and see how the receive interrupt threads explode. I've used iSCSI target writes, which additionally may keep buffers for some time while write commands are processed.

I tried this with a sysctl that allocates a number of 9KB clusters, since I don't have any Chelsio NICs available. I can confirm that the number of vm_phys_scan_contig() calls is vastly reduced with this patch. After memory is fragmented (no free chunks larger than 8KB available) but VPSC_NORESERV reclamations are still successful, it takes about 8s to allocate 100 clusters on a system with 64GB, vs. 2-2.5s with the patch applied.

it takes about 8s to allocate 100 clusters on a system with 64GB, vs. 2-2.5s with the patch applied.

It is good to hear, but still does not sound realistic for networking purposes. Plus my systems often have 256GB or more memory. Have you tried it together with your original optimization patch?

For cxgbe(4) purposes it would be best if allocation could fail quickly without even trying to reclaim contiguous pages, since it is able to just fall back to page-sized clusters, but as I see now even M_NOWAIT tries to reclaim once. Obviously, Intel drivers that lack that fallback suffer from failure much more, but for those we forced 4KB allocations long ago, giving up on overhead, since they were used only at 10Gb/s or less.

This reminds me of what (I think) I saw in OpenZFS on Linux: allocating memory for ABD buffers in arbitrarily sized chunks, as much as real memory fragmentation allows. On FreeBSD we now use fixed smaller page-sized chunks, which create additional management overhead. It would be good to be more flexible.

In D28924#649276, @mav wrote:

it takes about 8s to allocate 100 clusters on a system with 64GB, vs. 2-2.5s with the patch applied.

It is good to hear, but still does not sound realistic for networking purposes. Plus my systems often have 256GB or more memory. Have you tried it together with your original optimization patch?

Right, this is not expected to be a full solution to the problem. I will look more at preferentially reclaiming from the phys_segs corresponding to the default freelists, and ending the scan earlier.

I am wondering if the intent behind the current implementation is to provide a consistent runtime for reclamation. Suppose we started scanning from the beginning of physical memory and over time reclaimed more and more runs. Subsequent scans will take longer and longer since they always start from the same place. Perhaps we could maintain some cursor that gets updated after a scan and is used to mark the beginning of subsequent scans.

For cxgbe(4) purposes it would be best if allocation could fail quickly without even trying to reclaim contiguous pages, since it is able to just fall back to page-sized clusters, but as I see now even M_NOWAIT tries to reclaim once. Obviously, Intel drivers that lack that fallback suffer from failure much more, but for those we forced 4KB allocations long ago, giving up on overhead, since they were used only at 10Gb/s or less.

There was a recent commit adding M_NORECLAIM and VM_ALLOC_NORECLAIM, which has the behaviour you described. I am not sure that this is what you want: on every allocation attempt we will call vm_phys_alloc_contig() under a global (well, per-NUMA domain) lock before giving up, so it will still be expensive.

In the thread I suggested that the jumbo zones should transparently fall back to page-by-page allocation when contiguous allocations are not possible (or reclamation is too expensive). This would allow us to keep some of the advantages of jumbo clusters even when memory is fragmented. As Navdeep noted, some drivers assume that multi-page clusters are physically contiguous and this is required for DMA to some older NICs. I am not sure how best to handle this but I think it could be done if there is agreement that this is an acceptable path forward. DMA to contiguous buffers is more efficient, but is this a crucial property of the jumbo zones?

This reminds me of what (I think) I saw in OpenZFS on Linux: allocating memory for ABD buffers in arbitrarily sized chunks, as much as real memory fragmentation allows. On FreeBSD we now use fixed smaller page-sized chunks, which create additional management overhead. It would be good to be more flexible.

"management overhead" meaning that we allocate an external uma_slab per (4KB) ABD buffer? There is some WIP to embed the slab header directly in the vm_page structure for exactly this reason. Or are you referring to something else?

As Navdeep noted, some drivers assume that multi-page clusters are physically contiguous and this is required for DMA to some older NICs. I am not sure how best to handle this but I think it could be done if there is agreement that this is an acceptable path forward. DMA to contiguous buffers is more efficient, but is this a crucial property of the jumbo zones?

I think many drivers may depend on contiguous clusters of fixed size for receive buffers, though I can't really speak about it. For the transmit path I've recently started using huge non-contiguous clusters in iSCSI transmit code to dramatically reduce overhead, and aside from one bad data corruption I've fixed in cxgb(4) I see only a few ancient drivers that would have problems with that. At least nobody in the TrueNAS community has reported more problems.

I think the only safe migration path would be to create a parallel set of cluster zones for use in drivers that really can handle them on receive, and accept that all code should support them on transmit.

This reminds me of what (I think) I saw in OpenZFS on Linux: allocating memory for ABD buffers in arbitrarily sized chunks, as much as real memory fragmentation allows. On FreeBSD we now use fixed smaller page-sized chunks, which create additional management overhead. It would be good to be more flexible.

"management overhead" meaning that we allocate an external uma_slab per (4KB) ABD buffer? There is some WIP to embed the slab header directly in the vm_page structure for exactly this reason. Or are you referring to something else?

IIRC I saw both UMA's and ZFS's own iteration over the lists of 4KB chunks in profiles, but it was some time ago, so I don't remember many details.

In D28924#649289, @mav wrote:

This reminds me of what (I think) I saw in OpenZFS on Linux: allocating memory for ABD buffers in arbitrarily sized chunks, as much as real memory fragmentation allows. On FreeBSD we now use fixed smaller page-sized chunks, which create additional management overhead. It would be good to be more flexible.

"management overhead" meaning that we allocate an external uma_slab per (4KB) ABD buffer? There is some WIP to embed the slab header directly in the vm_page structure for exactly this reason. Or are you referring to something else?

IIRC I saw both UMA's and ZFS's own iteration over the lists of 4KB chunks in profiles, but it was some time ago, so I don't remember many details.

Triggered by uma_drain() I suppose?

In D28924#649289, @mav wrote:

IIRC I saw both UMA's and ZFS's own iteration over the lists of 4KB chunks in profiles, but it was some time ago, so I don't remember many details.

Triggered by uma_drain() I suppose?

Draining is another pain point I hope to get to at some point, but here I meant just allocations/frees of zillions of 4KB chunks when we are talking about 200-500GB of memory. As I said, IIRC Linux uses a variable chunk size, so the overall number of chunks in the chain is lower.

In D28924#649276, @mav wrote:

it takes about 8s to allocate 100 clusters on a system with 64GB, vs. 2-2.5s with the patch applied.

It is good to hear, but still does not sound realistic for networking purposes. Plus my systems often have 256GB or more memory. Have you tried it together with your original optimization patch?

Right, this is not expected to be a full solution to the problem. I will look more at preferentially reclaiming from the phys_segs corresponding to the default freelists, and ending the scan earlier.

I am wondering if the intent behind the current implementation is to provide a consistent runtime for reclamation. Suppose we started scanning from the beginning of physical memory and over time reclaimed more and more runs. Subsequent scans will take longer and longer since they always start from the same place. Perhaps we could maintain some cursor that gets updated after a scan and is used to mark the beginning of subsequent scans.

No. To reclaim a run, vm_page_reclaim_run() is potentially allocating pages to copy any valid code/data pages from the run into. Through the use of appropriate parameters to vm_page_alloc_contig(), vm_page_reclaim_run() is trying to ensure that we are reclaiming memory at one end of the physical address space and allocating from the other. If we were to try to reclaim runs lower in the physical address space, then we might reduce the time spent scanning for runs at the expense of increasing the time spent in vm_page_alloc_contig() during vm_page_reclaim_run().

The scan for runs within a physical segment really needs to proceed from low to high physical addresses, since it relies on the way that the buddy allocator marks the page at the start of a free block. That said, one possibility might be to identify the largest eligible physical segment and start the scan at the end - X bytes, and if that fails to yield results start another scan at end - 2X bytes (ending at end - X), and so on.
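To make the windowed idea concrete, here is a minimal sketch of the control flow only. It assumes a hypothetical callback that scans one address range from low to high and reports whether it found a run; none of these names exist in the tree, and the window size X is left to the caller.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-window scan: walks [lo, hi) from low to high. */
typedef bool (*sketch_scan_fn)(uint64_t lo, uint64_t hi, void *arg);

/*
 * Try [end - X, end) first, then [end - 2X, end - X), and so on down to
 * the start of the segment.  Each window is still scanned low to high.
 */
static bool
sketch_scan_from_top(uint64_t start, uint64_t end, uint64_t window,
    sketch_scan_fn scan, void *arg)
{
	uint64_t hi = end;

	if (window == 0)
		return (false);
	while (hi > start) {
		/* The last window may be shorter than X. */
		uint64_t lo = (hi - start > window) ? hi - window : start;

		if (scan(lo, hi, arg))
			return (true);
		hi = lo;
	}
	return (false);
}

/* Example callback: records each window and succeeds if it holds target. */
struct sketch_probe {
	uint64_t target;
	int calls;
	uint64_t last_lo, last_hi;
};

static bool
sketch_probe_scan(uint64_t lo, uint64_t hi, void *arg)
{
	struct sketch_probe *p = arg;

	p->calls++;
	p->last_lo = lo;
	p->last_hi = hi;
	return (p->target >= lo && p->target < hi);
}
```

One limitation of this naive version is that a run straddling a window boundary is never seen, so a real implementation would presumably need overlapping windows or some other handling of that case.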

Let me think about this until the weekend.