
vmem: Avoid allocating span tags when segments are never released.
Closed, Public

Authored by markj on Apr 23 2020, 4:33 PM.

Details

Summary

vmem uses span tags to delimit imported segments, so that a segment can
be released if it becomes entirely free in the future. However, the
per-domain kernel KVA arenas never release resources, so they have no
use for span tags. Furthermore, the span tags prevent coalescing of free
segments across KVA_QUANTUM boundaries. As a minor optimization, avoid
allocating span tags for arenas that never release their segments.
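
The effect on coalescing can be sketched with a toy userland model. This is
not the subr_vmem.c code: struct toy_seg, toy_coalesce() and
toy_free_segments() are hypothetical names, and the 2MB quantum used in the
usage note is an assumption about KVA_QUANTUM on the platform in question.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model of a free segment; not the boundary tag layout used by vmem. */
struct toy_seg {
	uint64_t start;
	uint64_t end;		/* exclusive */
	bool	 span_before;	/* a span tag separates this from the previous segment */
};

/*
 * Merge address-adjacent free segments in place.  segs[] must be sorted by
 * address.  A span tag in front of a segment blocks the merge.  Returns the
 * number of segments that remain.
 */
static size_t
toy_coalesce(struct toy_seg *segs, size_t n)
{
	size_t out = 0;

	for (size_t i = 0; i < n; i++) {
		if (out > 0 && !segs[i].span_before &&
		    segs[out - 1].end == segs[i].start)
			segs[out - 1].end = segs[i].end;
		else
			segs[out++] = segs[i];
	}
	return (out);
}

/*
 * Model nimports back-to-back imports of quantum bytes each, all free, and
 * return how many free segments are left after coalescing.
 */
static size_t
toy_free_segments(uint64_t base, uint64_t quantum, size_t nimports,
    bool use_span_tags, struct toy_seg *segs)
{
	assert(quantum > 0);
	for (size_t i = 0; i < nimports; i++) {
		segs[i].start = base + i * quantum;
		segs[i].end = segs[i].start + quantum;
		segs[i].span_before = use_span_tags;
	}
	return (toy_coalesce(segs, nimports));
}
```

With four adjacent 2MB imports, the model leaves four separate 2MB free
segments when span tags are used and a single 8MB free segment when they are
not, so only the latter can satisfy an allocation that crosses an import
boundary.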

Test Plan

This was motivated by looking at vm.pmap.kernel_pmaps during poudriere
runs. I see many runs of 511 4KB mappings. Since UMA uses the direct
map for page-sized slabs, most allocations into kernel_object are > 4KB,
so we end up with page-sized holes, inhibiting superpage promotion and
causing fragmentation since kernel_object reservations remain partially
populated. For example:

0xfffffe0215200000-0xfffffe02157ff000 rw-s- WB 0 2 511
0xfffffe0215800000-0xfffffe0215dff000 rw-s- WB 0 2 511
0xfffffe0215e00000-0xfffffe02163ff000 rw-s- WB 0 2 511
0xfffffe0216400000-0xfffffe02165ff000 rw-s- WB 0 0 511
0xfffffe0216600000-0xfffffe02167ff000 rw-s- WB 0 0 511
0xfffffe0216800000-0xfffffe02169ff000 rw-s- WB 0 0 511
0xfffffe0216a00000-0xfffffe0216dff000 rw-s- WB 0 1 511
0xfffffe0216e00000-0xfffffe02175ff000 rw-s- WB 0 3 511
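
To make the listing easier to check by hand, here is a small sketch of the
arithmetic, reading the last three columns as counts of 1GB, 2MB and 4KB
mappings and the address range as half-open. The toy_* names are
hypothetical helpers, not part of the kernel or of the sysctl output.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define	TOY_SZ_4K	(UINT64_C(1) << 12)
#define	TOY_SZ_2M	(UINT64_C(1) << 21)
#define	TOY_SZ_1G	(UINT64_C(1) << 30)

/* Bytes covered by the given numbers of 1GB, 2MB and 4KB mappings. */
static uint64_t
toy_mapped_bytes(uint64_t n1g, uint64_t n2m, uint64_t n4k)
{
	return (n1g * TOY_SZ_1G + n2m * TOY_SZ_2M + n4k * TOY_SZ_4K);
}

/* True if [start, end) is exactly as large as the three counts claim. */
static bool
toy_row_consistent(uint64_t start, uint64_t end, uint64_t n1g, uint64_t n2m,
    uint64_t n4k)
{
	assert(end > start);
	return (end - start == toy_mapped_bytes(n1g, n2m, n4k));
}

/* Size of the unmapped gap between one row's end and the next row's start. */
static uint64_t
toy_hole_bytes(uint64_t prev_end, uint64_t next_start)
{
	assert(next_start >= prev_end);
	return (next_start - prev_end);
}
```

For the first row, 0xfffffe02157ff000 - 0xfffffe0215200000 is 0x5ff000 bytes,
which equals 2 * 2MB + 511 * 4KB. The gap between the end of the first row
and the start of the second is 0x1000 bytes: a single unmapped 4KB page that
keeps the 511 surrounding pages from being promoted to a 2MB mapping.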

I tried measuring 2MB mapping usage within the kernel map during the
first few minutes of a poudriere run.

Before: https://reviews.freebsd.org/P378
After: https://reviews.freebsd.org/P379

There are some other approaches that would also help:

  • Use a larger import quantum on platforms where KVA is cheap
  • Use the per-domain arenas to manage physical memory instead of KVA

The second would avoid creation of holes, but we'd still have internal
fragmentation due to the rarity of 4KB allocations. Coalescing across 2MB
boundaries would also be less likely to occur, and we would want some
mechanism to reclaim memory from the arenas during a severe shortage.

I still see a number of holes even with the patch applied, and I'm not yet
sure why. It might be that something is occasionally allocating and freeing
4KB of memory using kmem_malloc().

Diff Detail

Lint: Lint Passed
Unit: No Test Coverage
Build Status: Buildable 32957
Build 30351: arc lint + arc unit

Event Timeline

markj added reviewers: alc, kib, jeff.
markj added inline comments.
sys/kern/subr_vmem.c, line 804

The segment list is supposed to be sorted, but here we are assuming that a newly imported range always sorts to the end of the list, which was surprising to me. The vmem implementation in illumos seems to do the same thing. I can't see a cheap way to ensure that the new segment is inserted in sorted order.

> Before: https://reviews.freebsd.org/P378
> After: https://reviews.freebsd.org/P379

I meant to note, the columns are the number of 1GB, 2MB and 4KB mappings in the kernel map, respectively.

The increase is larger than it appears: of the ~1100 2MB mappings that exist when the test is started, 858 are from the static mapping of vm_page_array.

> I still see a number of holes even with the patch applied, I'm not yet sure why.

I spent some more time on this. It is simply a result of NUMA: adjacent 2MB virtual pages get allocated to different domains, so there is no possibility of coalescing KVA allocations. Since ZFS frequently allocates kmem buffers with a size not equal to a power of 2 (or even a sum of two consecutive powers of 2), we end up with many runs of 511 4KB pages in the kernel map. With NUMA disabled and this patch applied, we get very good superpage utilization in the kernel map when poudriere is running. I think KVA_QUANTUM should be larger than 2MB on NUMA systems to help mitigate the problem. I can't really see a downside to having a larger KVA_QUANTUM, except maybe that we waste kernel page table pages if the imported KVA is underutilized.
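
A rough way to see why a larger KVA_QUANTUM might help is a toy packing
model, sketched below. It assumes that an imported span can never merge with
its neighbors (as when adjacent spans belong to different domains) and that
every allocation has the same size. Both assumptions are simplifications,
the function names are hypothetical, and the sizes in the usage note are
made up for illustration rather than measured.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Bytes left over at the tail of one imported span when the span cannot be
 * coalesced with its neighbors and is filled with allocations of alloc_size
 * bytes each.
 */
static uint64_t
toy_tail_waste(uint64_t quantum, uint64_t alloc_size)
{
	assert(alloc_size > 0 && alloc_size <= quantum);
	return (quantum % alloc_size);
}

/* The same leftover expressed in parts per thousand of the span. */
static unsigned
toy_waste_permille(uint64_t quantum, uint64_t alloc_size)
{
	return ((unsigned)(toy_tail_waste(quantum, alloc_size) * 1000 /
	    quantum));
}
```

For a hypothetical 1280KB buffer, a 2MB span holds one allocation and strands
768KB (375 per thousand), while an 8MB span holds six and strands 512KB
(62 per thousand). The model says nothing about page table overhead, which is
the downside noted above.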

Assert that the arena is empty when setting import/release functions.
This is true for all consumers in the tree. It could be relaxed to only
require that the arena be empty if a release function is set.
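
The shape of such a check might look like the following userland sketch. It
is not the committed code: struct toy_arena and the toy_* functions are
hypothetical stand-ins, and the rationale in the comment (avoiding a mix of
tagged and untagged imports in one arena) is an assumption about why the
assertion is useful.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef int	toy_import_t(void *arg, size_t size, uint64_t *addrp);
typedef void	toy_release_t(void *arg, uint64_t addr, size_t size);

/* Toy arena; only the fields needed for the illustration. */
struct toy_arena {
	size_t		 size;		/* bytes currently managed */
	toy_import_t	*importfn;
	toy_release_t	*releasefn;	/* NULL: imports are never returned */
	void		*arg;
};

/* Example release callback that does nothing. */
static void
toy_release_noop(void *arg, uint64_t addr, size_t size)
{
	(void)arg;
	(void)addr;
	(void)size;
}

static void
toy_set_import(struct toy_arena *a, toy_import_t *importfn,
    toy_release_t *releasefn, void *arg)
{
	/*
	 * Assumed rationale: switching policy on a populated arena could
	 * leave some imports delimited by span tags and others not.
	 */
	assert(a->size == 0);
	a->importfn = importfn;
	a->releasefn = releasefn;
	a->arg = arg;
}

/* Span tags are only needed if imported segments can be released. */
static bool
toy_needs_span_tags(const struct toy_arena *a)
{
	return (a->releasefn != NULL);
}
```

The relaxed variant mentioned above would correspond to asserting
`releasefn == NULL || a->size == 0` instead.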

I'd like to commit this in a couple of days if there are no objections.

This revision was not accepted when it landed; it landed in state Needs Review.
Aug 26 2020, 2:31 PM
This revision was automatically updated to reflect the committed changes.