vmem uses span tags to delimit imported segments, so that they can be
released if the segment becomes free in the future. However, the
per-domain kernel KVA arenas never release resources. Furthermore, the
span tags prevent coalescing free segments across KVA_QUANTUM
boundaries. As a minor optimization, avoid allocating span tags in this
case.
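The effect of span tags on coalescing can be illustrated with a small userspace toy model. This is only a sketch, not the code in subr_vmem.c: the struct and function names are made up, and it assumes every import is exactly KVA_QUANTUM bytes and KVA_QUANTUM-aligned, so that a span boundary always falls on a multiple of KVA_QUANTUM.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define KVA_QUANTUM ((uint64_t)2 << 20)	/* assumed 2MB import size */

/* A free address range [start, end).  Hypothetical type, not vmem's. */
struct seg {
	uint64_t start;
	uint64_t end;
};

/*
 * Merge adjacent free segments of a sorted array in place and return
 * the new segment count.  With span_tags set, two segments that meet
 * on a KVA_QUANTUM boundary are kept apart, modelling a span tag that
 * sits between them.
 */
size_t
coalesce(struct seg *segs, size_t n, bool span_tags)
{
	size_t out = 0;

	for (size_t i = 0; i < n; i++) {
		bool adjacent = out > 0 && segs[out - 1].end == segs[i].start;
		bool at_span = segs[i].start % KVA_QUANTUM == 0;

		if (adjacent && !(span_tags && at_span))
			segs[out - 1].end = segs[i].end;
		else
			segs[out++] = segs[i];
	}
	return (out);
}

/* Size in bytes of the largest free segment. */
uint64_t
largest_free(const struct seg *segs, size_t n)
{
	uint64_t best = 0;

	for (size_t i = 0; i < n; i++)
		if (segs[i].end - segs[i].start > best)
			best = segs[i].end - segs[i].start;
	return (best);
}
```

In this model, a free 4KB tail of one import and a free 12KB head of the next stay separate when span tags are present, so a 16KB request cannot be satisfied from them; without span tags they merge into a single 16KB segment.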
Details
This was motivated by looking at vm.pmap.kernel_maps during poudriere
runs. I see many runs of 511 4KB mappings. Since UMA uses the direct
map for page-sized slabs, most allocations into kernel_object are > 4KB,
so we end up with page-sized holes, inhibiting superpage promotion and
causing fragmentation since kernel_object reservations remain partially
populated. For example:
0xfffffe0215200000-0xfffffe02157ff000 rw-s- WB 0 2 511
0xfffffe0215800000-0xfffffe0215dff000 rw-s- WB 0 2 511
0xfffffe0215e00000-0xfffffe02163ff000 rw-s- WB 0 2 511
0xfffffe0216400000-0xfffffe02165ff000 rw-s- WB 0 0 511
0xfffffe0216600000-0xfffffe02167ff000 rw-s- WB 0 0 511
0xfffffe0216800000-0xfffffe02169ff000 rw-s- WB 0 0 511
0xfffffe0216a00000-0xfffffe0216dff000 rw-s- WB 0 1 511
0xfffffe0216e00000-0xfffffe02175ff000 rw-s- WB 0 3 511
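As a sanity check on how to read these rows, the sketch below assumes the last three columns count 1GB, 2MB and 4KB mappings in the range, and that the end address is exclusive. The helper names are hypothetical. Under those assumptions the counts add up to the length of each range, and each range stops exactly one 4KB page short of a 2MB boundary.

```c
#include <stdint.h>

#define PAGE_4K ((uint64_t)1 << 12)
#define PAGE_2M ((uint64_t)1 << 21)
#define PAGE_1G ((uint64_t)1 << 30)

/* Bytes covered by a row that reports the given mapping counts. */
uint64_t
row_bytes(uint64_t n1g, uint64_t n2m, uint64_t n4k)
{
	return (n1g * PAGE_1G + n2m * PAGE_2M + n4k * PAGE_4K);
}

/* Number of 4KB pages between a row's end and the next 2MB boundary. */
uint64_t
pages_short_of_2m(uint64_t end)
{
	return ((PAGE_2M - end % PAGE_2M) % PAGE_2M / PAGE_4K);
}
```

For the first row, row_bytes(0, 2, 511) is 0x5ff000, which matches 0xfffffe02157ff000 - 0xfffffe0215200000, and pages_short_of_2m(0xfffffe02157ff000) is 1: the single missing page is what keeps the trailing 511 mappings from being promoted to a superpage.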
I tried measuring 2MB mapping usage within the kernel map during the
first few minutes of a poudriere run.
Before: https://reviews.freebsd.org/P378
After: https://reviews.freebsd.org/P379
There are some other approaches that would also help:
- Use a larger import quantum on platforms where KVA is cheap
- Use the per-domain arenas to manage physical memory instead of KVA
The second would avoid creation of holes, but we'd still have internal
fragmentation due to the rarity of 4KB allocations. Coalescing across 2MB
boundaries would also be less likely to occur, and we would want some
mechanism to reclaim memory from the arenas during a severe shortage.
I still see a number of holes even with the patch applied; I'm not yet sure
why. It might be that something is occasionally allocating and freeing 4KB
of memory using kmem_malloc().
Event Timeline
sys/kern/subr_vmem.c, line 804:
The segment list is supposed to be sorted, but here we are assuming that a newly imported range always sorts to the end of the list, which was surprising to me. The vmem implementation in illumos seems to do the same thing. I can't see a cheap way to ensure that the new segment is sorted.
> Before: https://reviews.freebsd.org/P378
> After: https://reviews.freebsd.org/P379
I meant to note, the columns are the number of 1GB, 2MB and 4KB mappings in the kernel map, respectively.
The increase is larger than it appears: of the ~1100 2MB mappings that exist when the test is started, 858 are from the static mapping of vm_page_array.
> I still see a number of holes even with the patch applied, I'm not yet sure why.
I spent some more time on this. It is simply a result of NUMA: adjacent 2MB virtual pages get allocated to different domains, so there is no possibility of coalescing KVA allocations. Since ZFS frequently allocates kmem buffers with a size not equal to a power of 2 (or even a sum of two consecutive powers of 2), we end up with many runs of 511 4KB pages in the kernel map. With NUMA disabled and this patch applied, we get very good superpage utilization in the kernel map when poudriere is running.
I think KVA_QUANTUM should be larger than 2MB on NUMA systems to help mitigate the problem. I can't really see a downside to having a larger KVA_QUANTUM, except maybe that we waste kernel page table pages if the imported KVA is underutilized.
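A rough model of the NUMA effect, for illustration only. It assumes that consecutive KVA_QUANTUM-sized chunks of KVA rotate through the domains, and that a chunk is filled with equal-sized allocations. Neither assumption is taken from the kernel source, the function names are hypothetical, and the 28KB size is just an example of a size that is neither a power of 2 nor a sum of two consecutive powers of 2.

```c
#include <stdint.h>

#define PAGE_SIZE_4K	((uint64_t)4096)
#define KVA_QUANTUM	((uint64_t)2 << 20)	/* assumed 2MB import size */

/*
 * Assumed striping model: consecutive KVA_QUANTUM chunks are handed to
 * domains in rotation.  ndomains must be nonzero.
 */
unsigned
chunk_domain(uint64_t addr, unsigned ndomains)
{
	return ((unsigned)((addr / KVA_QUANTUM) % ndomains));
}

/*
 * Pages left unused after packing as many allocations of alloc_bytes
 * (rounded up to whole pages, nonzero) as fit into one chunk.
 */
uint64_t
leftover_pages(uint64_t alloc_bytes)
{
	uint64_t chunk_pages = KVA_QUANTUM / PAGE_SIZE_4K;
	uint64_t alloc_pages = (alloc_bytes + PAGE_SIZE_4K - 1) / PAGE_SIZE_4K;

	return (chunk_pages % alloc_pages);
}
```

With two domains, chunk_domain() differs for any two neighbouring chunks, so a leftover tail in one chunk can never be merged with free space in the next. In this model a chunk filled with 28KB (7-page) buffers holds 73 of them and leaves one page over, which is one way to end up with a run of 511 mapped 4KB pages; with a single domain the neighbouring chunk belongs to the same arena and the tail can be coalesced.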
Assert that the arena is empty when setting import/release functions.
This is true for all consumers in the tree. It could be relaxed to only
require that the arena be empty if a release function is set.