This revision is a quick-and-dirty approach to segregating the physical memory that underlies UMA_ZONE_NOFREE objects, such as the VM object zone, so that permanent fragmentation of physical memory is avoided.
Event Timeline
Mark, I believe that you had a dtrace script for evaluating fragmentation. If this patch is effective, I'll try to come up with a committable patch.
It's just a quick C program which can dump the slabs from the kegs of a specified zone. In some cases (VTOSLAB == 0 and ppera == 1), the slab addresses are in the direct map and thus can be used to measure fragmentation of physical memory. For instance, with less than 18 hours of uptime on my desktop, I get the following output for the VM object zone: https://reviews.freebsd.org/P200
Each line corresponds to a 2 MB page, and the first column is the number of 4 KB slabs allocated to the object keg from that 2 MB page. In particular, objects are allocated across more than 5000 2 MB pages.
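For context, here's a minimal sketch (not markj's actual program, which reads the slab lists out of a running kernel) of how slab addresses taken from the direct map can be bucketed by 2 MB superpage to produce a histogram like the one above; the input array here is hypothetical:

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define	SUPERPAGE_SIZE	(2UL * 1024 * 1024)	/* amd64 superpage */

static int
addr_cmp(const void *a, const void *b)
{
	uintptr_t x = *(const uintptr_t *)a;
	uintptr_t y = *(const uintptr_t *)b;

	return ((x > y) - (x < y));
}

/*
 * Print one line per 2 MB page, the first column being the number of
 * 4 KB slabs that are backed by that page.
 */
static void
print_histogram(uintptr_t *slabs, size_t nslabs)
{
	uintptr_t page;
	size_t count, i;

	qsort(slabs, nslabs, sizeof(slabs[0]), addr_cmp);
	for (i = 0; i < nslabs; i += count) {
		page = slabs[i] & ~(SUPERPAGE_SIZE - 1);
		for (count = 0; i + count < nslabs &&
		    (slabs[i + count] & ~(SUPERPAGE_SIZE - 1)) == page;)
			count++;
		printf("%zu\t%#jx\n", count, (uintmax_t)page);
	}
}

int
main(void)
{
	/* Hypothetical input: two slabs in one 2 MB page, one in another. */
	uintptr_t slabs[] = { 0x200000, 0x201000, 0x400000 };

	print_histogram(slabs, sizeof(slabs) / sizeof(slabs[0]));
	return (0);
}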
The program is here: https://people.freebsd.org/~markj/umaslabs/
I'll test the patch on my desktop. At one point I wrote a virtually identical patch and found that it indeed helped a fair bit.
Obviously this isn't an apples-to-apples comparison, but things look much better with the patch so far: https://reviews.freebsd.org/P201
As one would expect, I'm also seeing some sharing among different NOFREE zones. For instance, the large page with 97 VM object slabs also contains nearly 200 thread slabs.
Fragmentation got substantially worse overnight, though it's still much better than before: https://reviews.freebsd.org/P203
I have no evidence for this, but my suspicion is that the cron job which updates the locate(1) database triggered many VM object allocations. By that point, my system was low on free pages, so large contiguous allocations weren't possible.
sys/malloc.h:63
The "0" was to avoid conflict with the mbuf flag. Currently, I'm thinking of M_NEVERFREED as an alternative. Thoughts?
For 32-bit machines, where we don't have a direct map, this pool-based implementation won't work without explicit allocation of KVA and mapping from that KVA to the allocated physical pages. Alternatively, we could try to achieve physical segregation through reservations rather than pooling. In other words, just use the kernel object, but segregate the M_NEVERFREED allocations through segregated ranges of page indices within the kernel object (and kernel virtual addresses). The disadvantage of this approach is that we lose any segregation whatsoever when an entire reservation can't be allocated, whereas the pooling-based approach will just give us lesser segregation.
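As an illustration only (not code from the patch), the reservation-based idea might look roughly like the following: keep a separate allocation cursor for M_NEVERFREED page indices so that they cluster within reservation-sized (2 MB, i.e., 512-page) ranges of the kernel object. All names and the cursor scheme here are hypothetical:

#include <stdint.h>

#define	RESERV_PAGES	512	/* pages per 2 MB reservation on amd64 */
#define	ROUNDUP(x, y)	((((x) + (y) - 1) / (y)) * (y))

static uint64_t nofree_cursor;	/* hypothetical cursor, in pindex units */

/*
 * Hand out a range of "npages" page indices for an M_NEVERFREED
 * allocation, keeping such allocations clustered within
 * reservation-sized ranges of the kernel object.
 */
static uint64_t
nofree_pindex_alloc(uint64_t npages)
{
	uint64_t start = nofree_cursor;

	/*
	 * Start a fresh reservation-aligned range if the request
	 * would spill across a 2 MB reservation boundary.
	 */
	if (start / RESERV_PAGES != (start + npages - 1) / RESERV_PAGES)
		start = ROUNDUP(start, RESERV_PAGES);
	nofree_cursor = start + npages;
	return (start);
}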
sys/malloc.h:63
M_PERMANENT is the only other alternative I can think of. M_NEVERFREED seems preferable to me.
As I think about an implementation based on reservations rather than a pool, the arena parameter to kmem_malloc_domain() makes less and less sense to me. We already have the newish M_EXEC flag, and together with the flag being proposed here, such flags can drive the selection of the arena rather than pointlessly passing in an arena pointer that we don't really use except, essentially, as a flag.
I would also point out that kmem_alloc_attr_domain() doesn't take an arena pointer as a parameter.
Here is just a portion of what a patch eliminating the arena parameter would look like. It eliminates pointless work from kmem_malloc_domain()'s callers.
Index: vm/uma_core.c
===================================================================
--- vm/uma_core.c	(revision 337447)
+++ vm/uma_core.c	(working copy)
@@ -3680,32 +3680,22 @@ uma_zone_exhausted_nolock(uma_zone_t zone)
 void *
 uma_large_malloc_domain(vm_size_t size, int domain, int wait)
 {
-	struct vmem *arena;
 	vm_offset_t addr;
 	uma_slab_t slab;
 
-#if VM_NRESERVLEVEL > 0
-	if (__predict_true((wait & M_EXEC) == 0))
-		arena = kernel_arena;
-	else
-		arena = kernel_rwx_arena;
-#else
-	arena = kernel_arena;
-#endif
-
 	slab = zone_alloc_item(slabzone, NULL, domain, wait);
 	if (slab == NULL)
 		return (NULL);
 	if (domain == UMA_ANYDOMAIN)
-		addr = kmem_malloc(arena, size, wait);
+		addr = kmem_malloc(NULL, size, wait);
 	else
-		addr = kmem_malloc_domain(arena, domain, size, wait);
+		addr = kmem_malloc_domain(domain, size, wait);
 	if (addr != 0) {
 		vsetslab(addr, slab);
 		slab->us_data = (void *)addr;
 		slab->us_flags = UMA_SLAB_KERNEL | UMA_SLAB_MALLOC;
 #if VM_NRESERVLEVEL > 0
-		if (__predict_false(arena == kernel_rwx_arena))
+		if (__predict_false(wait & M_EXEC))
 			slab->us_flags |= UMA_SLAB_KRWX;
 #endif
 		slab->us_size = size;
Index: vm/vm_kern.c
===================================================================
--- vm/vm_kern.c	(revision 337447)
+++ vm/vm_kern.c	(working copy)
@@ -372,7 +372,7 @@ kmem_suballoc(vm_map_t parent, vm_offset_t *min, v
  *	Allocate wired-down pages in the kernel's address space.
  */
 vm_offset_t
-kmem_malloc_domain(struct vmem *vmem, int domain, vm_size_t size, int flags)
+kmem_malloc_domain(int domain, vm_size_t size, int flags)
 {
 	vmem_t *arena;
 	vm_offset_t addr;
@@ -379,16 +379,11 @@ vm_offset_t
 	int rv;
 
 #if VM_NRESERVLEVEL > 0
-	KASSERT(vmem == kernel_arena || vmem == kernel_rwx_arena,
-	    ("kmem_malloc_domain: Only kernel_arena or kernel_rwx_arena "
-	    "are supported."));
-	if (__predict_true(vmem == kernel_arena))
+	if (__predict_true((flags & M_EXEC) == 0))
 		arena = vm_dom[domain].vmd_kernel_arena;
 	else
 		arena = vm_dom[domain].vmd_kernel_rwx_arena;
 #else
-	KASSERT(vmem == kernel_arena,
-	    ("kmem_malloc_domain: Only kernel_arena is supported."));
 	arena = vm_dom[domain].vmd_kernel_arena;
 #endif
 	size = round_page(size);
@@ -404,7 +399,7 @@ vm_offset_t
 }
 
 vm_offset_t
-kmem_malloc(struct vmem *vmem, vm_size_t size, int flags)
+kmem_malloc(struct vmem *vmem __unused, vm_size_t size, int flags)
 {
 	struct vm_domainset_iter di;
 	vm_offset_t addr;
@@ -412,7 +407,7 @@ vm_offset_t
 
 	vm_domainset_iter_malloc_init(&di, kernel_object, &domain, &flags);
 	do {
-		addr = kmem_malloc_domain(vmem, domain, size, flags);
+		addr = kmem_malloc_domain(domain, size, flags);
 		if (addr != 0)
 			break;
 	} while (vm_domainset_iter_malloc(&di, &domain, &flags) == 0);
@alc, you asked about a way to measure the fragmentation; might https://reviews.freebsd.org/D40575 be something that helps with this?
What's the current state of affairs for this review?
Can this still be used on -current, and would you like to get some numbers from a memory-constrained system and/or from a jail host with more than 20 jails running several databases and web servers?
This doesn't apply to -current anymore; the conflicts are mainly in vm_page.h, where all flag bits up to 0x8000 are already used. I tried a naive, mechanical adaptation, without researching or understanding this part, by using 0x10000 for VM_ALLOC_NOFREE, setting VM_ALLOC_COUNT_MASK to 0x1ffff, and setting VM_ALLOC_COUNT_SHIFT to 17, but this panics.
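For reference, a hypothetical reconstruction of that naive tweak; the values below are taken from the comment above, not from a working patch (as noted, this combination panicked), and the actual surrounding definitions in vm_page.h on -current may differ:

/*
 * Hypothetical sketch of the mechanical vm_page.h adaptation
 * described above; not a working change.
 */
#define	VM_ALLOC_NOFREE		0x10000		/* new: allocation is never freed */
#define	VM_ALLOC_COUNT_MASK	0x1ffff		/* as tried in the comment */
#define	VM_ALLOC_COUNT_SHIFT	17		/* bumped to make room for the new bit */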
Do you have an updated version of this? I would like to test what effect it has on my fragmentation plots (see https://lists.freebsd.org/archives/freebsd-current/2024-May/005907.html).