
uma: embed uma slab in vm_page
Needs ReviewPublic

Authored by rlibby on Dec 11 2019, 7:34 PM.
Details

Reviewers
jeff
markj
Summary

Embed uma slabs in a vm_page to reduce internal fragmentation in the
slab. This is especially helpful for power-of-2 allocation sizes.

This is currently a work-in-progress. It needs a rename pass, a tunable
to disable it, and more KASSERTs about PG_OPAQUE. But the basic idea is
here.

Diff Detail

Lint: Passed
Unit: No Test Coverage
Build Status: Buildable 28084
Build 26229: arc lint + arc unit

Event Timeline

sys/vm/uma_int.h
300

There are several possible slab formats and flags that control how the translation from item address to slab pointer occurs. They are as follows:

vtoslab - This does not define a slab format but may be used with any format to translate a virtual address to a slab pointer by way of vm structures. On systems with a direct map and single-page allocations this is simple math. Without a direct map, or with multiple-page allocations, we have to walk the page tables in pmap_kextract() to discover the physical address of the page so that we can look it up in the page array. This flag is compatible with any keg that allocates memory from the virtual memory system.

Inline - The slab header is placed at the end of the allocated slab, at offset uk_pgoff from the beginning of the first item. This is the cheapest lookup, as we only have to mask address bits and add an offset; however, it may waste space for some item sizes, and it only works if the first item is virtually aligned according to the number of pages per allocation. In practice this means we only use this method for one-page slabs.

Embedded - The slab header is embedded within the vm_page itself. Many page fields are unused on direct-map systems when we do kernel memory allocations without an object. Currently this only applies to single-page allocations; above that, object linkage is required. This leaves the full space in the allocated slab available for user allocations. These slabs may only be found with vtoslab().

Offpage - The slab header is allocated from a separate zone and looked up with either vtoslab or a hash, described below. Offpage allocations are used when the other formats would result in too much wasted space. Typically this means we could allocate an extra item if we did not consume slab space for the header, but the required slab header will not fit embedded in the page.

hash - Hash zones are only necessary when UMA cannot directly access memory or resolve it with vtoslab(). It is possible to allocate physical addresses, or memory at any regularly spaced numerical interval provided by a custom page allocator (uk_allocf), in this fashion. Hash zones use offpage slabs with hash linkage to resolve pointer-sized numbers to slab structures.
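A minimal sketch of the two cheap lookup paths described above, assuming one-page slabs; the flag, macro, and field names (UMA_ZONE_VTOSLAB, UMA_SLAB_MASK, uk_pgoff) follow the existing uma_int.h code, but the helper itself (item_to_slab) is made up for illustration and is not part of this diff:

static uma_slab_t
item_to_slab(uma_keg_t keg, void *item)
{
	vm_offset_t va;

	/* Mask down to the base of the slab's virtual page. */
	va = (vm_offset_t)item & ~UMA_SLAB_MASK;

	if ((keg->uk_flags & UMA_ZONE_VTOSLAB) != 0) {
		/* vtoslab: translate through vm structures (direct map or pmap_kextract). */
		return (vtoslab(va));
	}

	/* Inline: the header sits at a fixed offset from the slab base. */
	return ((uma_slab_t)(va + keg->uk_pgoff));
}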

ed note - I think some of these flags could better represent actual consumers. We have a relationship between format and lookup method that is somewhat weakly specified. I think the current flags may be OK for UMA-internal use, but externally the user should specify what their constraints are and let UMA figure it out.

sys/vm/vm_page.h
241–243

If you merge up to head, these three fields are collapsed into a single 32-bit entry. Moving this and ref_count up above opaque_end would give you an additional 64 bits of embedded slab space, for 128 bits total on non-debug kernels.

sys/vm/vm_page.c
2485

Per my comment above, I can't see why this change is required.

sys/vm/vm_page.h
241–243

If you move busy_lock instead of ref_count, you can preserve the property that slab pages are wired, so existing checks for wired pages, e.g., in the contig scan below, are sufficient. If we maintain the current property of slab pages being wired and unmanaged ((oflags & VPO_UNMANAGED) != 0), I think we should be able to avoid needing any checks for PG_OPAQUE outside of UMA.
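A minimal sketch of the kind of existing filter this refers to, assuming slab pages stay wired and unmanaged; the loop bounds (m_start, m_end) are hypothetical and the check is illustrative rather than taken from vm_page_scan_contig():

for (m = m_start; m < m_end; m++) {
	/* Wired or unmanaged pages are never relocation candidates. */
	if (vm_page_wired(m) || (m->oflags & VPO_UNMANAGED) != 0)
		continue;
	/* ... otherwise the page is a candidate for relocation ... */
}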

sys/vm/vm_page.c
2485

Yes, it was anticipating taking over the ref_count field. I will investigate further on the page organization. I agree it would be better if we could rely on existing checks.

sys/vm/vm_page.h
241–243

That doesn't work with lockless page lookup. Busy may be modified and rolled back.

Since wired is predicated on busy, it makes sense architecturally. We could also restructure the code in contig to trybusy before checking wired, which would work with the proposed restructuring. I don't mind the flag, however, because it gives a clear thing to assert on and check for if we find a weird-looking page.

sys/vm/vm_page.c
2485

If you go ahead with the current approach, I believe vm_page_reclaim_run() needs to be modified as well. This function returns runs of candidate pages which must be rechecked before reclamation can happen.

sys/vm/vm_page.h
241–243

We already re-check wired after tryxbusy succeeds, because we have to. We can eliminate the wired check before the tryxbusy and just live with unnecessary xbusy/xunbusy operations, which I think is what you are proposing. We could check for and skip VPO_UNMANAGED as a cheap substitute.
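A minimal sketch of the restructuring described here (not code from the diff): drop the unlocked wired pre-check and rely on the re-check after the page is exclusively busied, accepting the occasional wasted xbusy/xunbusy pair:

if (!vm_page_tryxbusy(m))
	continue;		/* someone else holds the page busy */
if (vm_page_wired(m)) {
	/* Wired (e.g. a slab page): undo the busy and skip it. */
	vm_page_xunbusy(m);
	continue;
}
/* ... the page is exclusively busied and unwired; safe to proceed ... */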

I like having the PG_OPAQUE flag for the reasons you mentioned; I would just prefer not to have to add extra checks in the VM to accommodate new uses of vm_page.

sys/vm/vm_page.c
2485

Note that this one doesn't inspect busy, so we would have to reserve and define wired for it to work here.

sys/vm/vm_page.h
241–243

It's OK to check wired in this case. We just can't rely on it being set, or else we can't use the field for the slab header.

I think the physical defrag-related functions are the only ones that should care. If we have to check the flag in a lot of places, I would agree.

We could also move astate plus a few of the one-byte fields and valid/dirty, but that seems grosser to me.

sys/vm/vm_page.c
2485

Not sure what you mean? Line 2530 checks vm_page_busied(m).

sys/vm/vm_page.h
241–243

There is a further complication here in that queue scans may operate on freed pages. In particular, the active queue scan assumes that m->object is either a valid object pointer or NULL. So we need to introduce checks in those paths as well. I can't really see a clean way to do it other than explicitly looking for PG_OPAQUE: we can't look at aflags if astate becomes part of the opaque region, and we can't rely on vm_page_wired() (and that check will go away soon anyway).

sys/vm/vm_page.c
2485

Long after looking at object, which we are overwriting.

sys/vm/vm_page.h
241–243

I thought we forced the queues to be coherent by the time they were allocated again?

sys/vm/vm_page.c
2485

I see now. Yes, you cannot rely on busy here if ref_count moves.

sys/vm/vm_page.h
241–243

Yes, a newly allocated page does not belong to any queue, but there is nothing synchronizing the page daemon with a given page's queue state, since the page daemon does not hold a reference that prevents the page's reuse. Page identity is only stable wrt queue scans once the page daemon acquires the object lock. The active queue scan can make no guarantees as to the state of the vm_page. It only takes care not to modify that state if it does not match what is expected.

sys/vm/vm_page.h
241–243

So it looks to me like the page scans will operate on freed pages but not on freshly reallocated pages. That is, if the batch has not yet been processed, we force the issue in vm_page_alloc().

This means that by the time the page gets to UMA, we can guarantee that it's not in a batch or queue, right? We should just be able to assert that the aflags are as we expect before we overwrite any fields.

sys/vm/vm_page.h
241–243

No, that is not true. Consider how the scan works: vm_pageout_next() collects a batch of pages under the page queue lock. Then the lock is released and we scan the pages with no locks held. This relies on the type-stability of vm_page structures, and the stability of several fields in the structure: ref_count (only until my page lock work is committed), object, astate, and md (active queue scan only).

In order for UMA to use these fields it must somehow synchronize with the page daemon at page allocation time. We don't have any mechanism to do that anymore. I can imagine a few ways to go about this without imposing too much overhead in the non-UMA case, but I'm still a bit fuzzy on exactly how much space UMA wants and what fields we are willing and able to reorder to get some of that space without changing anything else.
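A shape sketch of the scan structure described above; vm_pageout_next() is named in the comment, but the surrounding declarations and initialization are elided, so this only illustrates where the page queue lock is dropped:

/*
 * vm_pageout_next() refills its batch under the page queue lock; the lock
 * is then dropped and pages are examined with no locks held.
 */
while ((m = vm_pageout_next(&ss, false)) != NULL) {
	/*
	 * The page may already have been freed and reused, so only the
	 * type-stable fields (object, ref_count, astate, md) can be
	 * examined until the object lock is acquired.
	 */
	/* ... per-page processing ... */
}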

sys/vm/vm_page.h
241–243

I understand now. Thank you.

UMA really wants another 8 bytes. It would be easy but somewhat error-prone to always check busy first in these cases where we rely on type stability. That would be consistent with the lockless branch. I need to look at what the practical implications would be elsewhere. Just doing a PG flag won't solve the race for the pageout scan.

The MD field is also large, but that would mean we'd have to audit all uma_small_alloc pmaps to make sure it was safe to overwrite md, and the size would be variable, so I don't like it at all.

We don't need to re-order anything for this to be committed and useful. I proposed to Ryan that we get this first diff polished and in before worrying about finding more bits. We need to make sure that we're free from races with the existing code, however. Because the whole scan is racy, you could be preempted after checking flags and wired, then resume once UMA has taken over the page and the object field has been re-used for a non-object, which you would then try to lock.
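A minimal sketch of the hazard described here, as a fragment of a scan loop; the control flow is illustrative, not the actual vm_pageout code:

object = m->object;		/* unlocked read of a possibly stale pointer */
if (object == NULL)
	continue;
/*
 * Preemption window: the page can be freed and handed to UMA here, after
 * which m->object holds slab data rather than a vm_object pointer, and the
 * value read above may not have been an object either.
 */
VM_OBJECT_WLOCK(object);	/* the hazard: locking a non-object */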

I started looking at this revision again. Is there any reason we can't simply re-order pindex and object to give 40 bytes of opaque space at the beginning of the vm_page structure? With the current layout of uma_slab that would leave us with 16 bytes for the free bitmask, enough to track 128 items. It's not perfect, but it's good enough for almost all zones, and ZFS will benefit quite a bit since it makes heavy use of zones with ipers = 1.
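A rough space-accounting sketch for this proposal, assuming a 64-bit kernel; the sizes are the back-of-the-envelope figures implied by the numbers above rather than measurements, and the macro name is invented for illustration:

#include <sys/param.h>	/* NBBY */

/*
 * Assumed layout: 40 bytes of opaque space at the head of vm_page, roughly
 * 24 bytes of embedded slab header excluding the free bitmask, leaving
 * 16 bytes for the bitmask itself.
 */
#define	EMBEDDED_SLAB_BITSET_BYTES	16
_Static_assert(EMBEDDED_SLAB_BITSET_BYTES * NBBY == 128,
    "a 16-byte free bitmask tracks up to 128 items per slab");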