
vm_page: Implement lazy page initialization
Needs ReviewPublic

Authored by markj on Jun 2 2023, 7:36 PM.

Details

Summary

FreeBSD's boot times have decreased to the point where vm_page array
initialization represents a significant fraction of the total boot time.
For example, when booting FreeBSD in Firecracker (a VMM designed to
support lightweight VMs) with 128MB and 1GB of RAM, vm_page
initialization consumes 9% (3ms) and 37% (21.5ms) of the kernel boot
time, respectively. This is generally relevant in cloud environments,
where one wants to be able to spin up VMs as quickly as possible.

This patch implements lazy initialization of (most) page structures,
following a suggestion from cperciva@. The idea is to introduce a new
free pool, VM_FREEPOOL_LAZYINIT, into which all vm_page structures are
initially placed. For this to work, we need only initialize the first
free page of each chunk placed into the buddy allocator. Then, early
page allocations draw from the lazy init pool and initialize vm_page
chunks (up to 16MB, 4096 pages) on demand. Once APs are started, an
idle-priority thread drains the lazy init pool in the background to
avoid introducing extra latency in the allocator. With this scheme,
almost all of the initialization work is moved out of the critical path.
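The on-demand scheme described above can be sketched as a toy userspace model (all names and sizes here are illustrative stand-ins, not the actual kernel code):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define NPAGES     16384   /* toy physical memory: 16384 pages */
#define LAZY_CHUNK 4096    /* pages initialized per on-demand batch (16MB) */

struct toy_page {
	bool initialized;
};

static struct toy_page page_array[NPAGES];
static size_t lazy_cursor;  /* index of the first page not yet initialized */

static void
toy_page_init(struct toy_page *m)
{
	m->initialized = true;
}

/* Boot path: initialize only the first page handed to the allocator. */
static void
toy_boot_init(void)
{
	memset(page_array, 0, sizeof(page_array));
	toy_page_init(&page_array[0]);
	lazy_cursor = 1;
}

/* Allocation path: initialize one batch on demand, then hand out a page. */
static struct toy_page *
toy_alloc_page(void)
{
	if (lazy_cursor < NPAGES) {
		size_t end = lazy_cursor + LAZY_CHUNK;

		if (end > NPAGES)
			end = NPAGES;
		for (size_t i = lazy_cursor; i < end; i++)
			toy_page_init(&page_array[i]);
		lazy_cursor = end;
	}
	return (&page_array[0]);
}
```

Here only page 0 is initialized at "boot"; the first allocation initializes a 4096-page batch, mirroring how early allocations draw from the lazy init pool and initialize up to 16MB of vm_page chunks at a time.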

A couple of vm_phys operations require the pool to be drained before
they can run: vm_phys_scan_contig() and vm_phys_unfree_page(). However,
these are rare operations. I believe that
vm_phys_find_freelist_contig() does not require any special treatment,
as it only ever accesses the first page in a power-of-2-sized free page
chunk, which is always initialized.

For now the new pool is only used on amd64 and arm64, since that's where
I can easily test and those platforms would get the most benefit.

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

markj requested review of this revision.Jun 2 2023, 7:36 PM

Consider e.g. kmem_bootstrap_free(). Doesn't the code there depend on the vm_page structures already being initialized?

When I worked on the pcpu + MAXCPU patch, I initially tried to reuse kmem_bootstrap_free() in some form for excess pcpu pages. This did not work because the vm_page structures were not initialized, and the hacks I tried to init the pages were very ugly. I am not sure how many such instances we have in the tree.

I think it would be easier to accept the approach if, for amd64, you forced initialization of all vm_page structures for pa <= 4G. I am not sure what arm64 needs there, if anything.

sys/vm/vm_phys.c
909–927

I'm a bit uneasy with this approach because adding a pool imposes a perpetual, albeit small, run-time cost. Consider this block of code. For as long as the machine is running, whenever we need to transfer pages from the default pool to the direct pool, or whenever an attempt to allocate a contiguous set of pages, like a superpage reservation, fails, we are going to have to iterate over more heads of empty lists.

In D40403#919776, @kib wrote:

Consider e.g. kmem_bootstrap_free(). Doesn't the code there depend on the vm_page structures already being initialized?

Yes. I forgot about that function and didn't consider that anything would call vm_phys_free_pages() with an uninitialized page. I think this is not too hard to fix though.

When I worked on the pcpu + MAXCPU patch, I initially tried to reuse kmem_bootstrap_free() in some form for excess pcpu pages. This did not work because the vm_page structures were not initialized, and the hacks I tried to init the pages were very ugly. I am not sure how many such instances we have in the tree.

I think it would be easier to accept the approach if, for amd64, you forced initialization of all vm_page structures for pa <= 4G. I am not sure what arm64 needs there, if anything.

Would you instead consider an approach where an INVARIANTS kernel can catch use-before-initialization of vm_pages? In principle this can already be done with KMSAN, but it would be better if any debug kernel can identify bugs, and we don't care as much about boot times in that case.

Initializing everything below 4GB would still hurt boot times for small VMs, and might not catch every case you're worried about.
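One hypothetical shape for such an INVARIANTS check, sketched here in userspace C: poison lazily-deferred pages at boot with a sentinel, and have debug kernels assert the sentinel is gone before a page is used. The sentinel value and all names below are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Invented sentinel; a real patch would pick an impossible field value. */
#define PAGE_POISON 0xdeadbeefu

struct toy_page {
	unsigned int marker;
};

/* Boot: poison every lazily-deferred page instead of leaving it as-is. */
static void
toy_poison(struct toy_page *m)
{
	m->marker = PAGE_POISON;
}

/* Deferred initialization clears the poison. */
static void
toy_init(struct toy_page *m)
{
	m->marker = 0;
}

/* Analogue of a KASSERT in the allocator's page-access paths. */
static bool
toy_page_is_initialized(const struct toy_page *m)
{
	return (m->marker != PAGE_POISON);
}
```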

sys/vm/vm_phys.c
909–927

My feeling was that operations which have to fall back to other freepools are already expensive: they have to touch each page in the returned chunk. But this doesn't apply to some of the examples you gave.

I can't see a way to fully eliminate that overhead. One mitigation which might help would be to make the starting index a variable instead of 0. Initially it would be equal to VM_FREEPOOL_LAZYINIT. Once deferred initialization has finished, it would be set to VM_FREEPOOL_DEFAULT, so we'd avoid touching the new, empty list heads.

Fix kmem_bootstrap_free(): in vm_page_startup(), initialize vm_pages that are
not covered by phys_avail[]. In other words, all pages that belong to a
vm_phys_seg but are not handed over to the vm_phys buddy allocator are
initialized as before.

When searching across free pools for free pages, start searching from
vm_default_freepool instead of 0. When deferred initialization is completed,
a search will no longer visit the queue heads for VM_FREEPOOL_LAZYINIT.
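The vm_default_freepool indirection can be illustrated with a minimal sketch (the pool names mirror the patch; everything else is invented for illustration):

```c
#include <assert.h>

/* Hypothetical stand-ins for the free pool indices in the patch. */
enum {
	VM_FREEPOOL_LAZYINIT,
	VM_FREEPOOL_DEFAULT,
	VM_FREEPOOL_DIRECT,
	VM_NFREEPOOL
};

/* Starts at LAZYINIT during boot; advanced once draining completes. */
static int vm_default_freepool = VM_FREEPOOL_LAZYINIT;

/* Cross-pool scans start at vm_default_freepool rather than 0, so the
 * permanently empty LAZYINIT queue heads are skipped after the drain. */
static int
pools_visited(void)
{
	int visited = 0;

	for (int pool = vm_default_freepool; pool < VM_NFREEPOOL; pool++)
		visited++;
	return (visited);
}

/* Called once the background thread finishes deferred initialization. */
static void
lazy_init_done(void)
{
	vm_default_freepool = VM_FREEPOOL_DEFAULT;
}
```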

Suppose that a kernel dump happens very early, before pages are initialized. Then, e.g., kernel memory needs the backing pages to be functional for some variations of busdma to work. I know that at the very least DMAR busdma requires it.

Another question: how would the lazy initialization interact with the efi_create_1t1_map() use of VM_PHYS_TO_PAGE()? It tries to detect an uninitialized page and force-init it, but that at least sounds like a race.

I still think that initing everything below 4G is the best route, if going with the lazy initialization. When/if UEFI no longer guarantees that everything sensitive for early startup lives below 4G, we would have no choice but to extend this range.

Kboot passes through the memory map from Linux. This map is no longer 1:1, nor is it necessarily below 4GB. The latter I'm not sure I've seen; the former is always the case, unlike what loader.efi provides. And it's impossible to change, since one can only call SetVirtualAddressMap() once, and Linux has already called it by the time we get control.

In D40403#920211, @kib wrote:

Suppose that a kernel dump happens very early, before pages are initialized. Then, e.g., kernel memory needs the backing pages to be functional for some variations of busdma to work. I know that at the very least DMAR busdma requires it.

"kernel dump" meaning minidump? I think it is not a problem, since the minidump code always excludes free pages (they are not set in the vm_page_dump bitset), and non-free pages are always initialized.

Another question: how would the lazy initialization interact with the efi_create_1t1_map() use of VM_PHYS_TO_PAGE()? It tries to detect an uninitialized page and force-init it, but that at least sounds like a race.

Only pages added to the vm_phys buddy allocator can be uninitialized after vm_page_startup(). So efi_create_1t1_map() should be fine, or else it is reinitializing pages belonging to the buddy allocator, i.e., there is already a bug.

I still think that initing everything below 4G is the best route, if going with the lazy initialization. When/if UEFI no longer guarantees that everything sensitive for early startup lives below 4G, we would have no choice but to extend this range.

The latest revision of the patch initializes all vm_pages except for those that are initially added to the vm_phys allocator. Any pages sensitive to early startup will thus be initialized.

I am reluctant to initialize everything below 4GB: doing so eliminates the improvement for small VMs, and I cannot see any (more) concrete bugs that would be fixed as a result.

In D40403#920211, @kib wrote:

Suppose that a kernel dump happens very early, before pages are initialized. Then, e.g., kernel memory needs the backing pages to be functional for some variations of busdma to work. I know that at the very least DMAR busdma requires it.

"kernel dump" meaning minidump? I think it is not a problem, since the minidump code always excludes free pages (they are not set in the vm_page_dump bitset), and non-free pages are always initialized.

I am not sure I follow. dump_add_page() is only needed for pages that are not mapped into the KVA (excluding the DMAP). All active mappings except the DMAP are explicitly dumped regardless of the state of the dump bitmap. How would the kernel data be dumped otherwise?

So I think a lot of pages are not initialized until later, but they are potentially dumpable.


In D40403#920492, @kib wrote:
In D40403#920211, @kib wrote:

Suppose that a kernel dump happens very early, before pages are initialized. Then, e.g., kernel memory needs the backing pages to be functional for some variations of busdma to work. I know that at the very least DMAR busdma requires it.

"kernel dump" meaning minidump? I think it is not a problem, since the minidump code always excludes free pages (they are not set in the vm_page_dump bitset), and non-free pages are always initialized.

I am not sure I follow. dump_add_page() is only needed for pages that are not mapped into the KVA (excluding the DMAP). All active mappings except the DMAP are explicitly dumped regardless of the state of the dump bitmap. How would the kernel data be dumped otherwise?

So I think a lot of pages are not initialized until later, but they are potentially dumpable.

Only free (in the sense of belonging to the vm_phys buddy allocator) pages are potentially uninitialized. But such pages would not be mapped into KVA except via the DMAP, so I do not see the problem.

sys/vm/vm_page.c
786

Could you please make this runtime-selectable? I.e., add a loader tunable that forces eager vm_page_array initialization?

markj marked an inline comment as done.

Add a chicken switch: if debug.vm.lazy_page_init=0, then populate
the physical memory allocator as we did before.
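A userspace sketch of how such a chicken switch might gate the two paths (in the kernel this would be a loader tunable fetched with TUNABLE_INT; the environment variable here is purely an illustrative stand-in to make the sketch runnable):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for the debug.vm.lazy_page_init knob; defaults to lazy. */
static int vm_lazy_page_init = 1;

/* The kernel would fetch the tunable at boot (TUNABLE_INT); here we
 * read a hypothetical environment variable instead. */
static void
read_tunable(void)
{
	const char *v = getenv("LAZY_PAGE_INIT");

	if (v != NULL)
		vm_lazy_page_init = atoi(v);
}

/* Which path vm_page_startup() would take under this knob. */
static const char *
startup_mode(void)
{
	return (vm_lazy_page_init ? "lazy" : "eager");
}
```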

This revision is now accepted and ready to land.Jun 8 2023, 4:08 PM

Rebase. Remove vm_phys_seg(); it is no longer needed.

This revision now requires review to proceed.Jun 19 2023, 7:06 PM

@markj Is this waiting for anything else before it lands?

@markj Is this waiting for anything else before it lands?

I am not sure it's a good idea to land the patch right before stable/14 is created. I've done a fair bit of testing locally, but there's lots of potential here for bugs/races which are specific to a certain platform. Since such bugs might take a while to shake out, I'd prefer to wait until after the branch is created.

sys/vm/vm_phys.c
909–927

Do you have any opinion on the vm_default_freepool indirection I added? There is still a small run-time cost of course, but this way we also avoid accessing queue heads that will always be empty after deferred initialization is finished.

sys/vm/vm_phys.c
1464

I'm trying to understand the benefit of passing pa, instead of m, to this function. Passing pa means that we have to repeat the binary search (via vm_phys_paddr_to_seg) that we already did in vm_page_blacklist_addr (via vm_phys_paddr_to_vm_page). So what is the benefit that offsets the cost of duplicating the search for the right segment?

What follows may be a series of stupid questions. If so, feel free to ignore them.

Some of the work of vm_page_init_page is setting fields to non-zero constants. That those constants are non-zero is the result of choices elsewhere in the code - where it was decided that PQ_NONE would be 255, for example. To what extent would changes that made more of these constants zero speed up initialization? I'm told that you cannot expect the fields of the vm_page struct to be zero on the call to vm_page_init_page. If so, would bzeroing the whole struct before setting the nonzero fields help performance at all, especially if there were fewer nonzero fields? Or bzeroing all the page structs before any calls to vm_page_init_page? And then setting only the nonzero fields.