Page MenuHomeFreeBSD

vm_page: Implement lazy page initialization
Needs ReviewPublic

Authored by markj on Jun 2 2023, 7:36 PM.
Tags
None
Referenced Files
Unknown Object (File)
Mon, Apr 15, 11:52 PM
Unknown Object (File)
Wed, Apr 10, 9:09 PM
Unknown Object (File)
Feb 11 2024, 6:19 PM
Unknown Object (File)
Jan 25 2024, 6:09 AM
Unknown Object (File)
Jan 24 2024, 3:40 AM
Unknown Object (File)
Jan 23 2024, 9:59 PM
Unknown Object (File)
Jan 16 2024, 4:14 PM
Unknown Object (File)
Dec 31 2023, 5:36 PM

Details

Summary

FreeBSD's boot times have decreased to the point where vm_page array
initialization represents a significant fraction of the total boot time.
For example, when booting FreeBSD in Firecracker (a VMM designed to
support lightweight VMs) with 128MB and 1GB of RAM, vm_page
initialization consumes 9% (3ms) and 37% (21.5ms) of the kernel boot
time, respectively. This is generally relevant in cloud environments,
where one wants to be able to spin up VMs as quickly as possible.

This patch implements lazy initialization of (most) page structures,
following a suggestion from cperciva@. The idea is to introduce a new
free pool, VM_FREEPOOL_LAZYINIT, into which all vm_page structures are
initially placed. For this to work, we need only initialize the first
free page of each chunk placed into the buddy allocator. Then, early
page allocations draw from the lazy init pool and initialize vm_page
chunks (up to 16MB, 4096 pages) on demand. Once APs are started, an
idle-priority thread drains the lazy init pool in the background to
avoid introducing extra latency in the allocator. With this scheme,
almost all of the initialization work is moved out of the critical path.

A couple of vm_phys operations require the pool to be drained before
they can run: vm_phys_scan_contig() and vm_phys_unfree_page(). However,
these are rare operations. I believe that
vm_phys_find_freelist_contig() does not require any special treatment,
as it only ever accesses the first page in a power-of-2-sized free page
chunk, which is always initialized.

For now the new pool is only used on amd64 and arm64, since that's where
I can easily test and those platforms would get the most benefit.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped
Build Status
Buildable 55665
Build 52554: arc lint + arc unit

Event Timeline

markj requested review of this revision.Jun 2 2023, 7:36 PM

Consider e.g. kmem_bootstrap_free(). Isn't the code there depends on the vm_page structures being already initialized?

When I worked on the pcpu + MAXCPU patch, initially I tried to reuse kmem_bootstrap_free() in some form for excess pcpu pages. This did not worked because vm_page's where not initialized, and hacks I tried to init the pages were very ugly. I am not sure how much such instances we have in the tree.

I think I would be easier accepting the approach, if for amd64 you force initialization of all vm_page's structures for pa <= 4G. Not sure what does arm64 need there, if anything.

sys/vm/vm_phys.c
914–932

I'm a bit uneasy with this approach because adding a pool imposes a perpetual, albeit small, run-time cost. Consider this block of code. For as long as the machine is running whenever we need to transfer pages from the default pool to the direct pool, or whenever we are attempting to allocate a contiguous set of pages, like a superpage reservation, and it fails, we are going to have to iterate over more heads of empty lists.

In D40403#919776, @kib wrote:

Consider e.g. kmem_bootstrap_free(). Isn't the code there depends on the vm_page structures being already initialized?

Yes. I forgot about that function and didn't consider that anything would call vm_phys_free_pages() with an uninitialized page. I think this is not too hard to fix though.

When I worked on the pcpu + MAXCPU patch, initially I tried to reuse kmem_bootstrap_free() in some form for excess pcpu pages. This did not worked because vm_page's where not initialized, and hacks I tried to init the pages were very ugly. I am not sure how much such instances we have in the tree.

I think I would be easier accepting the approach, if for amd64 you force initialization of all vm_page's structures for pa <= 4G. Not sure what does arm64 need there, if anything.

Would you instead consider an approach where an INVARIANTS kernel can catch use-before-initialization of vm_pages? In principle this can already be done with KMSAN, but it would be better if any debug kernel can identify bugs, and we don't care as much about boot times in that case.

Initializing everything below 4GB would still hurt boot times for small VMs, and might not catch every case you're worried about.

sys/vm/vm_phys.c
914–932

My feeling was that operations which have to fall back to other freepools are already expensive: they have to touch each page in the returned chunk. But this doesn't apply to some of the examples you gave.

I can't see a way to fully eliminate that overhead. One mitigation which might help would be to make the starting index a variable instead of 0. Initially it would be equal to VM_FREEPOOL_LAZYINIT. Once deferred initialization has finished, it would be set to VM_FREEPOOL_DEFAULT, so we'd avoid touching the new, empty list heads.

Fix kmem_bootstrap_free(): in vm_page_startup(), initialize vm_pages that are
not covered by phys_avail[]. In other words, all pages that belong to a
vm_phys_seg but are not handed over to the vm_phys buddy allocator are
initialized as before.

When searching across free pools for free pages, start searching from
vm_default_freepool instead of 0. When deferred initialization is completed,
a search will no longer visit the queue heads for VM_FREEPOOL_LAZYINIT.

Suppose that a kernel dump happens very early, before pages are initialized. Then, e.g. kernel memory needs the backing pages to be functional for some variations of busdma to work. I know that at very least DMAR busdma requires it.

Another question, how would the lazy initialization interact with the efi_create_1t1_map() use of VM_PHYS_TO_PAGE()? It tries to detect uninitialized page and force-init it, but at least it sounds like a race.

I still think that initing everything below 4G is the best route, if going with the lazy initialization. When/if UEFI would no longer guarantee us that everything sensitive for early startup lives below 4G. we would have no choice but extend this range.

Kboot passes through the memory map from Linux. This map is no longer 1:1 nor necessary below 4GB. The latter I'm not sure I've seen. The former always is the case, unlike what loader.efi provides. And it's impossible to change since one can only call SetVirtualMapping once and Linux already called it by the time we get control.

In D40403#920211, @kib wrote:

Suppose that a kernel dump happens very early, before pages are initialized. Then, e.g. kernel memory needs the backing pages to be functional for some variations of busdma to work. I know that at very least DMAR busdma requires it.

"kernel dump" meaning minidump? I think it is not a problem, since the minidump code always excludes free pages (they are not set in the vm_page_dump bitset), and non-free pages are always initialized.

Another question, how would the lazy initialization interact with the efi_create_1t1_map() use of VM_PHYS_TO_PAGE()? It tries to detect uninitialized page and force-init it, but at least it sounds like a race.

Only pages added to the vm_phys buddy allocator can be uninitialized after vm_page_startup(). So efi_create_1t1_map() should be fine, or else it is reinitializing pages belonging to the buddy allocator, i.e., there is already a bug.

I still think that initing everything below 4G is the best route, if going with the lazy initialization. When/if UEFI would no longer guarantee us that everything sensitive for early startup lives below 4G. we would have no choice but extend this range.

The latest revision of the patch initializes all vm_pages except for those that are initially added to the vm_phys allocator. Any pages sensitive to early startup will thus be initialized.

I am reluctant to initialize everything below 4GB: doing so eliminates the improvement for small VMs, and I cannot see any (more) concrete bugs that would be fixed as a result.

In D40403#920211, @kib wrote:

Suppose that a kernel dump happens very early, before pages are initialized. Then, e.g. kernel memory needs the backing pages to be functional for some variations of busdma to work. I know that at very least DMAR busdma requires it.

"kernel dump" meaning minidump? I think it is not a problem, since the minidump code always excludes free pages (they are not set in the vm_page_dump bitset), and non-free pages are always initialized.

I am not sure I follow. dump_add_page() is only needed for the page that are not mapped into the KVA (excluding DMAP). All active mappings except DMAP are explicitly dumped regardless of the state of the dump bitmap. How would the kernel data be dumped otherwise?

So I think a lot of pages are not initialized until later, but they are potentially dumpable.

Another question, how would the lazy initialization interact with the efi_create_1t1_map() use of VM_PHYS_TO_PAGE()? It tries to detect uninitialized page and force-init it, but at least it sounds like a race.

Only pages added to the vm_phys buddy allocator can be uninitialized after vm_page_startup(). So efi_create_1t1_map() should be fine, or else it is reinitializing pages belonging to the buddy allocator, i.e., there is already a bug.

I still think that initing everything below 4G is the best route, if going with the lazy initialization. When/if UEFI would no longer guarantee us that everything sensitive for early startup lives below 4G. we would have no choice but extend this range.

The latest revision of the patch initializes all vm_pages except for those that are initially added to the vm_phys allocator. Any pages sensitive to early startup will thus be initialized.

I am reluctant to initialize everything below 4GB: doing so eliminates the improvement for small VMs, and I cannot see any (more) concrete bugs that would be fixed as a result.

In D40403#920492, @kib wrote:
In D40403#920211, @kib wrote:

Suppose that a kernel dump happens very early, before pages are initialized. Then, e.g. kernel memory needs the backing pages to be functional for some variations of busdma to work. I know that at very least DMAR busdma requires it.

"kernel dump" meaning minidump? I think it is not a problem, since the minidump code always excludes free pages (they are not set in the vm_page_dump bitset), and non-free pages are always initialized.

I am not sure I follow. dump_add_page() is only needed for the page that are not mapped into the KVA (excluding DMAP). All active mappings except DMAP are explicitly dumped regardless of the state of the dump bitmap. How would the kernel data be dumped otherwise?

So I think a lot of pages are not initialized until later, but they are potentially dumpable.

Only free (in the sense of belonging to the vm_phys buddy allocator) pages are potentially uninitialized. But such pages would not be mapped into KVA except via the DMAP, so I do not see the problem.

sys/vm/vm_page.c
788

Could you please make this runtime-selectable? I.e. add a knob that forces eager vm_page_array initialization by a loader tunable?

markj marked an inline comment as done.

Add a chicken switch: if debug.vm.lazy_page_init=0, then populate
the physical memory allocator as we did before.

This revision is now accepted and ready to land.Jun 8 2023, 4:08 PM

Rebase. Remove vm_phys_seg(), it is no longer needed.

This revision now requires review to proceed.Jun 19 2023, 7:06 PM

@markj Is this waiting for anything else before it lands?

@markj Is this waiting for anything else before it lands?

I am not sure it's a good idea to land the patch right before stable/14 is created. I've done a fair bit of testing locally but there's lots of potential here for bugs/races which are specific to a certain platform. Since such bugs might take a while to flesh out, I'd prefer to wait until after the branch is created.

sys/vm/vm_phys.c
914–932

Do you have any opinion on the vm_default_freepool indirection I added? There is still a small run-time cost of course, but this way we also avoid accessing queue heads that will always be empty after deferred initialization is finished.

sys/vm/vm_phys.c
1468

I'm trying to understand the benefit of passing pa, instead of m, to this function. Passing pa means that we have to repeat the binary search (via vm_phys_padddr_to_seg) that we already do in vm_page_blacklist_addr (via vm_phys_paddr_to_vm_page). So what's the benefit that offsets the cost of duplicating the search for the right segment?

What follows may be a series of stupid questions. If so, feel free to ignore them.

Some of the work of vm_page_init_page is setting fields to non-zero constants. That those constants are non-zero is the result of choices elsewhere in the code - where it was decided that PQ_NONE would be 255, for example. To what extent would changes that made more of these constants zero speed up initialization? I'm told that you cannot expect the fields of the vm_page struct to be zero on the call to vm_page_init_page. If so, would bzeroing the whole struct before setting the nonzero fields help performance at all, especially if there were fewer nonzero fields? Or bzeroing all the page structs before any calls to vm_page_init_page? And then setting only the nonzero fields.

@markj Is this waiting for anything else before it lands?

I am not sure it's a good idea to land the patch right before stable/14 is created. I've done a fair bit of testing locally but there's lots of potential here for bugs/races which are specific to a certain platform. Since such bugs might take a while to flesh out, I'd prefer to wait until after the branch is created.

Makes sense. We released 14.0 a while ago now; is it time to land this?

What follows may be a series of stupid questions. If so, feel free to ignore them.

Some of the work of vm_page_init_page is setting fields to non-zero constants. That those constants are non-zero is the result of choices elsewhere in the code - where it was decided that PQ_NONE would be 255, for example. To what extent would changes that made more of these constants zero speed up initialization? I'm told that you cannot expect the fields of the vm_page struct to be zero on the call to vm_page_init_page. If so, would bzeroing the whole struct before setting the nonzero fields help performance at all, especially if there were fewer nonzero fields? Or bzeroing all the page structs before any calls to vm_page_init_page? And then setting only the nonzero fields.

I'm sorry, I meant to reply to this. In fact, we at one point did zero the whole array before setting non-zero fields. This was changed in https://cgit.freebsd.org/src/commit/?id=f93f7cf199b6b07464760dbbbcae84edead6e1ee. At the time I experimented with a few different approaches, including bzeroing each structure before setting non-zero fields, but I couldn't find one that gave a significant benefit over the others.

sys/vm/vm_phys.c
1468

This change was done to avoid accessing potentially uninitialized fields of the input page structure. In other words, this change is required for correctness.

I would also argue that calls to vm_phys_unfree_addr() are uncommon and infrequent, so the extra cost is negligible. We could however eliminate it by returning a pointer to the page rather than a bool (and using a NULL return value to indicate that the page was not found). Then there would be no need for vm_page_blacklist_add() to look up the page.

sys/vm/vm_phys.c
1468

This isn't quite straightforward, since vm_page_blacklist_add() has to distinguish between the cases 1) the page doesn't exist, 2) the page exists but isn't in the free queues, 3) the page exists and is (was) in the free queues.

Though, only one caller actually cares about the return value of vm_page_blacklist_add(), so perhaps it could be modified to distinguish cases 1 and 2 on its own.

In any case, I'm still quite sure that the extra overhead here is negligible. It could be solved, but I'd rather do that in a follow-up patch.

@alc and @dougm, do you have any comments on the updated patch?

sys/vm/vm_page.c
778

I don't that you need the word "simply" here. :-)

sys/vm/vm_phys.c
914–932

I'm curious as to why you chose to put the init pool first rather than last (and instead turn VM_NFREEPOOL into a variable)? I'm guessing that you were concerned about the loops that transfer chunks of pages. You wanted them to start with the init pool.

1491

This can be done in constant time, avoiding iteration on arm64.

Introducing an inline vm_phys_seg_paddr_to_vm_page(seg, pa) probably makes sense. I only regret the length of the proposed function name, but it follows the convention here.

What happens if we increase VM_NFREEORDER, e.g., 262144, to support 1 GB allocations? I think that you might might want to have a different constant to cap the lazy init chunk size.

In D40403#994971, @alc wrote:

What happens if we increase VM_NFREEORDER, e.g., 262144, to support 1 GB allocations? I think that you might might want to have a different constant to cap the lazy init chunk size.

The intent being to limit the overhead of a memory allocation in the worst case? Yes, that seems like a reasonable thing to do.

As far as the implementation goes, my inclination is to change vm_phys_alloc_freelist_pages() and vm_phys_alloc_npages() to 1) initialize only, say, 16MB worth of pages, 2) return the excess pages back to the init free pool rather than to the target free pool.

sys/vm/vm_phys.c
914–932

That was exactly my intent. Whenever the selected free pool is empty, we should check the init pool as long as there's a chance that it's non-empty. I was trying to avoid adding conditional logic to this case.

1491

Fortunately, vm_phys_seg_paddr_to_vm_page(seg, pa) is slightly shorter than &seg->first_page[atop(pa - seg->start)].

I uploaded a separate patch to add this helper function.

Handle Alan's comments except for the one regarding a maximum chunk size.

In D40403#994971, @alc wrote:

What happens if we increase VM_NFREEORDER, e.g., 262144, to support 1 GB allocations? I think that you might might want to have a different constant to cap the lazy init chunk size.

Thinking about this a bit more: if we changed VM_NFREEORDER, any allocation which requires the chunk to be lazily initialized would otherwise have to visit each page in the chunk to set the pool field of each vm_page. Do we care about limiting that overhead as well? That is, should the "lazy init chunk size" really be a "pool init chunk size"?