
vm: improve kstack_object pindex calculation scheme to avoid pindex holes
ClosedPublic

Authored by bnovkov on Mar 2 2023, 10:00 AM.

Details

Summary

This patch replaces the linear transformation of kernel virtual addresses to kstack_object pindex values with a non-linear scheme that circumvents physical memory fragmentation caused by kernel stack guard pages. The new mapping scheme is used to effectively "skip" guard pages and assign pindices for non-guard pages in a contiguous fashion.

This scheme requires that all default-sized kstack KVA allocations come from a separate, specially aligned region of the KVA space. For this to work, the patch also introduces a new, dedicated kstack KVA arena used to allocate kernel stacks of default size. Aside from fulfilling the requirements imposed by the new scheme, a separate kstack KVA arena has additional performance benefits (keeping guard pages in a single KVA region facilitates superpage promotion in the rest of the KVA space) and security benefits (packing kstacks together results in most kstacks having guard pages at both ends).

The patch currently enables this for all platforms, but I'm not sure whether this change would have a negative impact on fragmentation in the kernel's virtual address space when used on 32-bit machines. I'd be happy to change the patch so that this code is only used by 64-bit systems if need be.

Test Plan

The changes were tested on a VM using multiple buildworlds and several stressor classes from stress-ng, with no errors encountered so far.

Update 22.09.2023:
I ran the regression suite with these changes and compared the results with a clean run - no crashes or additional test failures were triggered by this patch.
I'd be happy to upload the result databases if need be.

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

sys/vm/vm_glue.c
588

Maybe I do not completely understand the calculations, but it sounds to me like the formulas depend on KSTACK_GUARD_PAGES == 1. In particular, both this KASSERT and the (required?) property that the returned non-linear pindex is unique for each linear pindex.

Could be that some elaborated explanation would help.

591
In D38852#956753, @kib wrote:

So what is the main purpose of this change? To get linear, without gaps, pindexes for the kernel stack obj?

The physical pages for the kstack object come from reservations. Today, those reservations will never be fully populated because of the gaps. Ultimately, those reservations will get broken, and we'll recover the unused pages. However, because of the gaps, we are allocating 25% more contiguous physical memory than really necessary, assuming a 16KB stack and 4KB guard.

From Ryzen 3000 onward, AMD's MMU automatically coalesces four contiguous 4KB virtual pages, aligned as a 16KB block and mapping four similarly contiguous and aligned physical pages, into a single 16KB TLB entry. With this change, the physical contiguity and alignment are guaranteed. Could we dynamically adjust the guard size at runtime to achieve the required virtual alignment?

One other open issue that we didn't discuss much yet is what to do on 32-bit systems. Since the kernel virtual address space is much more limited there, it perhaps (probably?) doesn't make much sense to change the existing scheme. This is especially true if we start trying to align virtual mappings of stacks as Alan suggests.

In D38852#956753, @kib wrote:

So what is the main purpose of this change? To get linear, without gaps, pindexes for the kernel stack obj?

@alc already summed up the core idea behind this change; there is nothing to add aside from the performance and security benefits described in the summary.

In D38852#956774, @alc wrote:

From Ryzen 3000 onward, AMD's MMU automatically coalesces four contiguous 4KB virtual pages, aligned as a 16KB block and mapping four similarly contiguous and aligned physical pages, into a single 16KB TLB entry. With this change, the physical contiguity and alignment are guaranteed. Could we dynamically adjust the guard size at runtime to achieve the required virtual alignment?

So, as @kib pointed out in the comments, the implicit assumption here is that there is only one guard page. However, I went back to the drawing board and adjusted the expression to work with an arbitrary number of guard pages.
I've attached a screenshot of the rendered formula since Phabricator does not support inline LaTeX.

kstack_pindex_formula.png (84×1 px, 27 KB)

A guard-page KVA is one whose linear page index, taken modulo (kstack_pages + KSTACK_GUARD_PAGES), falls inside the range [kstack_pages, ..., kstack_pages + KSTACK_GUARD_PAGES - 1].
I'll work this new expression into the patch and address @kib's comments within a couple of days.

As for the runtime tuning, I think it's safe to change the number of guard pages before the kstack KVA arenas are initialized, since their size and import quantum depend on that parameter.

I've updated the diff to work with an arbitrary number of guard pages but I didn't have time to test this with different guard page configurations - I'll do this in the coming days and report back.
The formula has been slightly reworked to account for the direction of stack growth (+ 1 to the result of lin_pindex(kva) / kstack_size, and a different condition for detecting guard pages).

Quick update: I've tested the patch with different numbers of kstack guard pages (1-4) and encountered no issues.
Each run consisted of running the test suite and a rebuild of the whole kernel.

sys/vm/vm_glue.c
323

Deindent.

324

This can be on the previous line.

326

Deindent by 4 positions.

382

Deindent by 4 positions.

409

Otherwise, this variable is reported as unused within a non-INVARIANTS kernel.

413

Deindent by 4 positions.

416

Deindent by 4 positions.

426

Deindent by 4 positions.

710

Deindent by 4 positions.

sys/vm/vm_glue.c
575

I think "identity" is more descriptive.

One other open issue that we didn't discuss much yet is what to do on 32-bit systems. Since the kernel virtual address space is much more limited there, it perhaps (probably?) doesn't make much sense to change the existing scheme. This is especially true if we start trying to align virtual mappings of stacks as Alan suggests.

Given the recent decision to deprecate and potentially remove support for 32-bit systems with 15.0, do you think that this is still an issue we should address?

sys/vm/vm_glue.c
575

Agreed.

In D38852#975886, @bojan.novkovic_fer.hr wrote:

One other open issue that we didn't discuss much yet is what to do on 32-bit systems. Since the kernel virtual address space is much more limited there, it perhaps (probably?) doesn't make much sense to change the existing scheme. This is especially true if we start trying to align virtual mappings of stacks as Alan suggests.

Given the recent decision to deprecate and potentially remove support for 32-bit systems with 15.0, do you think that this is still an issue we should address?

Yes - the actual removal of 32-bit kernels isn't going to happen anytime soon. 15.0 is two years out, and it's not entirely clear to me that we're ready to remove 32-bit ARM support.

But, can't we handle this simply by making vm_kstack_pindex() use the old KVA<->pindex mapping scheme if _ILP32 is defined?

sys/vm/vm_glue.c
584

Comment style puts /* and */ on their own lines. There are a few instances of this in the patch.

bnovkov marked an inline comment as done.

Address @markj 's comments:

  • multiline comment style fixes
  • use old mapping scheme on 32-bit systems
In D38852#975886, @bojan.novkovic_fer.hr wrote:

One other open issue that we didn't discuss much yet is what to do on 32-bit systems. Since the kernel virtual address space is much more limited there, it perhaps (probably?) doesn't make much sense to change the existing scheme. This is especially true if we start trying to align virtual mappings of stacks as Alan suggests.

Given the recent decision to deprecate and potentially remove support for 32-bit systems with 15.0, do you think that this is still an issue we should address?

Yes - the actual removal of 32-bit kernels isn't going to happen anytime soon. 15.0 is two years out, and it's not entirely clear to me that we're ready to remove 32-bit ARM support.

But, can't we handle this simply by making vm_kstack_pindex() use the old KVA<->pindex mapping scheme if _ILP32 is defined?

Yes, that and making sure that the KVAs are allocated from kernel_arena. I've updated the patch to fall back to the old mapping scheme when __ILP32__ is defined.

dougm added inline comments.
sys/vm/vm_glue.c
388

What 'npages' means here, and in the next function, conflicts with what npages means in vm_kstack_pindex, and so creates confusion for someone looking at the code for the first time. Can you pick a different name here (kpages?), and then use that name in vm_kstack_pindex for the same value?

sys/vm/vm_glue.c
402

Simplify to
*addrp = roundup(*addrp, npages * PAGE_SIZE);

I'd really like to see this committed.

sys/vm/vm_glue.c
330

This isn't needed. kva_alloc() does round_page() on the given size. (And, in fact, that is also redundant because vmem_alloc() rounds to the defined quantum size for the arena.)

332

Style fix

501

Deindent by 4 positions.

sys/vm/vm_glue.c
296–297
321

This thing would sleep based on the page shortage in the domain. How is it related to the need to allocate from kstack_arena?

Typically an arena is smaller than its domain, am I right? Then this loop becomes a busy-loop if the kstack arena is exhausted but domain pages are still available.

597

Can you please provide an explanation for the formula? In particular, why does it provide contiguous mapping for pindex.

Rebase and address comments by @alc , @kib , and @dougm.

bnovkov marked 6 inline comments as done. Edited Mar 15 2024, 8:14 PM

Apologies for the delayed response.
@markj and I will go through the patch once more during next week, so it should get committed soon.

sys/vm/vm_glue.c
321

You're right, I've changed the code to check for ENOMEM returned by vmem_alloc and bail if so.

388

I agree, thank you for pointing this out! Fixed.

597

So, this is essentially the same linear mapping function but with a non-contiguous domain.
The panic here basically checks if we are attempting to calculate the function using numbers that are not part of the domain.

I've left a plot of the function in our previous discussion if that helps: link to the plot

sys/vm/vm_glue.c
321

I again do not understand this. Wouldn't that result in spurious failures to allocate kstack KVAs?

597

Thanks

sys/vm/vm_glue.c
321

I agree, I think this use of the iterator doesn't quite make sense. Can't we just allocate the KVA with vmem_alloc(M_WAITOK)?

Note that vm_thread_free_kstack_kva() assumes that the KVA belongs to the arena corresponding to the backing pages, so if the iterator here selects a different domain, then later we'll end up freeing the KVA to the wrong arena.

bnovkov marked an inline comment as done.

Address @kib 's comments.

sys/vm/vm_glue.c
321

Oh, sorry, I completely missed your point. I see what you meant now, it really didn't make sense.
I've rearranged it so we fetch the kstack KVA from the curcpu domain.

sys/vm/vm_glue.c
313

I believe M_WAITOK is a regression compared with the current code. If we currently cannot allocate the new thread's kstack, we return ENOMEM or similar. With this M_WAITOK allocation, we hang until some other thread exits.

Note that ILP32 behavior is to return failure if no KVA can be allocated.

397

I suspect the comment is reversed: don't we release KVA from kstack arena into the parent?

507

Can we directly store the domain of the kstack KVA in the struct thread? Then, if a domain's arena is exhausted, we can allocate from another domain's arena and still correctly dispose of the thread.

Update patch to track and properly release kstacks to domain arenas, address @kib 's comments.

Sorry for the delay, @markj reported a panic when booting this patch on a NUMA machine and it took me a while to set up a NUMA environment and properly fix the issue.

I ran the regression suite in a NUMA VM with the latest patch twice and encountered no panics or issues.

sys/vm/vm_glue.c
507

This is what I ended up on as well; the domain is now stored in struct thread.

If I boot the latest patch on my crash box, I get the following panic, which almost certainly means that we freed KVA to the wrong vmem arena:

ZFS filesystem version: 5
ZFS storage pool version: features support (5000)
panic: Assertion bt != NULL failed at /usr/home/markj/src/freebsd/sys/kern/subr_vmem.c:1487
cpuid = 0
time = 1712147201
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0133e67330
vpanic() at vpanic+0x135/frame 0xfffffe0133e67460
panic() at panic+0x43/frame 0xfffffe0133e674c0
vmem_xfree() at vmem_xfree+0x130/frame 0xfffffe0133e674f0
kstack_release() at kstack_release+0x2f/frame 0xfffffe0133e67520
bucket_drain() at bucket_drain+0x11c/frame 0xfffffe0133e67560
zone_put_bucket() at zone_put_bucket+0x1cc/frame 0xfffffe0133e675b0
cache_free() at cache_free+0x2da/frame 0xfffffe0133e67600
uma_zfree_arg() at uma_zfree_arg+0x22c/frame 0xfffffe0133e67650
thread_reap_domain() at thread_reap_domain+0x1e5/frame 0xfffffe0133e67700
thread_count_inc() at thread_count_inc+0x24/frame 0xfffffe0133e67730
thread_alloc() at thread_alloc+0x13/frame 0xfffffe0133e67760
kthread_add() at kthread_add+0xb7/frame 0xfffffe0133e67830
_taskqueue_start_threads() at _taskqueue_start_threads+0x148/frame 0xfffffe0133e678d0
taskqueue_start_threads_in_proc() at taskqueue_start_threads_in_proc+0x42/frame 0xfffffe0133e67940
taskq_create_impl() at taskq_create_impl+0xca/frame 0xfffffe0133e67980
spa_activate() at spa_activate+0x5da/frame 0xfffffe0133e67a20
spa_import() at spa_import+0x119/frame 0xfffffe0133e67ac0
zfs_ioc_pool_import() at zfs_ioc_pool_import+0xb0/frame 0xfffffe0133e67b00
zfsdev_ioctl_common() at zfsdev_ioctl_common+0x59c/frame 0xfffffe0133e67bc0
zfsdev_ioctl() at zfsdev_ioctl+0x12a/frame 0xfffffe0133e67bf0
devfs_ioctl() at devfs_ioctl+0xd2/frame 0xfffffe0133e67c40
vn_ioctl() at vn_ioctl+0xc2/frame 0xfffffe0133e67cb0
devfs_ioctl_f() at devfs_ioctl_f+0x1e/frame 0xfffffe0133e67cd0
kern_ioctl() at kern_ioctl+0x286/frame 0xfffffe0133e67d30
sys_ioctl() at sys_ioctl+0x143/frame 0xfffffe0133e67e00
amd64_syscall() at amd64_syscall+0x153/frame 0xfffffe0133e67f30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe0133e67f30
--- syscall (54, FreeBSD ELF64, ioctl), rip = 0x1c4e7e41974a, rsp = 0x1c4e7297d8d8, rbp = 0x1c4e7297d950 ---
sys/vm/vm_glue.c
434

Extra newline.

447

Extra newline.

460–478

... and just return 0 here.

476

If you return ks here, then you don't need to be careful about setting ks = 0 above.

564

MAXMEMDOM is probably a suitable sentinel value here.

bnovkov marked an inline comment as done.

Address @markj 's comments and fix a couple of issues:

  • Certain kstack KVA chunks were released back to the parent arena with improper alignment, causing vmem_xfree to panic
  • Swapping in thread kstacks triggered a panic, as the former code relied on vm_page_grab_pages

I ran the vm-specific stress2 tests with these changes for a couple of hours without any issues.

sys/vm/vm_glue.c
463

Am I right that you allocate backing physical pages even for the guard page? If yes, why? This seems to be a waste of the physical memory.

sys/vm/vm_glue.c
463

No, that would defeat the purpose of this whole patch.
But I think I see what you mean, this whole process is a bit convoluted.

The pages argument that is passed to `vm_thread_stack_back` usually has the same value as kstack_pages, so we always allocate that number of pages and never take the guard pages into account.
If we tried to back a guard page with a physical page and insert it, the assertions in vm_kstack_pindex would catch that and panic.

sys/vm/vm_glue.c
463

Let me reformulate my question. Why the pmap_qremove() call at line 469 is needed?

sys/vm/vm_glue.c
463

That call is there to make sure that any mappings that might've been present at the guard page KVA are removed.
This is how the old code dealt with creating guard pages as well. I assume this is used to deal with any mappings that weren't cleaned up properly.

Just so we're all on the same page, I want to point out the following: While this patch achieves contiguity, it doesn't guarantee 2 MB alignment. Let 'F' represent a fully populated 2 MB reservation, 'E' represent a partially populated reservation where the population begins in the middle and runs to the end, and 'B' be the complement of 'E', where the population begins at the start and ends in the middle. Typically, the physical memory allocation for one chunk of stacks on amd64 looks like 'EFFFB'. While it would be nice to achieve 'FFFF', this patch is already a great improvement over the current state of affairs.

In D38852#1018757, @alc wrote:

Just so we're all on the same page, I want to point out the following: While this patch achieves contiguity, it doesn't guarantee 2 MB alignment. Let 'F' represent a fully populated 2 MB reservation, 'E' represent a partially populated reservation where the population begins in the middle and runs to the end, and 'B' be the complement of 'E', where the population begins at the start and ends in the middle. Typically, the physical memory allocation for one chunk of stacks on amd64 looks like 'EFFFB'. While it would be nice to achieve 'FFFF', this patch is already a great improvement over the current state of affairs.

But is it possible at all (or perhaps better: is it worth it at all), since we do have the guard pages?

sys/vm/vm_swapout.c
568

I.e. deindent 568 to be +4 spaces. Put ; on its own line.

This revision is now accepted and ready to land. Apr 8 2024, 9:58 PM

I did some light testing on a NUMA system and a 32-bit (armv7) VM.

In D38852#1018877, @kib wrote:
In D38852#1018757, @alc wrote:

Just so we're all on the same page, I want to point out the following: While this patch achieves contiguity, it doesn't guarantee 2 MB alignment. Let 'F' represent a fully populated 2 MB reservation, 'E' represent a partially populated reservation where the population begins in the middle and runs to the end, and 'B' be the complement of 'E', where the population begins at the start and ends in the middle. Typically, the physical memory allocation for one chunk of stacks on amd64 looks like 'EFFFB'. While it would be nice to achieve 'FFFF', this patch is already a great improvement over the current state of affairs.

But is it possible at all (or perhaps better: is it worth it at all), since we do have the guard pages?

Yes, it is. Specifically, we could eliminate the initial partially populated reservation, i.e., it would fill, and also the final partially populated reservation, if this particular chunk of stacks is fully utilized. I will spell out how I think it could be done after this lands.

markj added inline comments.
sys/vm/vm_swapout.c
568

@jhb hit a bug here that I should have noticed during review, sorry.

Suppose we're on a NUMA system and trying to fault in a thread stack from a domain which has no free memory, i.e., we can't immediately reclaim it for some reason. Note that oom_alloc will generally be VM_ALLOC_NORMAL. It's possible that we end up looping here forever because we keep trying to allocate a page from a depleted domain.

Can we simply fall back to other NUMA domains if needed in order to make progress? That is, can we try to allocate pages using a DOMAINSET_PREF() policy rather than requiring that pages come from the original domain? That will require some fixups to vm_thread_stack_dispose(), but are there other considerations?

bnovkov added inline comments.
sys/vm/vm_swapout.c
568

We can, but it does require several changes in order to work.
More specifically, the kstack cache complicates things since we cannot store the kstack domain in a thread struct when preparing kstacks for the cache, so I've resorted to embedding the kstack KVA domain in the kstack address.
https://reviews.freebsd.org/D45163 should fix this issue.