
vm: improve kstack_object pindex calculation scheme to avoid pindex holes
ClosedPublic

Authored by bnovkov on Mar 2 2023, 10:00 AM.

Details

Summary

This patch replaces the linear transformation of kernel virtual addresses to kstack_object pindex values with a non-linear scheme that circumvents physical memory fragmentation caused by kernel stack guard pages. The new mapping scheme is used to effectively "skip" guard pages and assign pindices for non-guard pages in a contiguous fashion.

This scheme requires that all default-sized kstack KVA allocations come from a separate, specially aligned region of the KVA space. For this to work, the patch also introduces a new, dedicated kstack KVA arena used to allocate kernel stacks of default size. Aside from fulfilling the requirements imposed by the new scheme, a separate kstack KVA arena has additional performance benefits (keeping guard pages in a single KVA region facilitates superpage promotion in the rest of the KVA space) and security benefits (packing kstacks together results in most kstacks having guard pages at both ends).

The patch currently enables this for all platforms, but I'm not sure whether this change would have a negative impact on fragmentation in the kernel's virtual address space when used on 32-bit machines. I'd be happy to change the patch so that this code is only used by 64-bit systems if need be.

Test Plan

The changes were tested on a VM using multiple buildworlds and several stressor classes from stress-ng, with no errors encountered so far.

Update 22.09.2023:
I ran the regression suite with these changes and compared the results with a clean run - no crashes or additional test failures were triggered by this patch.
I'd be happy to upload the result databases if need be.

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

sys/vm/vm_glue.c
588

Maybe I do not completely understand the calculations, but it sounds to me like the formulas depend on KSTACK_GUARD_PAGES == 1. In particular, both this KASSERT and the (required?) property that the returned non-linear pindex is unique for each linear pindex.

Could be that some elaborated explanation would help.

591
In D38852#956753, @kib wrote:

So what is the main purpose of this change? To get linear, without gaps, pindexes for the kernel stack obj?

The physical pages for the kstack object come from reservations. Today, those reservations will never be fully populated because of the gaps. Ultimately, those reservations will get broken, and we'll recover the unused pages. However, because of the gaps, we are allocating 25% more contiguous physical memory than really necessary, assuming a 16KB stack and 4KB guard.

From Ryzen 3000 onward, AMD's MMU automatically coalesces four contiguous 4KB virtual pages, aligned as a 16KB block and mapping four similarly contiguous and aligned physical pages, into a single 16KB TLB entry. With this change, the physical contiguity and alignment are guaranteed. Could we dynamically adjust the guard size at runtime to achieve the required virtual alignment?

One other open issue that we didn't discuss much yet is what to do on 32-bit systems. Since the kernel virtual address space is much more limited there, it perhaps (probably?) doesn't make much sense to change the existing scheme. This is especially true if we start trying to align virtual mappings of stacks as Alan suggests.

In D38852#956753, @kib wrote:

So what is the main purpose of this change? To get linear, without gaps, pindexes for the kernel stack obj?

@alc already summed up the core idea behind this change; there is nothing to add aside from the performance and security benefits described in the summary.

In D38852#956774, @alc wrote:

From Ryzen 3000 onward, AMD's MMU automatically coalesces four contiguous 4KB virtual pages, aligned as a 16KB block and mapping four similarly contiguous and aligned physical pages, into a single 16KB TLB entry. With this change, the physical contiguity and alignment are guaranteed. Could we dynamically adjust the guard size at runtime to achieve the required virtual alignment?

So, as @kib pointed out in the comments, the implicit assumption here is that there is only one guard page. However, I went back to the drawing board and adjusted the expression to work with an arbitrary number of guard pages.
I've attached a screenshot of the rendered formula since Phabricator does not support inline LaTeX.

kstack_pindex_formula.png (84×1 px, 27 KB)

A guard-page KVA is one whose linear page index, taken modulo (kstack_pages + KSTACK_GUARD_PAGES), falls inside the range [kstack_pages, ..., kstack_pages + KSTACK_GUARD_PAGES - 1].
I'll work this new expression into the patch and address @kib's comments within a couple of days.

As for the runtime tuning, I think it's safe to change the number of guard pages before the kstack KVA arenas are initialized, since their size and import quantum depend on that parameter.

I've updated the diff to work with an arbitrary number of guard pages but I didn't have time to test this with different guard page configurations - I'll do this in the coming days and report back.
The formula has been slightly reworked to account for the direction of stack growth (+ 1 to the result of lin_pindex(kva) / kstack_size, and a different condition for detecting guard pages).

Quick update: I've tested the patch with different numbers of kstack guard pages (1-4) and encountered no issues.
Each run consisted of running the test suite and a rebuild of the whole kernel.

sys/vm/vm_glue.c
323

Deindent.

324

This can be on the previous line.

326

Deindent by 4 positions.

382

Deindent by 4 positions.

409

Otherwise, this variable is reported as unused within a non-INVARIANTS kernel.

413

Deindent by 4 positions.

416

Deindent by 4 positions.

426

Deindent by 4 positions.

710

Deindent by 4 positions.

sys/vm/vm_glue.c
575

I think "identity" is more descriptive.

One other open issue that we didn't discuss much yet is what to do on 32-bit systems. Since the kernel virtual address space is much more limited there, it perhaps (probably?) doesn't make much sense to change the existing scheme. This is especially true if we start trying to align virtual mappings of stacks as Alan suggests.

Given the recent decision to deprecate and potentially remove support for 32-bit systems with 15.0, do you think that this is still an issue we should address?

sys/vm/vm_glue.c
575

Agreed.

In D38852#975886, @bojan.novkovic_fer.hr wrote:

One other open issue that we didn't discuss much yet is what to do on 32-bit systems. Since the kernel virtual address space is much more limited there, it perhaps (probably?) doesn't make much sense to change the existing scheme. This is especially true if we start trying to align virtual mappings of stacks as Alan suggests.

Given the recent decision to deprecate and potentially remove support for 32-bit systems with 15.0, do you think that this is still an issue we should address?

Yes - the actual removal of 32-bit kernels isn't going to happen anytime soon. 15.0 is two years out, and it's not entirely clear to me that we're ready to remove 32-bit ARM support.

But, can't we handle this simply by making vm_kstack_pindex() use the old KVA<->pindex mapping scheme if _ILP32 is defined?

sys/vm/vm_glue.c
584

Comment style puts /* and */ on their own lines. There are a few instances of this in the patch.

bnovkov marked an inline comment as done.

Address @markj 's comments:

  • multiline comment style fixes
  • use old mapping scheme on 32-bit systems
In D38852#975886, @bojan.novkovic_fer.hr wrote:

One other open issue that we didn't discuss much yet is what to do on 32-bit systems. Since the kernel virtual address space is much more limited there, it perhaps (probably?) doesn't make much sense to change the existing scheme. This is especially true if we start trying to align virtual mappings of stacks as Alan suggests.

Given the recent decision to deprecate and potentially remove support for 32-bit systems with 15.0, do you think that this is still an issue we should address?

Yes - the actual removal of 32-bit kernels isn't going to happen anytime soon. 15.0 is two years out, and it's not entirely clear to me that we're ready to remove 32-bit ARM support.

But, can't we handle this simply by making vm_kstack_pindex() use the old KVA<->pindex mapping scheme if _ILP32 is defined?

Yes, that and making sure that the KVAs are allocated from kernel_arena. I've updated the patch to fall back to the old mapping scheme when __ILP32__ is defined.

dougm added inline comments.
sys/vm/vm_glue.c
388

What 'npages' means here, and in the next function, conflicts with what npages means in vm_kstack_pindex, and so creates confusion for someone looking at the code for the first time. Can you pick a different name here (kpages?), and then use that name in vm_kstack_pindex for the same value?

sys/vm/vm_glue.c
402

Simplify to
*addrp = roundup(*addrp, npages * PAGE_SIZE);

I'd really like to see this committed.

sys/vm/vm_glue.c
330

This isn't needed. kva_alloc() does round_page() on the given size. (And, in fact, that is also redundant because vmem_alloc() rounds to the defined quantum size for the arena.)

332

Style fix

501

Deindent by 4 positions.

sys/vm/vm_glue.c
296–297
321

This thing would sleep based on the page shortage in the domain. How is it related to the need to allocate from kstack_arena?

Typically an arena is smaller than its domain, am I right? Then this loop becomes a busy-loop if the kstack arena is exhausted but domain pages are still available.

597

Can you please provide an explanation for the formula? In particular, why does it provide contiguous mapping for pindex.

Rebase and address comments by @alc , @kib , and @dougm.

bnovkov marked 6 inline comments as done. Edited Mar 15 2024, 8:14 PM

Apologies for the delayed response.
@markj and I will go through the patch once more during next week, so it should get committed soon.

sys/vm/vm_glue.c
321

You're right, I've changed the code to check for ENOMEM returned by vmem_alloc and bail if so.

388

I agree, thank you for pointing this out! Fixed.

597

So, this is essentially the same linear mapping function but with a non-contiguous domain.
The panic here basically checks if we are attempting to calculate the function using numbers that are not part of the domain.

I've left a plot of the function in our previous discussion if that helps: link to the plot

sys/vm/vm_glue.c
321

I again do not understand this. Wouldn't that result in spurious failures to allocate kstack KVAs?

597

Thanks

sys/vm/vm_glue.c
321

I agree, I think this use of the iterator doesn't quite make sense. Can't we just allocate the KVA with vmem_alloc(M_WAITOK)?

Note that vm_thread_free_kstack_kva() assumes that the KVA belongs to the arena corresponding to the backing pages, so if the iterator here selects a different domain, then later we'll end up freeing the KVA to the wrong arena.

bnovkov marked an inline comment as done.

Address @kib 's comments.

sys/vm/vm_glue.c
321

Oh, sorry, I completely missed your point. I see what you meant now, it really didn't make sense.
I've rearranged it so we fetch the kstack KVA from the curcpu domain.

sys/vm/vm_glue.c
313

I believe M_WAITOK is a regression compared with the current code. If we currently cannot allocate the new thread's kstack, we return ENOMEM or similar. With this M_WAITOK allocation, we hang until some other thread exits.

Note that ILP32 behavior is to return failure if no KVA can be allocated.

397

I suspect the comment is reversed: don't we release KVA from kstack arena into the parent?

507

Can we directly store the domain of the kstack KVA in the struct thread? Then, if a domain's arena is exhausted, we can allocate from another domain's arena and still correctly dispose of the thread.

Update patch to track and properly release kstacks to domain arenas, address @kib 's comments.

Sorry for the delay, @markj reported a panic when booting this patch on a NUMA machine and it took me a while to set up a NUMA environment and properly fix the issue.

I ran the regression suite in a NUMA VM with the latest patch twice and encountered no panics or issues.

sys/vm/vm_glue.c
507

This is what I ended up on as well; the domain is now stored in struct thread.

If I boot the latest patch on my crash box, I get the following panic, which almost certainly means that we freed KVA to the wrong vmem arena:

ZFS filesystem version: 5
ZFS storage pool version: features support (5000)
panic: Assertion bt != NULL failed at /usr/home/markj/src/freebsd/sys/kern/subr_vmem.c:1487
cpuid = 0
time = 1712147201
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0133e67330
vpanic() at vpanic+0x135/frame 0xfffffe0133e67460
panic() at panic+0x43/frame 0xfffffe0133e674c0
vmem_xfree() at vmem_xfree+0x130/frame 0xfffffe0133e674f0
kstack_release() at kstack_release+0x2f/frame 0xfffffe0133e67520
bucket_drain() at bucket_drain+0x11c/frame 0xfffffe0133e67560
zone_put_bucket() at zone_put_bucket+0x1cc/frame 0xfffffe0133e675b0
cache_free() at cache_free+0x2da/frame 0xfffffe0133e67600
uma_zfree_arg() at uma_zfree_arg+0x22c/frame 0xfffffe0133e67650
thread_reap_domain() at thread_reap_domain+0x1e5/frame 0xfffffe0133e67700
thread_count_inc() at thread_count_inc+0x24/frame 0xfffffe0133e67730
thread_alloc() at thread_alloc+0x13/frame 0xfffffe0133e67760
kthread_add() at kthread_add+0xb7/frame 0xfffffe0133e67830
_taskqueue_start_threads() at _taskqueue_start_threads+0x148/frame 0xfffffe0133e678d0
taskqueue_start_threads_in_proc() at taskqueue_start_threads_in_proc+0x42/frame 0xfffffe0133e67940
taskq_create_impl() at taskq_create_impl+0xca/frame 0xfffffe0133e67980
spa_activate() at spa_activate+0x5da/frame 0xfffffe0133e67a20
spa_import() at spa_import+0x119/frame 0xfffffe0133e67ac0
zfs_ioc_pool_import() at zfs_ioc_pool_import+0xb0/frame 0xfffffe0133e67b00
zfsdev_ioctl_common() at zfsdev_ioctl_common+0x59c/frame 0xfffffe0133e67bc0
zfsdev_ioctl() at zfsdev_ioctl+0x12a/frame 0xfffffe0133e67bf0
devfs_ioctl() at devfs_ioctl+0xd2/frame 0xfffffe0133e67c40
vn_ioctl() at vn_ioctl+0xc2/frame 0xfffffe0133e67cb0
devfs_ioctl_f() at devfs_ioctl_f+0x1e/frame 0xfffffe0133e67cd0
kern_ioctl() at kern_ioctl+0x286/frame 0xfffffe0133e67d30
sys_ioctl() at sys_ioctl+0x143/frame 0xfffffe0133e67e00
amd64_syscall() at amd64_syscall+0x153/frame 0xfffffe0133e67f30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe0133e67f30
--- syscall (54, FreeBSD ELF64, ioctl), rip = 0x1c4e7e41974a, rsp = 0x1c4e7297d8d8, rbp = 0x1c4e7297d950 ---
sys/vm/vm_glue.c
434

Extra newline.

447

Extra newline.

460–478

... and just return 0 here.

476

If you return ks here, then you don't need to be careful about setting ks = 0 above.

564

MAXMEMDOM is probably a suitable sentinel value here.

bnovkov marked an inline comment as done.

Address @markj 's comments and fix a couple of issues:

  • Certain kstack KVA chunks were released back to the parent arena with improper alignment, causing vmem_xfree to panic
  • Swapping in thread kstacks triggered a panic, as the former code relied on vm_page_grab_pages

I ran the vm-specific stress2 tests with these changes for a couple of hours without any issues.

sys/vm/vm_glue.c
463

Am I right that you allocate backing physical pages even for the guard page? If yes, why? This seems to be a waste of the physical memory.

sys/vm/vm_glue.c
463

No, that would defeat the purpose of this whole patch.
But I think I see what you mean, this whole process is a bit convoluted.

The pages argument that is passed to `vm_thread_stack_back` usually has the same value as kstack_pages, so we always allocate that number of pages and never take the guard pages into account.
If we tried to back a guard page with a physical page and insert it, the assertions in vm_kstack_pindex would catch that and panic.

sys/vm/vm_glue.c
463

Let me reformulate my question. Why the pmap_qremove() call at line 469 is needed?

sys/vm/vm_glue.c
463

That call is there to make sure that any mappings that might've been present at the guard page KVA are removed.
This is how the old code dealt with creating guard pages as well. I assume this is used to deal with any mappings that weren't cleaned up properly.

Just so we're all on the same page, I want to point out the following: While this patch achieves contiguity, it doesn't guarantee 2 MB alignment. Let 'F' represent a fully populated 2 MB reservation, 'E' represent a partially populated reservation where the population begins in the middle and runs to the end, and 'B' be the complement of 'E', where the population begins at the start and ends in the middle. Typically, the physical memory allocation for one chunk of stacks on amd64 looks like 'EFFFB'. While it would be nice to achieve 'FFFF', this patch is already a great improvement over the current state of affairs.

In D38852#1018757, @alc wrote:

Just so we're all on the same page, I want to point out the following: While this patch achieves contiguity, it doesn't guarantee 2 MB alignment. Let 'F' represent a fully populated 2 MB reservation, 'E' represent a partially populated reservation where the population begins in the middle and runs to the end, and 'B' be the complement of 'E', where the population begins at the start and ends in the middle. Typically, the physical memory allocation for one chunk of stacks on amd64 looks like 'EFFFB'. While it would be nice to achieve 'FFFF', this patch is already a great improvement over the current state of affairs.

But is it possible at all (or perhaps better: is it worth it at all), since we do have the guard pages?

sys/vm/vm_swapout.c
568

I.e. deindent 568 to be +4 spaces. Put ; on its own line.

This revision is now accepted and ready to land. Apr 8 2024, 9:58 PM

I did some light testing on a NUMA system and a 32-bit (armv7) VM.

In D38852#1018877, @kib wrote:
In D38852#1018757, @alc wrote:

Just so we're all on the same page, I want to point out the following: While this patch achieves contiguity, it doesn't guarantee 2 MB alignment. Let 'F' represent a fully populated 2 MB reservation, 'E' represent a partially populated reservation where the population begins in the middle and runs to the end, and 'B' be the complement of 'E', where the population begins at the start and ends in the middle. Typically, the physical memory allocation for one chunk of stacks on amd64 looks like 'EFFFB'. While it would be nice to achieve 'FFFF', this patch is already a great improvement over the current state of affairs.

But is it possible at all (or perhaps better: is it worth it at all), since we do have the guard pages?

Yes, it is. Specifically, we could eliminate the initial partially populated reservation, i.e., it would fill, and also the final partially populated reservation, if this particular chunk of stacks is fully utilized. I will spell out how I think it could be done after this lands.

markj added inline comments.
sys/vm/vm_swapout.c
568

@jhb hit a bug here that I should have noticed during review, sorry.

Suppose we're on a NUMA system and trying to fault in a thread stack from a domain which has no free memory, i.e., we can't immediately reclaim it for some reason. Note that oom_alloc will generally be VM_ALLOC_NORMAL. It's possible that we end up looping here forever because we keep trying to allocate a page from a depleted domain.

Can we simply fall back to other NUMA domains if needed in order to make progress? That is, can we try to allocate pages using a DOMAINSET_PREF() policy rather than requiring that pages come from the original domain? That will require some fixups to vm_thread_stack_dispose(), but are there other considerations?

bnovkov added inline comments.
sys/vm/vm_swapout.c
568

We can, but it does require several changes in order to work.
More specifically, the kstack cache complicates things since we cannot store the kstack domain in a thread struct when preparing kstacks for the cache, so I've resorted to embedding the kstack KVA domain in the kstack address.
https://reviews.freebsd.org/D45163 should fix this issue.