
vm: improve kstack_object pindex calculation scheme to avoid pindex holes
Needs Review · Public

Authored by bnovkov on Mar 2 2023, 10:00 AM.
Tags
None
Details

Reviewers
markj
kib
alc
Summary

This patch replaces the linear transformation of kernel virtual addresses to kstack_object pindex values with a non-linear scheme that avoids the physical memory fragmentation caused by kernel stack guard pages. The new mapping effectively "skips" guard pages and assigns pindices to non-guard pages contiguously.

This scheme requires that all default-sized kstack KVA allocations come from a separate, specially aligned region of the KVA space. For this to work, the patch also introduces a new, dedicated kstack KVA arena used to allocate kernel stacks of default size. Aside from fulfilling the requirements imposed by the new scheme, a separate kstack KVA arena has additional performance benefits (keeping guard pages in a single KVA region facilitates superpage promotion in the rest of the KVA space) and security benefits (packing kstacks together results in most kstacks having guard pages at both ends).

The patch currently enables this for all platforms, but I'm not sure whether this change would have a negative impact on kernel virtual address space fragmentation on 32-bit machines. I'd be happy to change the patch so that this code is only used by 64-bit systems if need be.
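
For illustration (my own numbers, not from the diff): with kstack_pages = 4 and one guard page, each five-page KVA slot contains four stack pages and one guard page. Under the linear mapping, the guard page of every slot reserves a pindex that is never populated, so every fifth pindex in kstack_object is a hole. Under the new mapping, the four stack pages of slot i map to pindexes 4i through 4i + 3, so the pindex space stays hole-free and the reservations backing kstack_object can be fully populated.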

Test Plan

The changes were tested on a VM using multiple buildworlds and several stressor classes from stress-ng, with no errors encountered so far.

Update 22.09.2023:
I ran the regression suite with these changes and compared the results with a clean run - no crashes or additional test failures were triggered by this patch.
I'd be happy to upload the result databases if need be.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

There are a very large number of changes, so older changes are hidden.
sys/vm/vm_glue.c
582
sys/vm/vm_kern.c
874 ↗(On Diff #118170)

Is there any reason the kstack arena and these import functions can't all be defined with the rest of the kstack code in vm_glue.c? I think that would be a bit cleaner.

929 ↗(On Diff #118170)

I suspect that this is too much on 32-bit systems with a limited KVA space. In fact, I'm not sure that we want to import more than the bare minimum on such systems. That is, we might want to keep the existing scheme if __ILP32__ is defined.

sys/vm/vm_swapout.c
554

I think this code should be encapsulated with the rest of the kstack code in vm_glue.c, rather than exporting vm_kstack_pindex(). That could be done in a separate revision though.

bnovkov added inline comments.
sys/vm/vm_kern.c
874 ↗(On Diff #118170)

I've moved kva_{alloc, free}_kstack to vm_glue.c as there was no real need for them outside of vm_glue.c.
As for the arena import functions, they use KVA_QUANTUM, and moving them would require exposing it, which seems unnecessary.
The arena itself needs to be defined in subr_vmem.c because vmem.h only provides a forward declaration of struct vmem; the actual definition of struct vmem is in subr_vmem.c.

929 ↗(On Diff #118170)

I've adjusted the kstack_quantum calculation to reflect your comment.

sys/vm/vm_swapout.c
554

I agree, and I'll make these changes in a separate revision to keep things clean.

sys/vm/vm_glue.c
304

Let's keep the namespace consistent and use vm_thread as a prefix like others do below, e.g., this should really be named vm_thread_kva_alloc() or so.

577
585
sys/vm/vm_kern.c
874 ↗(On Diff #118170)

I don't think there's anything wrong with having a private definition of KVA_QUANTUM in vm_glue.c.

For the vmem arena, you can use vmem_create() instead of allocating the structure statically. kstack_cache_init() runs late enough that vmem_create() will work, I believe.

bnovkov marked an inline comment as done.

Move changes to vm_glue.c.

bnovkov added inline comments.
sys/vm/vm_glue.c
304

I've changed the names to be consistent with the naming scheme.

sys/vm/vm_kern.c
874 ↗(On Diff #118170)

Thank you for pointing out vmem_create, I completely missed it.

I've moved most of the changes to vm_glue.c.

I think the requirement that the kstack KVA base be a multiple of kstack_pages + KSTACK_GUARD_PAGES is somewhat challenging to handle. I wonder if we can drop that requirement with the following scheme for deriving a pindex.

Let N = kstack_pages, M = N + KSTACK_GUARD_PAGES. Assume that N < M < 2N. Given a KVA K, let v(K) = (K / PAGE_SIZE) / N.

Allocate a kstack of M pages with base address K. Then, for each page in the stack, v(Kn) = v(K) or v(Kn) = v(K) + 1. Define the pindex of Kn to be v(K) * N + (Kn - K) / PAGE_SIZE - (M - N).

Conceptually, consider covering the KVA space with runs of N pages. Then each kstack is covered by exactly two consecutive runs. Map each kstack to the first of the two runs, giving a unique run of N pages for each kstack, and map each non-guard page of the kstack to successive indices of each run. Then K doesn't need any particular alignment, and we still avoid creating holes in the kstack_object pindex space. Does that work, or am I missing something?

sys/vm/vm_glue.c
308
314

We're assuming here that atop(addr) % (kstack_pages + KSTACK_GUARD_PAGES) == 0, right? I think it's a bit of a layering violation to assume that vmem will always satisfy that assumption...

326
354

The assertion also looks wrong since size is a byte count and npages is a page count. atop(size) % npages == 0 might be more natural.

355

The function name in the assertion message is wrong. IMO it's better to use %s and __func__.

365
368
405

This return statement can be removed.

473

vm_thread_stack_back() uses vm_kstack_pindex(), which assumes that pages == kstack_pages, but that's not true here. Or am I missing something?

I think we at least need some way to differentiate between kstack_pages-sized kstacks and others in the pindex space. Otherwise, isn't it possible for them to collide? To handle this we could, for example, use a separate VM object for non-default kstack sizes.

682

Same comments as above w.r.t. excessive parentheses.

Does that work, or am I missing something?

The problem with this is that it still leaves holes in the pindex space, just fewer, larger, holes.

bnovkov marked 2 inline comments as done.

Address @markj 's comments; add an additional kstack object for non-standard kstack sizes.

Does that work, or am I missing something?

The problem with this is that it still leaves holes in the pindex space, just fewer, larger, holes.

I went through the scheme you proposed, did a few dry run calculations, and reached the same conclusion. I think the problem lies in the fact that this is still a linear transformation, but I'll think about it some more to see if I accidentally missed something.
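
As a concrete check (my own arithmetic, assuming the guard page sits at the bottom of each stack): take N = 4, M = 5, and stacks packed back to back at page offsets 0, 5, 10, 15, 20. The proposed pindex v(K) * N + (Kn - K) / PAGE_SIZE - (M - N) maps the first four stacks to pindexes 0-3, 4-7, 8-11 and 12-15, but the fifth stack starts at page 20, where v(K) = 20 / 4 = 5, so its pages map to 20-23 and pindexes 16-19 are never used. In this example an N-page hole opens up every N stacks, which is the "fewer, larger holes" behaviour described above.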

I think the requirement that the kstack KVA base be a multiple of kstack_pages + KSTACK_GUARD_PAGES is somewhat challenging to handle.

Do you have any specific problems in mind?
The only thing that comes to mind is that importing of non-power of 2 sized KVA chunks into the kstack_arena (especially with the added padding) might exacerbate KVA space fragmentation.
However, this can be remedied by importing a power-of-2 sized chunk, and trimming it to fit the aforementioned kstack size requirement. The tradeoff in this case is that we are no longer able to pack the kstacks in a single superpage reservation.

sys/vm/vm_glue.c
314

Yes, we rely on that assumption here due to the adjustment in the kstack arena import function.
Unfortunately, yes - this scheme leaks across the vmem abstraction layer due to the alignment requirements. However, I think we can be completely sure that vmem will satisfy this assumption, since all imported chunks are properly aligned and the kstack_arena only allocates kstack-sized chunks.

354

Agreed, I've changed the assertion.

355

Thank you, I honestly don't know why I didn't use __func__ from the start.

473

It does, and this was an oversight on my part. I've changed vm_kstack_pindex() to return the linear pindex for the pages != kstack_pages case.

As for the other point, you are correct, a pindex collision may occur. To solve this I've added a new object (kstack_alt_object) as you proposed and adjusted all existing code that uses kstack_object directly to use the appropriate object depending on the kstack size.

I did some light testing. The effects of the patch are noticeable when I look at vm.pmap.kernel_maps output - kernel stacks are now more segregated than they were before. This is close to being ready, but I see one more issue to deal with, specific to NUMA systems.

The kstack allocator is NUMA-aware. It implements a first-touch policy: it tries to return a kstack allocated from the NUMA domain for the current CPU. The domain parameter to kstack_import() is converted to a "domainset", which is used to allocate physical pages.

So, what happens if vm_page_alloc_domainset() specifies a domain, but the pindex corresponds to an existing reservation from a different domain? It turns out that the page allocator will silently return pages from the reservation even though they belong to the "wrong" domain. (In fact, if the object is kernel_object, this results in a kernel panic.)

One way to handle this is to have a KVA arena per NUMA domain. This is how kmem_malloc() and friends in vm_kern.c are implemented. They use vm_domainset_iter_ to select a domain from the domainset, then use that domain to 1) select the arena for a KVA allocation, 2) select the source domain for physical pages. I think this approach could work for kstacks as well. I'd just create an array of MAXMEMDOM vmem arena pointers, then at boot time allocate an arena per NUMA domain. Each arena would import from the main kernel arena, just like the kstack arena does now. What do you think?
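
A minimal sketch of that arrangement might look like the following (the vmd_kstack_arena array, the init function, and the vm_thread_kstack_arena_import/release callbacks are illustrative names rather than code from the diff, and KVA_QUANTUM would need a private definition in vm_glue.c):

/* One kstack KVA arena per NUMA domain, created once at boot. */
static vmem_t *vmd_kstack_arena[MAXMEMDOM];

static void
vm_thread_kstack_arenas_init(void *arg __unused)
{
        int domain;

        for (domain = 0; domain < vm_ndomains; domain++) {
                /* Create the per-domain arena. */
                vmd_kstack_arena[domain] = vmem_create("kstack arena",
                    0, 0, PAGE_SIZE, 0, M_WAITOK);
                /* Each arena imports KVA chunks from the main kernel arena. */
                vmem_set_import(vmd_kstack_arena[domain],
                    vm_thread_kstack_arena_import,
                    vm_thread_kstack_arena_release,
                    kernel_arena, KVA_QUANTUM);
        }
}

This could run from kstack_cache_init(), which (as noted above) runs late enough for vmem_create() to work. A kstack allocation would then use the domainset iterator to pick a domain, allocate KVA from vmd_kstack_arena[domain], and request physical pages from that same domain.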

In D38852#888431, @bojan.novkovic_fer.hr wrote:

The only thing that comes to mind is that importing of non-power of 2 sized KVA chunks into the kstack_arena (especially with the added padding) might exacerbate KVA space fragmentation.
However, this can be remedied by importing a power-of-2 sized chunk, and trimming it to fit the aforementioned kstack size requirement. The tradeoff in this case is that we are no longer able to pack the kstacks in a single superpage reservation.

Can we simply import KVA_QUANTUM * (kstack_pages + KSTACK_GUARD_PAGES) KVA at a time? Am I misremembering, or did an earlier version of the patch do exactly that?

sys/vm/vm_glue.c
278

I think these objects can be defined static now.

314

Yeah, this is probably a reasonable assumption, but let's add an assertion to verify that.

607

Please keep an extra line between variable declarations and the rest of the function.

bnovkov marked 2 inline comments as done.

Address @markj 's comments - per-domain kstack arenas + KVA import size adjustment

One way to handle this is to have a KVA arena per NUMA domain. This is how kmem_malloc() and friends in vm_kern.c are implemented. They use vm_domainset_iter_ to select a domain from the domainset, then use that domain to 1) select the arena for a KVA allocation, 2) select the source domain for physical pages. I think this approach could work for kstacks as well. I'd just create an array of MAXMEMDOM vmem arena pointers, then at boot time allocate an arena per NUMA domain. Each arena would import from the main kernel arena, just like the kstack arena does now. What do you think?

I agree - adding per-domain kstack arenas is a good solution. I've implemented this; all kstack KVA allocations now go through the appropriate per-domain arena.
I also did some basic testing (buildworld) on a VM with a simulated NUMA configuration and encountered no errors so far.

In D38852#888431, @bojan.novkovic_fer.hr wrote:

The only thing that comes to mind is that importing of non-power of 2 sized KVA chunks into the kstack_arena (especially with the added padding) might exacerbate KVA space fragmentation.
However, this can be remedied by importing a power-of-2 sized chunk, and trimming it to fit the aforementioned kstack size requirement. The tradeoff in this case is that we are no longer able to pack the kstacks in a single superpage reservation.

Can we simply import KVA_QUANTUM * (kstack_pages + KSTACK_GUARD_PAGES) KVA at a time? Am I misremembering, or did an earlier version of the patch do exactly that?

Yes, we can. I've changed the import size and adjusted the size used by the kstack arena to make alignment adjustments possible.
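
For a sense of scale (my arithmetic, assuming amd64 defaults where KVA_QUANTUM is the 2 MB superpage size): with kstack_pages = 4 and one guard page, each import is 2 MB * 5 = 10 MB, i.e. 2560 pages, which is exactly 512 five-page stack slots. Once the base is rounded to the slot size, every imported chunk packs whole stacks with no leftover KVA.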

bnovkov edited the summary of this revision. (Show Details)
bnovkov edited the test plan for this revision. (Show Details)
bnovkov added reviewers: kib, alc.

So what is the main purpose of this change? To get linear, without gaps, pindexes for the kernel stack obj?

sys/vm/vm_glue.c
315

ptoa()

427

ptoa(npages), or () are excessive

459
498

Would be useful to assert that all pages are from the same domain.

583

Maybe I do not completely understand the calculations, but it sounds to me like the formulas depend on KSTACK_GUARD_PAGES == 1. In particular, both this KASSERT and the (required?) property that the returned non-linear pindex is unique for each linear pindex seem to rely on it.

A more elaborate explanation might help.

586
In D38852#956753, @kib wrote:

So what is the main purpose of this change? To get linear, without gaps, pindexes for the kernel stack obj?

The physical pages for the kstack object come from reservations. Today, those reservations will never be fully populated because of the gaps. Ultimately, those reservations will get broken, and we'll recover the unused pages. However, because of the gaps, we are allocating 25% more contiguous physical memory than really necessary, assuming a 16KB stack and 4KB guard.

From Ryzen 3000 onward, AMD's MMU automatically promotes four contiguous 16KB-aligned virtual pages mapping four contiguous 16KB-aligned physical pages, creating a single 16KB TLB entry. With this change, the physical contiguity and alignment is guaranteed. Could we dynamically adjust the guard size at runtime to achieve the required virtual alignment?

One other open issue that we didn't discuss much yet is what to do on 32-bit systems. Since the kernel virtual address space is much more limited there, it perhaps (probably?) doesn't make much sense to change the existing scheme. This is especially true if we start trying to align virtual mappings of stacks as Alan suggests.

In D38852#956753, @kib wrote:

So what is the main purpose of this change? To get linear, without gaps, pindexes for the kernel stack obj?

@alc already summed up the core idea behind this change; there is nothing to add aside from the performance and security benefits described in the summary.

In D38852#956774, @alc wrote:

From Ryzen 3000 onward, AMD's MMU automatically promotes four contiguous 16KB-aligned virtual pages mapping four contiguous 16KB-aligned physical pages, creating a single 16KB TLB entry. With this change, the physical contiguity and alignment is guaranteed. Could we dynamically adjust the guard size at runtime to achieve the required virtual alignment?

So, as @kib pointed out in the comments, the implicit assumption here is that there is only one guard page. However, I went back to the drawing board and adjusted the expression to work with an arbitrary number of guard pages.
I've attached a screenshot of the rendered formula since Phabricator does not support inline LaTeX.

kstack_pindex_formula.png (84×1 px, 27 KB)

A guard-page KVA is one whose page index modulo (kstack_pages + KSTACK_GUARD_PAGES) falls in the range [kstack_pages, ..., kstack_pages + KSTACK_GUARD_PAGES - 1].
I'll work this new expression into the patch and address @kib 's comments within a couple of days.

As for the runtime tuning, I think it's safe to change the number of guard pages before the kstack KVA arenas are initialized, since their size and import quantum depend on that parameter.

I've updated the diff to work with an arbitrary number of guard pages, but I didn't have time to test this with different guard page configurations - I'll do that in the coming days and report back.
The formula has been slightly reworked to account for the direction of stack growth (adding 1 to the result of lin_pindex(kva) / kstack_size and using a different condition for detecting guard pages).
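
A quick sanity check of the reworked expression (my arithmetic, assuming the non-linear pindex is lin_pindex(kva) - (lin_pindex(kva) / kstack_size + 1) * KSTACK_GUARD_PAGES with the guard pages at the low end of each slot): with kstack_pages = 4 and KSTACK_GUARD_PAGES = 2, kstack_size = 6, the first stack occupies linear pindexes 2-5 and maps to 0-3, while the second occupies 8-11 and maps to 4-7, so the result stays contiguous with more than one guard page.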

Quick update: I've tested the patch with different numbers of kstack guard pages (1-4) and encountered no issues.
Each run consisted of running the test suite and a rebuild of the whole kernel.

sys/vm/vm_glue.c
326

Deindent.

327

This can be on the previous line.

329

Deindent by 4 positions.

385

Deindent by 4 positions.

412

Otherwise, this variable is reported as unused within a non-INVARIANTS kernel.

416

Deindent by 4 positions.

419

Deindent by 4 positions.

429

Deindent by 4 positions.

696

Deindent by 4 positions.

sys/vm/vm_glue.c
570

I think "identity" is more descriptive.

One other open issue that we didn't discuss much yet is what to do on 32-bit systems. Since the kernel virtual address space is much more limited there, it perhaps (probably?) doesn't make much sense to change the existing scheme. This is especially true if we start trying to align virtual mappings of stacks as Alan suggests.

Given the recent decision to deprecate and potentially remove support for 32-bit systems with 15.0, do you think that this is still an issue we should address?

sys/vm/vm_glue.c
570

Agreed.

In D38852#975886, @bojan.novkovic_fer.hr wrote:

One other open issue that we didn't discuss much yet is what to do on 32-bit systems. Since the kernel virtual address space is much more limited there, it perhaps (probably?) doesn't make much sense to change the existing scheme. This is especially true if we start trying to align virtual mappings of stacks as Alan suggests.

Given the recent decision to deprecate and potentially remove support for 32-bit systems with 15.0, do you think that this is still an issue we should address?

Yes - the actual removal of 32-bit kernels isn't going to happen anytime soon. 15.0 is two years out, and it's not entirely clear to me that we're ready to remove 32-bit ARM support.

But, can't we handle this simply by making vm_kstack_pindex() use the old KVA<->pindex mapping scheme if __ILP32__ is defined?

sys/vm/vm_glue.c
579

Comment style puts /* and */ on their own lines. There are a few instances of this in the patch.

bnovkov marked an inline comment as done.

Address @markj 's comments:

  • multiline comment style fixes
  • use old mapping scheme on 32-bit systems

In D38852#975886, @bojan.novkovic_fer.hr wrote:

One other open issue that we didn't discuss much yet is what to do on 32-bit systems. Since the kernel virtual address space is much more limited there, it perhaps (probably?) doesn't make much sense to change the existing scheme. This is especially true if we start trying to align virtual mappings of stacks as Alan suggests.

Given the recent decision to deprecate and potentially remove support for 32-bit systems with 15.0, do you think that this is still an issue we should address?

Yes - the actual removal of 32-bit kernels isn't going to happen anytime soon. 15.0 is two years out, and it's not entirely clear to me that we're ready to remove 32-bit ARM support.

But, can't we handle this simply by making vm_kstack_pindex() use the old KVA<->pindex mapping scheme if __ILP32__ is defined?

Yes, that and making sure that the KVAs are allocated from kernel_arena. I've updated the patch to fall back to the old mapping scheme when __ILP32__ is defined.
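
Putting the pieces of the discussion together, the mapping might end up looking roughly like the sketch below (the exact expression and KASSERT text are my reconstruction from the comments above, not a copy of the diff; guard pages are assumed to sit at the low end of each kstack_pages + KSTACK_GUARD_PAGES slot):

static vm_pindex_t
vm_kstack_pindex(vm_offset_t ks, int kpages)
{
        vm_pindex_t pindex = atop(ks - VM_MIN_KERNEL_ADDRESS);

#ifdef __ILP32__
        /* 32-bit kernels keep the old linear KVA<->pindex mapping. */
        return (pindex);
#else
        /* Non-default stack sizes use the linear mapping (kstack_alt_object). */
        if (kpages != kstack_pages)
                return (pindex);
        KASSERT(pindex % (kpages + KSTACK_GUARD_PAGES) >= KSTACK_GUARD_PAGES,
            ("%s: pindex %ju points at a guard page", __func__,
            (uintmax_t)pindex));
        /* Skip the guard pages of every slot up to and including this one. */
        return (pindex -
            (pindex / (kpages + KSTACK_GUARD_PAGES) + 1) * KSTACK_GUARD_PAGES);
#endif
}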

dougm added inline comments.
sys/vm/vm_glue.c
391

What 'npages' means here, and in the next function, conflicts with what npages means in vm_kstack_pindex, and so creates confusion for someone looking at the code for the first time. Can you pick a different name here (kpages?), and then use that name in vm_kstack_pindex for the same value?

sys/vm/vm_glue.c
405

Simplify to
*addrp = roundup(*addrp, npages * PAGE_SIZE);