
Use a single kernel stack object.
ClosedPublic

Authored by markj on Apr 17 2020, 5:26 PM.

Details

Summary

Previously we allocated a separate VM object for each kernel stack. Now
that we cache fully constructed kernel stacks, I can't see a reason for
keeping it that way: kernel stack allocation and freeing will avoid locking
the object most of the time, and kernel stack swapout/swapin are done by
dedicated threads, so there is very little concurrent access.

Add a global kstack_object, and use the stack KVA address to index into
the object.
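
A minimal sketch of the idea, with illustrative names (vm_kstack_pindex is a hypothetical helper, and the review below refines the pindex base to be relative to VM_MIN_KERNEL_ADDRESS, matching kernel_object):

/* One global object backs every kernel stack. */
vm_object_t kstack_object;

/*
 * Sketch: a stack's KVA determines its page indices in the object,
 * so no per-stack object or extra bookkeeping is needed.
 */
static vm_pindex_t
vm_kstack_pindex(vm_offset_t ks)
{
	return (atop(ks));
}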

Diff Detail

Repository
rS FreeBSD src repository - subversion

Event Timeline

sys/vm/vm_glue.c
324 ↗(On Diff #70702)

If we allocate KVA from the system, isn't it safe to assume that it is already unmapped? Are we making that assumption elsewhere?

sys/vm/vm_swapout.c
574 ↗(On Diff #70702)

If we're swapping in a stack, I think we can assume that all of the pages are invalid.

sys/vm/vm_glue.c
324 ↗(On Diff #70702)

I think it might be some kind of leftover from previous code for caching.

Do we have a way to assert that the address is unmapped in KVA? I cannot think of one.

sys/vm/vm_swapout.c
574 ↗(On Diff #70702)

Swapout unwires the pages; everything else is up to the pagedaemon. If the pagedaemon has not reused a page, it can still be valid.

sys/vm/vm_glue.c
324 ↗(On Diff #70702)

Can't we just call pmap_kextract() on each page in the guard page range and assert that the return value is 0?
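
For concreteness, a sketch of that assertion (hypothetical code; kib's objection below explains why a zero return is not a reliable "unmapped" indicator):

/* Hypothetical check over the guard range; not portable, see below. */
for (i = 0; i < KSTACK_GUARD_PAGES; i++)
	KASSERT(pmap_kextract(ks + ptoa(i)) == 0,
	    ("kstack guard page %d at %#jx is mapped", i,
	    (uintmax_t)(ks + ptoa(i))));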

sys/vm/vm_swapout.c
574 ↗(On Diff #70702)

Thanks, somehow I missed that case.

  • Remove XXX comments.
  • Assert that guard pages are already unmapped.
kib added inline comments.
sys/vm/vm_glue.c
324 ↗(On Diff #70702)

That assumes that physical address 0 is invalid for general use. For instance, on amd64 that was not true until the L1TF mitigation was added several months ago.

This revision is now accepted and ready to land. Apr 20 2020, 2:43 PM
sys/vm/vm_glue.c
467 ↗(On Diff #70803)

This might as well be defined as OBJT_SWAP.

sys/vm/vm_glue.c
343 ↗(On Diff #70803)

Given the defined size of the object the second argument should be "atop(ks - VM_MIN_KERNEL_ADDRESS) + i"
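
Combining this with the OBJT_SWAP suggestion above, a rough sketch of the intended sizing and indexing (assumed shapes, not the exact diff):

/* Size the object to span all kernel VA, as kernel_object does. */
kstack_object = vm_object_allocate(OBJT_SWAP,
    atop(VM_MAX_KERNEL_ADDRESS - VM_MIN_KERNEL_ADDRESS));

/* ...and index pages relative to VM_MIN_KERNEL_ADDRESS, not to 0. */
m = vm_page_grab(kstack_object, atop(ks - VM_MIN_KERNEL_ADDRESS) + i,
    VM_ALLOC_NOBUSY | VM_ALLOC_WIRED);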

  • Make kstack_object a swap object.
  • Fix kstack_object indexing by computing indices relative to VM_MIN_KERNEL_ADDRESS instead of 0. This is consistent with kernel_object.
  • Restore the code that explicitly unmaps guard pages (sketched below). We cannot use pmap_kextract() for now: kib notes that 0 is technically a valid physical address (though some implementations return 0 when no mapping is found), and some implementations panic if the address is not mapped.
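
A sketch of the restored unmap, assuming the guard pages sit at the base of the allocated range:

ks = kva_alloc((kstack_pages + KSTACK_GUARD_PAGES) * PAGE_SIZE);
if (KSTACK_GUARD_PAGES != 0) {
	/* Remove any stale mappings rather than asserting on them. */
	pmap_qremove(ks, KSTACK_GUARD_PAGES);
	ks += KSTACK_GUARD_PAGES * PAGE_SIZE;
}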
This revision now requires review to proceed. Apr 20 2020, 10:28 PM

Have you thought about whether there are any side-effects in swap behavior from using the same object? Might we run into clustering behavior?

sys/vm/vm_glue.c
428 ↗(On Diff #70823)

WAITFAIL is to get back to the preferred domain? This seems like it will be incredibly rare. Is it even worth handling?

The overload of kstack policy is a bit odd but I guess that's the only way to do it at this layer.

In D24473#539261, @jeff wrote:

Have you thought about whether there are any side-effects in swap behavior from using the same object? Might we run into clustering behavior?

We place a gap of unmapped KVA at the "base" (i.e., the end) of each stack. This maps to holes at each end of each stack in the object's pindex space, which I think will prevent clustering of pageouts. That's not obviously a good thing: if multiple stacks are in the laundry queue, then we want to swap them all out, and I don't see a disadvantage in clustering that. With respect to pageins, I think vm_thread_swapin() ensures that the swap pager will only read in the requested stack.

sys/vm/vm_glue.c
428 ↗(On Diff #70823)

At first I did not bother, but it is not really much extra work. It is probably rare in general, but if we are swapping out kernel stacks, swapins might be happening concurrently with a memory shortage.

I agree that specifying the policy this way is kind of awkward, I just didn't see a nicer way to do it.
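
To illustrate the shape under discussion (a sketch with hypothetical locals, not the exact diff): vm_page_grab_pages() returns the number of pages it grabbed, and VM_ALLOC_WAITFAIL makes a failing allocation sleep once and then return, so the caller can retry and re-apply the thread's kstack domain policy instead of blocking in one domain:

VM_OBJECT_WLOCK(kstack_object);
n = vm_page_grab_pages(kstack_object, pindex,
    req_class | VM_ALLOC_WIRED | VM_ALLOC_WAITFAIL, ma, npages);
VM_OBJECT_WUNLOCK(kstack_object);
if (n < npages) {
	/* The sleep already happened inside the grab; back out any
	 * partial progress and retry, re-selecting the domain. */
}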

431 ↗(On Diff #70823)

Ideally vm_page_grab_pages() would try to ensure that the request is satisfied by contiguous pages, but we don't yet have a good way of specifying that. Reservations don't help here because of the guard page. The individual page allocations will go to the per-CPU caches, which don't do a good job of returning contiguous memory. In practice, with plenty of free memory we do end up with many contiguous stacks, but small amounts of fragmentation are common, e.g.:

0x8a2a95000: vnode
0x8a2a96000: anonymous
0x8a2a97000: kernel-stack
0x8a2a98000: vnode
0x8a2a99000: kernel-stack
0x8a2a9a000: vnode
0x8a2a9b000: kernel-stack
0x8a2a9c000: kernel-stack
0x8a2a9d000: vnode
0x8a2a9e000: vnode
0x8a2a9f000: anonymous
0x8a2aa0000: kernel-stack

D14376 added a function that would be of use in this scenario.

This revision is now accepted and ready to land. Apr 21 2020, 3:56 AM
sys/vm/vm_glue.c
431 ↗(On Diff #70823)

It's not really correct to say that reservations don't help; it's just that they would never become fully populated.

I think reservations could be used if we introduce a layer of indirection in the stackaddr->obj pindex mapping. Rather than using atop(stackaddr) as the pindex, we could use a unr allocator or a vmem arena or whatever to allocate virtual indices within kstack_object. The virtual index for a stack mapped at address a would be PHYS_TO_VM_PAGE(pmap_kextract(a))->pindex / kstack_pages. Then there would be no holes in the pindex space, so the use of reservations for kstack_object wouldn't suffer from the problem above. This works especially well when kstack_pages is a power of 2, which it is by default.
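
A sketch of that indirection (not part of this revision; kstack_arena and both helpers are hypothetical):

static vmem_t *kstack_arena;	/* hypothetical: one index per stack */

/* Allocate a virtual stack index; page i lives at vidx * kstack_pages + i. */
static vm_pindex_t
kstack_vidx_alloc(void)
{
	vmem_addr_t vidx;

	if (vmem_alloc(kstack_arena, 1, M_BESTFIT | M_WAITOK, &vidx) != 0)
		return ((vm_pindex_t)-1);
	return ((vm_pindex_t)vidx);
}

/* Recover the virtual index of the stack mapped at address a. */
static vm_pindex_t
kstack_vidx(vm_offset_t a)
{
	return (PHYS_TO_VM_PAGE(pmap_kextract(a))->pindex / kstack_pages);
}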

  • Fix up the pindex calculation in vm_thread_swapout().
This revision now requires review to proceed. Apr 23 2020, 6:33 PM
This revision was not accepted when it landed; it landed in state Needs Review. Apr 26 2020, 8:09 PM
This revision was automatically updated to reflect the committed changes.