Details

Reviewers

alc
kib
rlibby
jeff
dougm

Commits

rS367595: vm_map: Handle kernel map entry allocator recursion

Summary

On platforms without a direct map, vm_map_insert() may in rare
situations need to allocate a kernel map entry in order to allocate
kernel map entries. This poses a problem similar to the one solved for
vmem boundary tags by vmem_bt_alloc(). In fact this problem is a bit
trickier since in the kernel map case we must allocate entries with the
kernel map locked, whereas vmem can recurse into itself because boundary
tags are allocated up-front. This diff tries to solve the problem.

The diff adds a custom slab allocator for kmapentzone which allocates
KVA directly from kernel_map, bypassing the kmem_ layer. This avoids
mutual recursion with the vmem btag allocator. Then, when
vm_map_insert() allocates a new kernel map entry, it avoids triggering
allocation of a new slab with M_NOVM until after the insertion is
complete. Instead, vm_map_insert() allocates from the reserve and sets
a flag in kernel_map to trigger re-population of the reserve just before
the map is unlocked.

I thought about a scheme for preallocating all of the KVA required for
kernel map entries during boot, like we do for radix nodes with
uma_zone_reserve_kva(). However, it's difficult to come up with a
reasonable upper bound for the number of kernel map entries that may be
required.

Test Plan

We are testing amd64 without UMA_MD_SMALL_ALLOC defined and
hit this panic on a system in the netperf cluster: https://reviews.freebsd.org/P442

I booted amd64 and i386 kernels with this change applied.

Diff Detail

Repository

rS FreeBSD src repository - subversion

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

markj requested review of this revision.Oct 19 2020, 3:56 AM

markj created this revision.

Harbormaster completed remote builds in B34251: Diff 78407.Oct 19 2020, 3:56 AM

markj edited the test plan for this revision. (Show Details)Oct 19 2020, 3:59 AM

markj added reviewers: alc, kib, rlibby, jeff, dougm.

markj added a subscriber: andrew.

Mark kmapentzone as NOFREE.

Harbormaster completed remote builds in B34252: Diff 78408.Oct 19 2020, 4:03 AM

kib added inline comments.Oct 22 2020, 12:41 PM

sys/vm/vm_map.c
200 ↗	(On Diff #78408)	Could addr + bytes overflow ? I was unable to convince myself that it cannot.
789 ↗	(On Diff #78408)	Do you need to handle MAP_REPLENISH there ?

markj added inline comments.Oct 22 2020, 1:23 PM

sys/vm/vm_map.c
200 ↗	(On Diff #78408)	I think vm_map_findspace() is guaranteed to return an address with `addr + bytes <= VM_MAX_KERNEL_ADDRESS`.
246 ↗	(On Diff #78408)	I can drop _NOFREE by adding a custom uma_freef implementation. _NOFREE is not needed when UMA_MD_SMALL_ALLOC is defined.
789 ↗	(On Diff #78408)	I don't think so, but only because the map is locked using a mutex. If it changed, e.g., to a rwlock, then it would need to be handled, so I'll add handling now.

Add a slab free function
Check for overflow after the vm_map_findspace() call
Add another check for MAP_REPLENISH

Harbormaster completed remote builds in B34345: Diff 78663.Oct 23 2020, 10:13 PM

Fix an inverted check for UMA_SLAB_PRIV.

Harbormaster completed remote builds in B34380: Diff 78743.Oct 25 2020, 4:55 PM

emaste added a subscriber: emaste.Oct 30 2020, 3:08 PM

markj added a subscriber: pho.Oct 30 2020, 3:13 PM

rlibby added inline comments.Oct 30 2020, 8:20 PM

sys/vm/vm_map.c
199–200 ↗	(On Diff #78743)	This is all an aside, I think this usage here is correct according to vm_map_findspace's comment: I think this interface is weird. It was not at first abundantly clear to me whether "max" is an inclusive or exclusive end point. I would tend to call inclusive "max" and exclusive "end". But I noticed e.g. that `kmem_subinit` seems to treat it as exclusive (`max = min + size`), which I think is an off-by-one. Also, why doesn't it simply return vm_map_max() on failure, or a special maximal value? The way it is currently written, the maximum offset cannot be allocated anyway (e.g., min=0, max=4095, then size is 4096, but if ret=0 for start=0 and length=4096, then 0+4096 > max=4095).
208–209 ↗	(On Diff #78743)	I see we don't have any current users of M_ZERO but I think we should pass through `wait & M_ZERO` anyway to avoid surprise later.
212–213 ↗	(On Diff #78743)	vm_map_remove()?
222–223 ↗	(On Diff #78743)	Is this because we may see UMA_SLAB_BOOT pages?
265–266 ↗	(On Diff #78743)	Is it necessary to prealloc more than the reserve? If there's more meaning here, I don't get it.
1006–1007 ↗	(On Diff #78743)	Why demote the panic?

Thanks for taking a look.

sys/vm/vm_map.c
212–213 ↗	(On Diff #78743)	Needs to be vm_map_delete() to avoid potentially recursing on the kernel map lock. I'm yet not sure if it's safe to hold the map lock across the kmem_back_domain() call.
222–223 ↗	(On Diff #78743)	Right. In fact they are leaked in this case. I was thinking of lifting the UMA_SLAB_BOOT checks in page_alloc() and pcpu_page_alloc() into keg_free_slab(), but that feels a bit hacky and I don't think it's a major problem that we might leak slabs here.
265–266 ↗	(On Diff #78743)	No, this can just prealloc the same number of items as the reserve. There is a separate issue in that keg_reserve is a per-NUMA domain quantity, but uma_prealloc() preallocates exactly the requested number of items. We could make uma_prealloc() allocate the requested number of items for each domain, but that would blow up memory usage in at least one case, buf_trie_zone. Alternately, perhaps keg_fetch_slab() should be allowed to violate the first-touch policy in order to satisfy an M_USE_RESERVE allocation.
1006–1007 ↗	(On Diff #78743)	Callers dereference the returned pointer immediately so we'll crash anyway, and it seemed unnecessary to check this in non-INVARIANTS kernels. I'm ok with restoring the old behaviour though.