Page MenuHomeFreeBSD

vm_map: Handle kernel map entry allocator recursion
ClosedPublic

Authored by markj on Oct 19 2020, 3:56 AM.
Tags
None
Referenced Files
Unknown Object (File)
Tue, Feb 27, 12:39 AM
Unknown Object (File)
Jan 10 2024, 1:18 PM
Unknown Object (File)
Dec 26 2023, 1:12 AM
Unknown Object (File)
Dec 26 2023, 1:12 AM
Unknown Object (File)
Dec 26 2023, 1:12 AM
Unknown Object (File)
Dec 26 2023, 1:12 AM
Unknown Object (File)
Dec 20 2023, 4:33 AM
Unknown Object (File)
Dec 16 2023, 9:07 AM

Details

Summary

On platforms without a direct map, vm_map_insert() may in rare
situations need to allocate a kernel map entry in order to allocate
kernel map entries. This poses a problem similar to the one solved for
vmem boundary tags by vmem_bt_alloc(). In fact this problem is a bit
trickier since in the kernel map case we must allocate entries with the
kernel map locked, whereas vmem can recurse into itself because boundary
tags are allocated up-front. This diff tries to solve the problem.

The diff adds a custom slab allocator for kmapentzone which allocates
KVA directly from kernel_map, bypassing the kmem_ layer. This avoids
mutual recursion with the vmem btag allocator. Then, when
vm_map_insert() allocates a new kernel map entry, it avoids triggering
allocation of a new slab with M_NOVM until after the insertion is
complete. Instead, vm_map_insert() allocates from the reserve and sets
a flag in kernel_map to trigger re-population of the reserve just before
the map is unlocked.

I thought about a scheme for preallocating all of the KVA required for
kernel map entries during boot, like we do for radix nodes with
uma_zone_reserve_kva(). However, it's difficult to come up with a
reasonable upper bound for the number of kernel map entries that may be
required.

Test Plan

We are testing amd64 without UMA_MD_SMALL_ALLOC defined and
hit this panic on a system in the netperf cluster: https://reviews.freebsd.org/P442

I booted amd64 and i386 kernels with this change applied.

Diff Detail

Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 34252
Build 31396: arc lint + arc unit

Event Timeline

markj requested review of this revision.Oct 19 2020, 3:56 AM
markj created this revision.
markj added reviewers: alc, kib, rlibby, jeff, dougm.
markj added a subscriber: andrew.

Mark kmapentzone as NOFREE.

sys/vm/vm_map.c
200

Could addr + bytes overflow ? I was unable to convince myself that it cannot.

789

Do you need to handle MAP_REPLENISH there ?

sys/vm/vm_map.c
200

I think vm_map_findspace() is guaranteed to return an address with addr + bytes <= VM_MAX_KERNEL_ADDRESS.

246

I can drop _NOFREE by adding a custom uma_freef implementation. _NOFREE is not needed when UMA_MD_SMALL_ALLOC is defined.

789

I don't think so, but only because the map is locked using a mutex. If it changed, e.g., to a rwlock, then it would need to be handled, so I'll add handling now.

markj marked 2 inline comments as done.
  • Add a slab free function
  • Check for overflow after the vm_map_findspace() call
  • Add another check for MAP_REPLENISH
  • Fix an inverted check for UMA_SLAB_PRIV.
sys/vm/vm_map.c
199–200

This is all an aside, I think this usage here is correct according to vm_map_findspace's comment:

I think this interface is weird. It was not at first abundantly clear to me whether "max" is an inclusive or exclusive end point. I would tend to call inclusive "max" and exclusive "end". But I noticed e.g. that kmem_subinit seems to treat it as exclusive (*max = *min + size), which I think is an off-by-one. Also, why doesn't it simply return vm_map_max() on failure, or a special maximal value? The way it is currently written, the maximum offset cannot be allocated anyway (e.g., min=0, max=4095, then size is 4096, but if ret=0 for start=0 and length=4096, then 0+4096 > max=4095).

208–209

I see we don't have any current users of M_ZERO but I think we should pass through wait & M_ZERO anyway to avoid surprise later.

212–213

vm_map_remove()?

222–223

Is this because we may see UMA_SLAB_BOOT pages?

249–250

Is it necessary to prealloc more than the reserve? If there's more meaning here, I don't get it.

987–988

Why demote the panic?

Thanks for taking a look.

sys/vm/vm_map.c
212–213

Needs to be vm_map_delete() to avoid potentially recursing on the kernel map lock.

I'm yet not sure if it's safe to hold the map lock across the kmem_back_domain() call.

222–223

Right. In fact they are leaked in this case. I was thinking of lifting the UMA_SLAB_BOOT checks in page_alloc() and pcpu_page_alloc() into keg_free_slab(), but that feels a bit hacky and I don't think it's a major problem that we might leak slabs here.

249–250

No, this can just prealloc the same number of items as the reserve.

There is a separate issue in that keg_reserve is a per-NUMA domain quantity, but uma_prealloc() preallocates exactly the requested number of items. We could make uma_prealloc() allocate the requested number of items for each domain, but that would blow up memory usage in at least one case, buf_trie_zone. Alternately, perhaps keg_fetch_slab() should be allowed to violate the first-touch policy in order to satisfy an M_USE_RESERVE allocation.

987–988

Callers dereference the returned pointer immediately so we'll crash anyway, and it seemed unnecessary to check this in non-INVARIANTS kernels. I'm ok with restoring the old behaviour though.

markj marked an inline comment as done.

Address some feedback.

This revision was not accepted when it landed; it landed in state Needs Review.Nov 11 2020, 6:56 PM
This revision was automatically updated to reflect the committed changes.