
vm_map: Handle kernel map entry allocator recursion

Authored by markj on Oct 19 2020, 3:56 AM.

On platforms without a direct map, vm_map_insert() may in rare
situations need to allocate a kernel map entry in order to allocate
kernel map entries. This poses a problem similar to the one solved for
vmem boundary tags by vmem_bt_alloc(). In fact, this problem is a bit
trickier, since in the kernel map case we must allocate entries with the
kernel map locked, whereas vmem can recurse into itself because boundary
tags are allocated up front. This diff solves the problem.

The diff adds a custom slab allocator for kmapentzone which allocates
KVA directly from kernel_map, bypassing the kmem_ layer. This avoids
mutual recursion with the vmem btag allocator. Then, when
vm_map_insert() allocates a new kernel map entry, it avoids triggering
allocation of a new slab with M_NOVM until after the insertion is
complete. Instead, vm_map_insert() allocates from the reserve and sets
a flag in kernel_map to trigger re-population of the reserve just before
the map is unlocked.

I thought about a scheme for preallocating all of the KVA required for
kernel map entries during boot, like we do for radix nodes with
uma_zone_reserve_kva(). However, it's difficult to come up with a
reasonable upper bound for the number of kernel map entries that may be
needed.

Test Plan

We are testing amd64 without UMA_MD_SMALL_ALLOC defined and
hit a panic of this kind on a system in the netperf cluster.

I booted amd64 and i386 kernels with this change applied.

Diff Detail

Lint Passed
No Test Coverage
Build Status
Buildable 34251
Build 31395: arc lint + arc unit

Event Timeline

markj requested review of this revision. Oct 19 2020, 3:56 AM
markj created this revision.
markj added reviewers: alc, kib, rlibby, jeff, dougm.
markj added a subscriber: andrew.

Mark kmapentzone as NOFREE.


Could addr + bytes overflow? I was unable to convince myself that it cannot.


Do you need to handle MAP_REPLENISH there?


I think vm_map_findspace() is guaranteed to return an address with addr + bytes <= VM_MAX_KERNEL_ADDRESS.


I can drop _NOFREE by adding a custom uma_freef implementation. _NOFREE is not needed when UMA_MD_SMALL_ALLOC is defined.


I don't think so, but only because the map is locked using a mutex. If it changed, e.g., to a rwlock, then it would need to be handled, so I'll add handling now.

markj marked 2 inline comments as done.
  • Add a slab free function.
  • Check for overflow after the vm_map_findspace() call.
  • Add another check for MAP_REPLENISH.
  • Fix an inverted check for UMA_SLAB_PRIV.

This is all an aside; I think this usage here is correct according to vm_map_findspace()'s comment.

I think this interface is weird. It was not at first abundantly clear to me whether "max" is an inclusive or exclusive end point. I would tend to call inclusive "max" and exclusive "end". But I noticed e.g. that kmem_subinit seems to treat it as exclusive (*max = *min + size), which I think is an off-by-one. Also, why doesn't it simply return vm_map_max() on failure, or a special maximal value? The way it is currently written, the maximum offset cannot be allocated anyway (e.g., min=0, max=4095, then size is 4096, but if ret=0 for start=0 and length=4096, then 0+4096 > max=4095).


I see we don't have any current users of M_ZERO, but I think we should pass through wait & M_ZERO anyway to avoid surprises later.


Is this because we may see UMA_SLAB_BOOT pages?


Is it necessary to prealloc more than the reserve? If there's more meaning here, I don't get it.


Why demote the panic?

Thanks for taking a look.


Needs to be vm_map_delete() to avoid potentially recursing on the kernel map lock.

I'm not yet sure whether it's safe to hold the map lock across the kmem_back_domain() call.


Right. In fact they are leaked in this case. I was thinking of lifting the UMA_SLAB_BOOT checks in page_alloc() and pcpu_page_alloc() into keg_free_slab(), but that feels a bit hacky and I don't think it's a major problem that we might leak slabs here.


No, this can just prealloc the same number of items as the reserve.

There is a separate issue in that keg_reserve is a per-NUMA domain quantity, but uma_prealloc() preallocates exactly the requested number of items. We could make uma_prealloc() allocate the requested number of items for each domain, but that would blow up memory usage in at least one case, buf_trie_zone. Alternately, perhaps keg_fetch_slab() should be allowed to violate the first-touch policy in order to satisfy an M_USE_RESERVE allocation.


Callers dereference the returned pointer immediately so we'll crash anyway, and it seemed unnecessary to check this in non-INVARIANTS kernels. I'm ok with restoring the old behaviour though.

markj marked an inline comment as done.

Address some feedback.

This revision was not accepted when it landed; it landed in state Needs Review. Nov 11 2020, 6:56 PM
This revision was automatically updated to reflect the committed changes.