Page MenuHomeFreeBSD

vm_map: Handle kernel map entry allocator recursion

Authored by markj on Oct 19 2020, 3:56 AM.



On platforms without a direct map, vm_map_insert() may in rare
situations need to allocate a kernel map entry in order to allocate
kernel map entries. This poses a problem similar to the one solved for
vmem boundary tags by vmem_bt_alloc(). In fact this problem is a bit
trickier since in the kernel map case we must allocate entries with the
kernel map locked, whereas vmem can recurse into itself because boundary
tags are allocated up-front. This diff tries to solve the problem.

The diff adds a custom slab allocator for kmapentzone which allocates
KVA directly from kernel_map, bypassing the kmem_ layer. This avoids
mutual recursion with the vmem btag allocator. Then, when
vm_map_insert() allocates a new kernel map entry, it avoids triggering
allocation of a new slab with M_NOVM until after the insertion is
complete. Instead, vm_map_insert() allocates from the reserve and sets
a flag in kernel_map to trigger re-population of the reserve just before
the map is unlocked.

I thought about a scheme for preallocating all of the KVA required for
kernel map entries during boot, like we do for radix nodes with
uma_zone_reserve_kva(). However, it's difficult to come up with a
reasonable upper bound for the number of kernel map entries that may be

Test Plan

We are testing amd64 without UMA_MD_SMALL_ALLOC defined and
hit this panic on a system in the netperf cluster:

I booted amd64 and i386 kernels with this change applied.

Diff Detail

rS FreeBSD src repository
Automatic diff as part of commit; lint not applicable.
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

markj requested review of this revision.Oct 19 2020, 3:56 AM
markj created this revision.
markj added reviewers: alc, kib, rlibby, jeff, dougm.
markj added a subscriber: andrew.

Mark kmapentzone as NOFREE.

200 ↗(On Diff #78408)

Could addr + bytes overflow ? I was unable to convince myself that it cannot.

789 ↗(On Diff #78408)

Do you need to handle MAP_REPLENISH there ?

200 ↗(On Diff #78408)

I think vm_map_findspace() is guaranteed to return an address with addr + bytes <= VM_MAX_KERNEL_ADDRESS.

246 ↗(On Diff #78408)

I can drop _NOFREE by adding a custom uma_freef implementation. _NOFREE is not needed when UMA_MD_SMALL_ALLOC is defined.

789 ↗(On Diff #78408)

I don't think so, but only because the map is locked using a mutex. If it changed, e.g., to a rwlock, then it would need to be handled, so I'll add handling now.

markj marked 2 inline comments as done.
  • Add a slab free function
  • Check for overflow after the vm_map_findspace() call
  • Add another check for MAP_REPLENISH
  • Fix an inverted check for UMA_SLAB_PRIV.
199–200 ↗(On Diff #78743)

This is all an aside, I think this usage here is correct according to vm_map_findspace's comment:

I think this interface is weird. It was not at first abundantly clear to me whether "max" is an inclusive or exclusive end point. I would tend to call inclusive "max" and exclusive "end". But I noticed e.g. that kmem_subinit seems to treat it as exclusive (*max = *min + size), which I think is an off-by-one. Also, why doesn't it simply return vm_map_max() on failure, or a special maximal value? The way it is currently written, the maximum offset cannot be allocated anyway (e.g., min=0, max=4095, then size is 4096, but if ret=0 for start=0 and length=4096, then 0+4096 > max=4095).

208–209 ↗(On Diff #78743)

I see we don't have any current users of M_ZERO but I think we should pass through wait & M_ZERO anyway to avoid surprise later.

212–213 ↗(On Diff #78743)


222–223 ↗(On Diff #78743)

Is this because we may see UMA_SLAB_BOOT pages?

265–266 ↗(On Diff #78743)

Is it necessary to prealloc more than the reserve? If there's more meaning here, I don't get it.

1006–1007 ↗(On Diff #78743)

Why demote the panic?

Thanks for taking a look.

212–213 ↗(On Diff #78743)

Needs to be vm_map_delete() to avoid potentially recursing on the kernel map lock.

I'm yet not sure if it's safe to hold the map lock across the kmem_back_domain() call.

222–223 ↗(On Diff #78743)

Right. In fact they are leaked in this case. I was thinking of lifting the UMA_SLAB_BOOT checks in page_alloc() and pcpu_page_alloc() into keg_free_slab(), but that feels a bit hacky and I don't think it's a major problem that we might leak slabs here.

265–266 ↗(On Diff #78743)

No, this can just prealloc the same number of items as the reserve.

There is a separate issue in that keg_reserve is a per-NUMA domain quantity, but uma_prealloc() preallocates exactly the requested number of items. We could make uma_prealloc() allocate the requested number of items for each domain, but that would blow up memory usage in at least one case, buf_trie_zone. Alternately, perhaps keg_fetch_slab() should be allowed to violate the first-touch policy in order to satisfy an M_USE_RESERVE allocation.

1006–1007 ↗(On Diff #78743)

Callers dereference the returned pointer immediately so we'll crash anyway, and it seemed unnecessary to check this in non-INVARIANTS kernels. I'm ok with restoring the old behaviour though.

markj marked an inline comment as done.

Address some feedback.

This revision was not accepted when it landed; it landed in state Needs Review.Wed, Nov 11, 6:56 PM
This revision was automatically updated to reflect the committed changes.