Allocate the pcpu memory on a single level 2 page
Needs Review (Public)

Authored by andrew on Apr 27 2022, 3:59 PM.

Details

Summary

We need to be careful not to promote or demote the address range
containing the per-CPU structures. The exception handlers dereference
them, so any window in which the mapping is invalid may cause
recursive exceptions.

To allow the per-CPU memory to be allocated in the appropriate
domain, use kmem_alloc_contig_domainset with both size and alignment
equal to the level 2 block size. This ensures the kernel allocates the
full level 2 block, so the mapping won't be promoted after allocation.
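
As a rough userland illustration of why the size and the alignment both need to equal the level 2 block size, the sketch below checks that such an allocation covers exactly one L2 block. This is not the kernel code from the diff: L2_SIZE assumes a 4 KiB translation granule, aligned_alloc stands in for the kernel allocator, and alloc_l2_block, l2_block_base and within_one_l2_block are hypothetical helpers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumption: 4 KiB granule, so one level 2 block maps 2 MiB. */
#define L2_SHIFT	21
#define L2_SIZE		((uintptr_t)1 << L2_SHIFT)
#define L2_OFFSET	(L2_SIZE - 1)

/* Hypothetical helper: base address of the L2 block containing addr. */
static uintptr_t
l2_block_base(uintptr_t addr)
{
	return (addr & ~L2_OFFSET);
}

/*
 * Hypothetical userland stand-in for the kernel allocation: request
 * L2_SIZE bytes aligned to L2_SIZE. Returns NULL on failure.
 */
static void *
alloc_l2_block(void)
{
	return (aligned_alloc(L2_SIZE, L2_SIZE));
}

/* Returns 1 if [start, start + size) lies within a single L2 block. */
static int
within_one_l2_block(uintptr_t start, size_t size)
{
	return (l2_block_base(start) == l2_block_base(start + size - 1));
}
```

With a smaller size or a weaker alignment, the remainder of the block could later be filled by unrelated allocations, which is what would make a subsequent promotion possible.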

Reported by: dch
Sponsored by: The FreeBSD Foundation

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint OK
Unit
No Unit Test Coverage
Build Status
Buildable 45399
Build 42287: arc lint + arc unit

Event Timeline

Does it make sense to add asserts in the demotion code to ensure that it does not touch pcpu mappings? Or even, somewhat more to the point, in the break-before-make code.
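
A minimal sketch of what such an assertion might check, written as standalone C so it can be compiled outside the kernel. Everything here is hypothetical: struct pcpu_range, range_overlaps_pcpu and assert_not_pcpu are illustrative names, not existing pmap interfaces, and in the kernel the check would presumably be a KASSERT.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical record of the VA range backing one CPU's pcpu data. */
struct pcpu_range {
	uintptr_t start;	/* inclusive */
	uintptr_t end;		/* exclusive */
};

/* True if [va, va + size) intersects the pcpu range. */
static bool
range_overlaps_pcpu(const struct pcpu_range *r, uintptr_t va, uintptr_t size)
{
	return (va < r->end && va + size > r->start);
}

/*
 * Hypothetical guard for a break-before-make update: assert that the
 * range about to be temporarily invalidated holds no pcpu mapping.
 */
static void
assert_not_pcpu(const struct pcpu_range *ranges, int n, uintptr_t va,
    uintptr_t size)
{
	for (int i = 0; i < n; i++)
		assert(!range_overlaps_pcpu(&ranges[i], va, size));
}
```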

BTW, D24758 will also solve this problem while providing other benefits. It's amd64 only but I think it should be possible to make the implementation generic. There is some bootstrapping problem with UMA/SMR that needs to be fixed before it can land, but I forgot the details.

sys/arm64/arm64/mp_machdep.c
491

IMO this is a hack, especially given that it'll waste memory on small systems, so we definitely need a comment explaining why allocation is done this way.

540

The boot stacks are still allocated directly, so this line should stay.

  • Free boot stacks on failure again
  • Add a comment to alloc_pcpu
  • Add an assertion to pmap_update_entry
sys/arm64/arm64/mp_machdep.c
494
503

To be clear, this change doesn't actually guarantee that the allocation is mapped by an L2 block (transparent superpages could be administratively disabled), and we're also assuming here that we'll never transparently use L1 blocks. So the approach really isn't ideal.

Taking a step back, I wonder if we can use IPIs to pause all other CPUs when pmap_update_entry() is promoting the L2 block containing pcpu pages?

sys/arm64/arm64/mp_machdep.c
503

Wouldn't an IPI have the same problem? You need to ensure that the IPI path does not touch anything whose mapping could be broken by either promotion or demotion. So, for instance, we must ensure that the pages containing the global variables used by smp_rendezvous_cpus() are safe.

sys/arm64/arm64/mp_machdep.c
503

Hmm, can we do all of the work from the smp_rendezvous callback? That is, the initiator's callback looks like this:

while (atomic_load_acq_int(&spinning) != mp_ncpus - 1)
    cpu_spinwait();

pmap_clear_bits(pte, ATTR_DESCR_VALID);

pmap_invalidate_range(pmap, va, va + size, false);

pmap_store(pte, newpte);
dsb(ishst);

atomic_store_rel_int(&done, 1);

and on targets:

atomic_add_rel_int(&spinning, 1);
while (atomic_load_acq_int(&done) == 0)
    cpu_spinwait();

I think this would solve the problem you pointed out.
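
The handshake above can be simulated in userland, as a sketch only: POSIX threads stand in for CPUs, C11 atomics for the kernel atomic(9) operations, sched_yield() for cpu_spinwait(), and a plain flag (mapping_valid) for the PTE valid bit. NCPUS, target_cpu, initiator_cpu and run_rendezvous are hypothetical names, and the TLB invalidation step has no userland equivalent here.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>

#define NCPUS	4			/* assumption: simulated CPU count */

static atomic_int spinning;		/* targets that have parked */
static atomic_int done;			/* set once the update is complete */
static atomic_int mapping_valid = 1;	/* stand-in for the PTE valid bit */
static atomic_int saw_invalid;		/* targets that resumed too early */

static void *
target_cpu(void *arg)
{
	(void)arg;
	/* Announce that this CPU is parked, then wait for the initiator. */
	atomic_fetch_add_explicit(&spinning, 1, memory_order_release);
	while (atomic_load_explicit(&done, memory_order_acquire) == 0)
		sched_yield();
	/* After release, the mapping must be valid again. */
	if (atomic_load_explicit(&mapping_valid, memory_order_relaxed) == 0)
		atomic_fetch_add(&saw_invalid, 1);
	return (NULL);
}

static void
initiator_cpu(int ntargets)
{
	/* Wait until every target is parked. */
	while (atomic_load_explicit(&spinning, memory_order_acquire) !=
	    ntargets)
		sched_yield();
	/* Break, (invalidate), make; nobody else is running meanwhile. */
	atomic_store_explicit(&mapping_valid, 0, memory_order_relaxed);
	atomic_store_explicit(&mapping_valid, 1, memory_order_relaxed);
	/* Release the parked targets. */
	atomic_store_explicit(&done, 1, memory_order_release);
}

/* Returns how many targets saw an invalid mapping, or -1 on error. */
static int
run_rendezvous(void)
{
	pthread_t tds[NCPUS - 1];
	int i, started = 0;

	atomic_store(&spinning, 0);
	atomic_store(&done, 0);
	atomic_store(&saw_invalid, 0);
	for (i = 0; i < NCPUS - 1; i++) {
		if (pthread_create(&tds[i], NULL, target_cpu, NULL) != 0)
			break;
		started++;
	}
	initiator_cpu(started);
	for (i = 0; i < started; i++)
		pthread_join(tds[i], NULL);
	return (started == NCPUS - 1 ? atomic_load(&saw_invalid) : -1);
}
```

Note that the simulation says nothing about where spinning and done live. In the kernel, the pages holding those two variables must themselves stay mapped throughout.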

sys/arm64/arm64/mp_machdep.c
503

Don't we need to execute code to get into the callback? For instance, what if the demotion needs to occur for the L2 page where smp_ipi_mtx is located?

Perhaps there should be a dedicated IPI vector and a dedicated L1 page holding the spinning indicator (a kind of barrier), which would allow all other CPUs to be safely 'grounded' while the current one performs in-kernel promotion/demotion for specific unsafe places.

sys/arm64/arm64/mp_machdep.c
503

But with this approach all CPUs are "parked" while the L2 PTE is updated. Nothing will try to acquire the smp_ipi mutex during the window where the mapping is invalid. Maybe I'm missing something.

sys/arm64/arm64/mp_machdep.c
503

OK, it is not the smp_ipi mutex itself, but there is still a page containing some variable that you need to re-check in the loop to detect the end of parking.

I have a patch to use an IPI in pmap_update_entry. The variable used to exit the loop lives on the caller's stack, so it would break exceptions in the same way if we promote/demote that page.

The main issue is deciding which memory needs to use an IPI. I prototyped using a software flag in the PTE; however, this is likely to stop promotion, as pmap_promote_l2 checks that these attributes are identical over the entire range before promoting.

Can you use a flag in vm_page_t, e.g. reserve a PG_MACHDEP bit and mark the pcpu-backing pages with it?

But in fact I do not quite understand your point about using a software-defined bit in the PTE. IMO, if a page belongs to a superpage that includes a pcpu page, it must have this bit set in its PTE, because its promotion/demotion affects pcpu, no? Then the attributes become identical for the whole run of constituent small pages.
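
To make the two positions concrete, here is a toy model of the attribute check, not the real pmap_promote_l2. It assumes a 4 KiB granule (512 L3 entries per L2 block) and a simplified PTE layout with attributes in the low 12 bits. HYP_ATTR_MASK, HYP_ATTR_SW_PCPU, can_promote, fill_l3 and set_flag_all are all hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define L3_ENTRIES		512		/* per L2 block, 4 KiB granule */
#define L3_PAGE_SHIFT		12
#define L2_SIZE			((uint64_t)L3_ENTRIES << L3_PAGE_SHIFT)
#define HYP_ATTR_MASK		((uint64_t)0xfff)	/* toy attribute field */
#define HYP_ATTR_SW_PCPU	((uint64_t)1 << 11)	/* toy software bit */

/* Fill a run of L3 entries mapping contiguous pages starting at pa. */
static void
fill_l3(uint64_t *l3, uint64_t pa, uint64_t attrs)
{
	for (int i = 0; i < L3_ENTRIES; i++)
		l3[i] = (pa + ((uint64_t)i << L3_PAGE_SHIFT)) | attrs;
}

/* Set a flag in every entry of the run. */
static void
set_flag_all(uint64_t *l3, uint64_t flag)
{
	for (int i = 0; i < L3_ENTRIES; i++)
		l3[i] |= flag;
}

/*
 * Toy promotion check: the run must start on an L2 boundary, be
 * physically contiguous, and carry identical attributes throughout.
 */
static bool
can_promote(const uint64_t *l3)
{
	uint64_t attrs = l3[0] & HYP_ATTR_MASK;
	uint64_t pa = l3[0] & ~HYP_ATTR_MASK;

	if ((pa & (L2_SIZE - 1)) != 0)
		return (false);
	for (int i = 1; i < L3_ENTRIES; i++) {
		if ((l3[i] & ~HYP_ATTR_MASK) !=
		    pa + ((uint64_t)i << L3_PAGE_SHIFT))
			return (false);
		if ((l3[i] & HYP_ATTR_MASK) != attrs)
			return (false);
	}
	return (true);
}
```

In this model, flagging only the pcpu page makes can_promote fail, which matches the concern about stopping promotion. Flagging every page in the run makes it pass again, which matches the suggestion that the bit belongs on the whole superpage.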

I've created D35434 as an alternative approach.