
Introduce a dynamic pcpu layout on amd64.
Accepted · Public

Authored by markj on May 8 2020, 12:21 AM.

Details

Summary

Currently we allocate a single 4KB page for each per-CPU structure and
place them in an array. UMA's dynamic per-CPU allocator returns arrays
of 4KB pages that mirror the base array, so per-CPU data can be
referenced relative to the location of the base pcpu structure.

There is no requirement that the base pcpu structures must be
contiguous, however. This commit adds some code to dynamically lay out
the base pcpu structures with the aim of using 2MB mappings for all
per-CPU data (base pcpu, DPCPU, and UMA slabs). This has the following
benefits:


1. Improved TLB efficiency. During boot a GENERIC kernel allocates
roughly 64KB of data per CPU, mostly for counter(9) and malloc(9)
stats. Currently all of that data is mapped using 4KB pages. The
amount of per-CPU data allocated during boot is only going to grow
over time, and some subsystems, e.g., pf, VFS, perform many dynamic
allocations of per-CPU data.

2. Better control over kernel memory fragmentation. Previously,
dynamically allocated per-CPU slabs came directly from the page
allocator, so their placement was effectively random, and per-CPU
data structures tend to be long-lived.

3. The DPCPU indirection is removed. Previously, accessing a DPCPU
field involved an extra memory access. With this patch the DPCPU
region immediately follows the base pcpu structure, so its offset
relative to the base is known at compile-time.

4. Proper NUMA affinity for per-CPU structures allocated early during
boot. The new allocator includes a bootstrap allocator which always
returns domain-correct memory. Previously, anything allocated
during SI_SUB_CPU or earlier would always be placed entirely in domain 0.

5. It allows the UMA per-CPU slab size limit of 4KB to be increased, if
that ever becomes useful.
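
Item 3 can be illustrated with a minimal sketch. All names here (struct sketch_pcpu, sketch_dpcpu_indirect(), sketch_dpcpu_direct()) are hypothetical and simplified; they are not the kernel's actual DPCPU macros. The sketch only assumes what the summary states: the old scheme needs an extra load of a per-CPU pointer, while in the new layout the DPCPU region immediately follows the base pcpu structure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u

/* Hypothetical stand-in for the base per-CPU structure, padded to one page. */
struct sketch_pcpu {
	uintptr_t	pc_dynamic;	/* old scheme: address of the DPCPU region */
	char		pc_pad[SKETCH_PAGE_SIZE - sizeof(uintptr_t)];
};

static_assert(sizeof(struct sketch_pcpu) == SKETCH_PAGE_SIZE,
    "sketch pcpu must fill exactly one page");

/* Old scheme: load pc_dynamic first, then add the field offset. */
static inline void *
sketch_dpcpu_indirect(struct sketch_pcpu *pc, size_t field_off)
{
	return ((void *)(pc->pc_dynamic + field_off));
}

/* New scheme: the DPCPU region starts right after the pcpu structure. */
static inline void *
sketch_dpcpu_direct(struct sketch_pcpu *pc, size_t field_off)
{
	return ((char *)pc + sizeof(struct sketch_pcpu) + field_off);
}
```

In the direct variant, the only run-time input is the pcpu base address; the rest of the address is a constant, which is what makes the offset known at compile time.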

CPUs are grouped by domain into 2MB pages. Initially, MAXCPU * NBPDR
bytes of KVA are reserved for per-CPU data, but this is a worst case and
in practice most of it is not used. Per-CPU data is always mapped into
ranges of 2MB pages. For example, with 4 CPUs and 2 domains, the
allocator always allocates 4MB of KVA at once. Usually the KVA quantum
will be 2MB * vm_ndomains, but it may be larger if there are many CPUs
in a domain. The initial segment for a given CPU looks like this:

| pcpu 0 | dpcpu 0 | UMA per-CPU slabs ... | pcpu 1 | dpcpu 1 | ...
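
The KVA quantum arithmetic described above can be sketched as follows. This is an illustration under stated assumptions, not code from the patch: sketch_kva_quantum() and its seg_size parameter (an assumed per-CPU segment size) are hypothetical, and the sketch assumes every domain receives the same whole number of 2MB pages, sized for the most populous domain.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NBPDR ((size_t)2 << 20)	/* 2MB superpage size on amd64 */

/*
 * Hypothetical helper: size in bytes of one KVA quantum, given the number
 * of domains, the largest number of CPUs in any one domain, and an assumed
 * per-CPU segment size.  Each domain gets the same number of 2MB pages.
 */
static size_t
sketch_kva_quantum(size_t ndomains, size_t max_cpus_per_domain,
    size_t seg_size)
{
	size_t bytes, pages;

	assert(ndomains > 0 && max_cpus_per_domain > 0 && seg_size > 0);
	bytes = max_cpus_per_domain * seg_size;
	pages = (bytes + SKETCH_NBPDR - 1) / SKETCH_NBPDR;	/* round up */
	return (ndomains * pages * SKETCH_NBPDR);
}
```

With 2 domains of 2 CPUs each and an assumed 256KB segment, the result is 2 * 2MB = 4MB, matching the example in the summary. With 16 CPUs per domain at the same segment size, each domain needs two 2MB pages and the quantum grows to 8MB.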

Early during boot we allocate 2MB of physical memory for the BSP, mapped
at VM_MIN_KERNEL_ADDRESS. This is used to allocate the BSP's pcpu and
dpcpu regions. The rest is given to the UMA per-CPU bootstrap
allocator, which returns memory from this region.

During SI_SUB_CPU, after CPU IDs are fixed, pcpu_layout() computes the
addresses for the rest of the pcpu structures. It tries to pack in as
many CPUs as it can into each 2MB page. It also backs the rest of the
bootstrap region with 2MB physical pages if needed. After this point
UMA's pcpu allocator exits bootstrap mode, and starts using a vmem arena
to manage KVA for CPU 0.
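
As a rough illustration of the packing step, the sketch below assigns each CPU an offset within the bootstrap region, grouping CPUs by domain. It is not the patch's pcpu_layout(): the function name, the fixed per-CPU stride seg_size, and the per-domain run size run_bytes are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NBPDR	((size_t)2 << 20)	/* 2MB superpage size */
#define SKETCH_MAXDOM	8			/* arbitrary limit for the sketch */

/*
 * Hypothetical layout pass: domain d owns the byte range
 * [d * run_bytes, (d + 1) * run_bytes), and CPUs of the same domain are
 * packed into it at successive seg_size strides.
 */
static void
sketch_pcpu_layout(const int *cpu_domain, size_t ncpus, size_t ndomains,
    size_t seg_size, size_t run_bytes, size_t *offsets)
{
	size_t next[SKETCH_MAXDOM] = { 0 };	/* next free slot per domain */
	size_t i;
	int d;

	assert(ndomains <= SKETCH_MAXDOM && run_bytes % SKETCH_NBPDR == 0);
	for (i = 0; i < ncpus; i++) {
		d = cpu_domain[i];
		assert(d >= 0 && (size_t)d < ndomains);
		assert((next[d] + 1) * seg_size <= run_bytes);
		offsets[i] = (size_t)d * run_bytes + next[d] * seg_size;
		next[d]++;
	}
}
```

For CPUs 0..3 in domains {0, 1, 0, 1}, a 256KB stride, and one 2MB page per domain, CPUs 0 and 2 share the first 2MB page and CPUs 1 and 3 share the second.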

Diff Detail

Lint
Lint OK
Unit
No Unit Test Coverage
Build Status
Buildable 31336
Build 28970: arc lint + arc unit

Event Timeline

markj requested review of this revision. May 8 2020, 12:21 AM
markj created this revision.
  • Add a vm_phys_seg[] entry for the bootstrap pcpu region.
  • When pcpu_layout() backs the rest of the bootstrap region with 2MB pages, it must subtract 2MB from the allocation size for domain 0, since it is already backed by the initial pcpu bootstrap allocation.
  • Rebase on D24755 to shrink the uma_core.c diff slightly.
  • Restore DPCPU indirection in kernel modules for now. The value of DPCPU_START is kld-specific, so DPCPU_BASE_OFFSET() gives the wrong value for dpcpu fields allocated from the kld region (the modspace field).
  • Fix a bug in uma_pcpu_init2(): when marking the bootstrap region as allocated, we have to allocate the region at the same granularity as it gets freed, i.e., a page at a time.
sys/amd64/amd64/mp_machdep.c
188

I believe this formula needs a comment, and perhaps also a diagram explaining the layout.

200

So we assume that the BSP is always in domain 0. Is that enforced somehow?

sys/amd64/amd64/pmap.c
1417

Why uint64_t and not vm_offset_t or u_long?

1695

I had to re-initialize the cpuhead slist there; otherwise the BSP appeared twice on it. How do you handle that?

sys/amd64/include/pcpu_aux.h
44–45

I think that after your patch, this assert would better express the intent if you check that sizeof == PAGE_SIZE.

markj marked 2 inline comments as done.
  • Fix cpuhead initialization.
  • Weaken the amd64 assertion about sizeof(struct pcpu).
sys/amd64/amd64/mp_machdep.c
200

It is not enforced AFAIK. I can see two solutions:

  • Modify renumber_domains() to ensure that the BSP's domain is 0.
  • Add some indirection in the layout calculation here: instead of having runs of 2MB pages:
| dom0 pcpu | dom1 pcpu | ... |

reorder them so that the BSP's domain always comes first. Once the layout is calculated, we do not require the domains to be in any particular order; this will just make pcpu_layout() more complicated.

Do you think it is useful to ensure that the BSP belongs to domain 0?
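
The second option could be sketched as a small index remapping. The helper below is hypothetical (sketch_domidx() is an illustrative name, and the swap policy is an assumption, not necessarily what the patch does): it exchanges the BSP's domain with domain 0 when choosing a position in the layout, and leaves every other domain in place.

```c
#include <assert.h>

/*
 * Hypothetical remapping from a domain number to its position in the
 * layout, so that the BSP's domain always occupies position 0.
 */
static int
sketch_domidx(int domain, int bsp_domain)
{
	assert(domain >= 0 && bsp_domain >= 0);
	if (domain == bsp_domain)
		return (0);		/* the BSP's domain comes first */
	if (domain == 0)
		return (bsp_domain);	/* domain 0 takes the vacated slot */
	return (domain);		/* all other domains are unchanged */
}
```

When the BSP is already in domain 0 the mapping is the identity, so the common case is unaffected.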

sys/amd64/amd64/pmap.c
1417

Just for consistency with allocpages() above.

1695

It is a bug in the patch, thanks.

sys/amd64/include/pcpu_aux.h
44–45

I think we just assume that the sizeof is a multiple of PAGE_SIZE. I updated the assertion.
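
A weakened assertion of this kind might look like the sketch below. The structure and macro names are placeholders, not the contents of pcpu_aux.h; the only point is the change from an equality check to a multiple-of-page-size check.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096u

/* Placeholder for the base per-CPU structure; here it spans two pages. */
struct sketch_pcpu {
	char	pc_bytes[2 * SKETCH_PAGE_SIZE];
};

/* Accept any whole number of pages rather than exactly one page. */
static_assert(sizeof(struct sketch_pcpu) % SKETCH_PAGE_SIZE == 0,
    "sketch pcpu size must be a multiple of the page size");

/* Number of pages occupied by one base per-CPU structure. */
static size_t
sketch_pcpu_pages(void)
{
	return (sizeof(struct sketch_pcpu) / SKETCH_PAGE_SIZE);
}
```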

markj added inline comments.
sys/amd64/amd64/mp_machdep.c
188

This can actually be simplified. I added a block comment above pcpu_layout().

markj marked an inline comment as done.
  • Simplify layout calculation.
  • Add a block comment above pcpu_layout().

Otherwise looks good.

sys/amd64/amd64/mp_machdep.c
200

I reviewed the ACPI 6.3 spec, our algorithm for mapping VM domains to ACPI SRAT domains (or the reverse), and how Intel hardware selects the boot CPU. Basically, I believe the concern is real: from what I know, Intel multi-socket hardware starts by executing the same BIOS code on designated cores on each socket, and then one winner is selected among the sockets. So in theory we might end up with a BSP whose domain is not the first domain in the SRAT.

I suspect this does not happen only because current BIOSes reprogram home agents so that the BSP's socket provides the lowest addresses in the physical memory map. But there seems to be no provision in the ACPI standard that would imply this.

I do not have a preference between the approaches you described; it is up to you as the implementor. I am fine with whatever suits you.

This revision is now accepted and ready to land. May 23 2020, 7:17 PM

Handle the possibility that the BSP does not belong to domain 0.

This revision now requires review to proceed. May 27 2020, 3:09 PM
This revision is now accepted and ready to land. May 27 2020, 8:50 PM
sys/amd64/amd64/uma_machdep.c
138

Hmm, should this also do shuffling in the style of pcpu_domidx()?

sys/amd64/amd64/uma_machdep.c
138

I don't think so. Here we are just allocating 4KB independently for each CPU. The pcpu structures are already placed, so pc_domain gives the correct domain.