Page MenuHomeFreeBSD

Back pcpu zone with domain correct pages
ClosedPublic

Authored by mmacy on Jun 20 2018, 7:12 PM.
Tags
None
Referenced Files
Unknown Object (File)
Mon, Jan 13, 1:06 PM
Unknown Object (File)
Sun, Jan 12, 8:07 PM
Unknown Object (File)
Fri, Jan 10, 6:38 PM
Unknown Object (File)
Tue, Jan 7, 7:23 PM
Unknown Object (File)
Tue, Jan 7, 4:38 PM
Unknown Object (File)
Sun, Jan 5, 12:38 PM
Unknown Object (File)
Sun, Jan 5, 12:33 PM
Unknown Object (File)
Sun, Jan 5, 12:28 PM

Details

Summary
  • Change pcpu zone consumers to use a stride size of PAGE_SIZE (defined as UMA_PCPU_ZONE_SIZE to make future identification easier)
  • allocate page from the correct domain for a given cpu

The former slab size of sizeof(struct pcpu) was somewhat arbitrary. The new value is PAGE_SIZE because that's the smallest granularity which the VM can allocate a slab for a given domain. If you have fewer than PAGE_SIZE/8 counters on your system there will be some memory wasted, but this is obviously something where you want the cache line to be coming from the correct domain.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

  • make zeroing inline in the alloc path
  • check that pc_domain is valid
  • don't set pc_domain in the non NUMA case on x86
  • retry from domain zero if allocation fails from a non-zero domain (handling hw.physmem reduction)
sys/kern/subr_counter.c
57 ↗(On Diff #44699)

@glebius Why do you use IPIs for zeroing?

Some cosmetic stuff but I'm happy with this patch. If you fix those issues I approve.

sys/sys/pcpu.h
221 ↗(On Diff #44699)

Why the comment if it's defined above? Does this need to be an assert anywhere?

sys/vm/uma.h
47 ↗(On Diff #44699)

redundant declaration?

284 ↗(On Diff #44699)

slabs of UMA_PCPU_ZONE_SIZE

sys/vm/uma_core.c
1193 ↗(On Diff #44699)

In this case these should probably just be SYSTEM. These pages are not critical for pageout for example.

1383 ↗(On Diff #44699)

UMA_PCPU_ZONE_SIZE

sys/x86/acpica/srat.c
539 ↗(On Diff #44699)

Is this required with the patch because we're adding new consumers? Does subr_smp.c not initialize all pcpu fields to 0?

sys/sys/pcpu.h
221 ↗(On Diff #44699)

I moved it without removing the comment.

sys/vm/uma.h
47 ↗(On Diff #44699)

Yes.

284 ↗(On Diff #44699)

OK

sys/x86/acpica/srat.c
539 ↗(On Diff #44699)

Yes. Yes. I like to have the assignment just for symmetry.

  • use mnemonic when setting slab size
sys/sys/pcpu.h
206 ↗(On Diff #44703)

The name is kind of confusing to me since this is really the allocation size, not the size of a zone. Maybe UMA_PCPU_ALLOC_SIZE?

sys/vm/uma_core.c
1181 ↗(On Diff #44703)

This is called struct pglist elsewhere.

1196 ↗(On Diff #44703)

Why specify a domain at all in this case?

1203 ↗(On Diff #44703)

What's the point of trying the allocation twice?

sys/sys/pcpu.h
206 ↗(On Diff #44703)

OK.

sys/vm/uma_core.c
1181 ↗(On Diff #44703)

Will fix, but note that this is copy pasta from alloc_noobj.

1196 ↗(On Diff #44703)

Fair point.

1203 ↗(On Diff #44703)

If you look at the condition you'll note that the second allocation was intended to request from domain 0 and I slipped. Setting hw.physmem will let you create a situation with multiple sockets where there's only memory in domain 0. I imagine that there are situations where the fallback should be somewhere else but I think this covers the most common cases. The PowerPC port also seems to have weird issues here. It may simply be that they're setting pc_domain != 0 without NUMA defined.

  • rename slab size
  • don't reference domain on cpu absent or alloc failure
sys/vm/uma_core.c
1203 ↗(On Diff #44703)

arm64 will do that as well. See cpu_init_fdt(). The memory locality info is in a different subtree from the CPU info which includes a NUMA node ID. The assertion you added in the #ifndef NUMA path will fail on, e.g., ThunderX.

sys/vm/uma_core.c
1203 ↗(On Diff #44703)

On a dual package ThunderX some interrupts must be handled on the same package as they were sent. Because of this we need to know which package a given CPU is on, so set the CPUs domain unconditionally.

I would argue that conflating VM domain with sockets is not robust. However, remove the assert we'll cope with that elsewhere.

sys/vm/uma_core.c
1203 ↗(On Diff #44703)

We really should change arm64. The VM assumes pc_domain is the memory domain. It sounds like this is being overloaded to mean 'package' or 'interrupt controller domain' or something similar. It's going to prevent arm64 from ever enabling NUMA and I'd rather not have workarounds for specific architecture pc_domain quirks.

Andrew, is that something you can do?

sys/vm/uma_core.c
1203 ↗(On Diff #44703)

Just as importantly as enabling NUMA on arm64, this places an extra burden on all consumers to validate pc_domain for use with the VM. I already have hacks in hwpmc and epoch to workaround invalid pc_domain values. As domain specific allocation becomes more common there will need to be more hacking around this. md_pcpu is where to put fields with distinct meanings.

sys/vm/uma_core.c
1203 ↗(On Diff #44703)

The pcpu domain on arm64 is the NUMA domain. We read it from the CPUs numa-node-id property as this is where the NUMA node ID for the CPU is stored. Note, this is not the package ID, or 'interrupt controller domain'. It is the NUMA domain [1].

In the GICv3 driver we also read the numa-node-id property to tell us which NUMA domain it is attached to. This is because it is a memory mapped device, so the IO bus will be attached to a NUMA domain. This is still not a package ID, or 'interrupt controller domain' as it is the domain the memory bus is connected directly to.

While in theory we could route interrupts to any CPU the overhead of sending them to a different NUMA domain could be high so it makes sense to ensure they are sent to the correct domain. This also helps with an erratum where MSI/MSI-X interrupts cannot be sent between packages on ThunderX. It happens that a NUMA domain and a package are identical there, however this is not assumed to be the case.

In short pc_domain on arm64 already holds the NUMA domain, not the package ID, or 'interrupt controller domain'.

[1] https://github.com/torvalds/linux/blob/master/Documentation/devicetree/bindings/numa.txt

sys/vm/uma_core.c
1203 ↗(On Diff #44703)

As was clearly stated this field is for use by the VM system. If NUMA is not defined a non-zero value is invalid. If you want something for your arch use, add it to md_pcpu.

jhb added inline comments.
sys/vm/uma_core.c
1203 ↗(On Diff #44703)

If it is the real numa id, why can't arm64 turn on options NUMA and populate the physical segments table? It's not a lot of code to populate that table (only the size of the SRAT parser).

sys/vm/uma_core.c
1203 ↗(On Diff #44703)

So does arm not number domains from 0? We are using a hardware value? So when NUMA is disabled the domain is non-zero? Sort of like apic id vs cpuid on intel?

The VM only supports domain numbers up to MAXMEMDOM. While they aren't required to be densely populated it is assumed that without NUMA specified 0 is the only valid domain.

x86 also routes interrupts to the correct domain. I wasn't suggesting we do otherwise. I'm just trying to understand what is generating the special case so we can simplify consumers of pc_domain.

I would like to keep all cores set in cpuset_domain[0] in the !NUMA case so there are no surprises. Other than that this looks good to go.

This revision is now accepted and ready to land.Jul 6 2018, 1:04 AM
This revision was automatically updated to reflect the committed changes.