Page MenuHomeFreeBSD

Add a new bus method to fetch device-specific CPU sets.
ClosedPublic

Authored by jhb on Mar 2 2016, 12:48 AM.
Tags
None
Referenced Files
Unknown Object (File)
Sun, Dec 22, 7:33 PM
Unknown Object (File)
Tue, Dec 3, 10:15 PM
Unknown Object (File)
Mon, Dec 2, 3:47 AM
Unknown Object (File)
Sat, Nov 30, 10:15 PM
Unknown Object (File)
Nov 29 2024, 7:29 AM
Unknown Object (File)
Nov 28 2024, 7:52 PM
Unknown Object (File)
Nov 25 2024, 2:22 PM
Unknown Object (File)
Nov 23 2024, 6:39 PM

Details

Summary

Add a new bus method to fetch device-specific CPU sets.

bus_get_cpus() returns a specified set of CPUs for a device. It accepts
an enum for the second parameter that indicates the type of cpuset to
request. Currently two valus are supported:

  • LOCAL_CPUS (on x86 this returns all the CPUs in the package closest to the device when NUMA is enabled)
  • INTR_CPUS (like LOCAL_CPUS but only returns 1 SMT thread for each core)

For systems that do not support NUMA (or if it is not enabled in the kernel
config), LOCAL_CPUS fails with EINVAL. INTR_CPUS is mapped to 'all_cpus'
by default. The idea is that INTR_CPUS should always return a valid set.

The x86 nexus driver exposes the internal set of interrupt CPUs from the
the x86 interrupt code via INTR_CPUS.

The ACPI bus driver and PCI bridge drivers use _PXM to return a suitable
LOCAL_CPUS set when _PXM exists and NUMA is enabled. They also and the
global INTR_CPUS set from the nexus driver with the per-domain set from
_PXM to generate a local INTR_CPUS set for child devices.

Test Plan
  • Boot a test kernel both with and without NUMA on a dual-socket system and examine the values of the '%intr_cpus' and '%local_cpus' sysctls.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

jhb retitled this revision from to Add a new bus method to fetch device-specific CPU sets..
jhb updated this object.
jhb edited the test plan for this revision. (Show Details)
jhb added reviewers: kib, kmacy, gallatin, glebius.
jhb added subscribers: scottl, benno, imp.

This is a more recently updated version of the patches Kip pulled out previously in D5210. I tested this today after updating it and adding the setsize argument. The '%intr_cpus' and '%local_cpus' nodes are just for testing. I don't plan on leaving them in the final patch unless people think they will be useful beyond simple testing.

Why return an error when NUMA is not configured ? Wouldn't returning the full cpu set make the callers to handle less special cases ?

The question is only because I do not understand the newbus plumbing. Is it possible for the method to be overloaded for the direct acpi children ? In particular, I am interested in the non-PCI north bridge inhabitants like IOATs and DMARs which do have proximity specified, but do not have ACPI device object at all.

In D5519#118415, @kib wrote:

Why return an error when NUMA is not configured ? Wouldn't returning the full cpu set make the callers to handle less special cases ?

The question is only because I do not understand the newbus plumbing. Is it possible for the method to be overloaded for the direct acpi children ? In particular, I am interested in the non-PCI north bridge inhabitants like IOATs and DMARs which do have proximity specified, but do not have ACPI device object at all.

This has been a pet peeve of mine. Even if NUMA is not configured you
may end up on a multi-package system and it will provide you with
useful information. Please don't make it require MAXMEMDOM > 1 as it
does now. Scott and I consider that a bug.

Thanks.

In D5519#118491, @kmacy wrote:
In D5519#118415, @kib wrote:

Why return an error when NUMA is not configured ? Wouldn't returning the full cpu set make the callers to handle less special cases ?

The question is only because I do not understand the newbus plumbing. Is it possible for the method to be overloaded for the direct acpi children ? In particular, I am interested in the non-PCI north bridge inhabitants like IOATs and DMARs which do have proximity specified, but do not have ACPI device object at all.

This has been a pet peeve of mine. Even if NUMA is not configured you
may end up on a multi-package system and it will provide you with
useful information. Please don't make it require MAXMEMDOM > 1 as it
does now. Scott and I consider that a bug.

Thanks.

Or alternatively this means we should turn MAXMEMDOM on by default. However, I agree that simply getting topology (including the device topology from bus_get_domain()) is useful on its own right independent of if the VM system tries to use a NUMA-aware allocator. I'll talk to Alan about divorcing the two concepts so we can enable topology info for devices by default.

In D5519#118415, @kib wrote:

Why return an error when NUMA is not configured ? Wouldn't returning the full cpu set make the callers to handle less special cases ?

Currently MAXMEMDOM is an all-or-nothing thing in that it enables both PCPU_GET(domain) and the mapping of _PXM to that as well as the memory allocator in the VM. Right now the latter is considered fragile enough to not enable by default. As I mentioned in my reply to Kip, the former is useful in its own right and we should probably have that always be enabled if possible. OTOH, even with it enabled we still have the problem that we use statically sized arrays still and if MAXMEMDOM is too small we are still going to end up failing. I've made INTR_CPUS always return something (as in my mind that's what most devices should actually use), but it seemed like LOCAL_CPUS should perhaps only return an answer if it has a real one to give.

The question is only because I do not understand the newbus plumbing. Is it possible for the method to be overloaded for the direct acpi children ? In particular, I am interested in the non-PCI north bridge inhabitants like IOATs and DMARs which do have proximity specified, but do not have ACPI device object at all.

Those also do not have a device_t object AFAIK, so they will have to talk to ACPI directly to do the mapping.

In D5519#118727, @jhb wrote:
In D5519#118415, @kib wrote:

The question is only because I do not understand the newbus plumbing. Is it possible for the method to be overloaded for the direct acpi children ? In particular, I am interested in the non-PCI north bridge inhabitants like IOATs and DMARs which do have proximity specified, but do not have ACPI device object at all.

Those also do not have a device_t object AFAIK, so they will have to talk to ACPI directly to do the mapping.

At least DMARs do have device_t allocated.

wblock added inline comments.
share/man/man9/BUS_GET_CPUS.9
65 ↗(On Diff #13960)

After identifying that this is a method above, it is somewhat redundant to keep referring to it as "The BUS_GET_CPUS method". I suggest it is easier to read as just

.Fn BUS_GET_CPUS
supports...

This is very likely the way we would say it when speaking. I have been told that this is Very Bad because of Reasons, but remain convinced that simpler is better, even in the face of dogma.

67 ↗(On Diff #13960)

Prefer "may" for permission. Maybe

Not all set types are supported for every device.
70 ↗(On Diff #13960)

Passive->active: s/will fail/fails/

72 ↗(On Diff #13960)

s/The following/These/

100 ↗(On Diff #13960)

s/functions/function/

jhb marked 5 inline comments as done.Mar 22 2016, 11:21 PM
jhb added inline comments.
share/man/man9/BUS_GET_CPUS.9
65 ↗(On Diff #13960)

I'm one of the dogma pushers but I will relent. :)

67 ↗(On Diff #13960)

Hmm, the only thing is that there may in fact be a system where every device does support all set types. I can't really think of a better alternative.

jhb marked 2 inline comments as done.Mar 22 2016, 11:23 PM
In D5519#118728, @kib wrote:
In D5519#118727, @jhb wrote:
In D5519#118415, @kib wrote:

The question is only because I do not understand the newbus plumbing. Is it possible for the method to be overloaded for the direct acpi children ? In particular, I am interested in the non-PCI north bridge inhabitants like IOATs and DMARs which do have proximity specified, but do not have ACPI device object at all.

Those also do not have a device_t object AFAIK, so they will have to talk to ACPI directly to do the mapping.

At least DMARs do have device_t allocated.

If you have a proximity value (from some other table, etc.) then you can certainly fix both bus_get_domain() and bus_get_cpus() for these devices to map that to a FreeBSD domain using the same method as for _PXM, though it needs to be done in the parent bus of these devices. What are they attached to now?

In D5519#121953, @jhb wrote:
In D5519#118728, @kib wrote:

At least DMARs do have device_t allocated.

If you have a proximity value (from some other table, etc.) then you can certainly fix both bus_get_domain() and bus_get_cpus() for these devices to map that to a FreeBSD domain using the same method as for _PXM, though it needs to be done in the parent bus of these devices. What are they attached to now?

AFAIU the plumbing, DMARs are children of the acpi0 right now, and devinfo confirms. The DMAR table provides the domain hints for preferred domain for translation tables. I believe, practically this is the same domain which is reported by _PXM for all downstream devices served by the given DMAR.

jhb edited edge metadata.

Update after DEVICE_NUMA changes committed.

With the recently committed changes to add DEVICE_NUMA and enable it by default on amd64, this should now handle NUMA on amd64 out-of-the-box. If needed we can still fix the parent bus of the dmar devices to return correct values as needed. If they are using the correct values for bus_get_domain() already via _PXM then they should already work now.

At this point I think this is a commit candidate modulo removing the 'XXX' debugging sysctl nodes.

In D5519#128182, @jhb wrote:

With the recently committed changes to add DEVICE_NUMA and enable it by default on amd64, this should now handle NUMA on amd64 out-of-the-box. If needed we can still fix the parent bus of the dmar devices to return correct values as needed. If they are using the correct values for bus_get_domain() already via _PXM then they should already work now.

Awesome. Thanks.

jhb edited edge metadata.
  • Merge branch 'master' into numa_bus_get_cpus
  • Merge branch 'master' into numa_bus_get_cpus
  • Restore updated prototypes lost in an earlier merge.
  • <sys/bus.h> now requires <sys/param.h> instead of just <sys/types.h>
  • Merge branch 'master' into numa_bus_get_cpus
This revision was automatically updated to reflect the committed changes.