
Discover CPU topology on multi-die AMD Zen systems
ClosedPublic

Authored by cem on Jul 31 2017, 9:49 PM.
Tags
None
Subscribers

Details

Summary

The Nodes per Processor topology information determines how many bits of
the APIC ID represent the Node (Zeppelin die, on Zen systems) ID.
Documented in the Ryzen and Epyc Processor Programming Reference (PPR).
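
For illustration only, here is a minimal sketch of how the node (die) ID could be pulled out of the APIC ID once that shift is known. The helper and its parameter names (node_id_shift, pkg_id_shift) are hypothetical and merely follow the naming style of the existing x86 topology code; this is not part of the patch.

#include <sys/types.h>

/*
 * Hypothetical illustration: the node (Zeppelin die) ID occupies the
 * APIC ID bits between node_id_shift and pkg_id_shift.
 */
static u_int
apic_id_to_node(u_int apic_id, u_int node_id_shift, u_int pkg_id_shift)
{
        u_int node_mask;

        if (pkg_id_shift <= node_id_shift)
                return (0);             /* no node bits: single-die package */
        node_mask = (1u << (pkg_id_shift - node_id_shift)) - 1;
        return ((apic_id >> node_id_shift) & node_mask);
}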

Test Plan

I am currently unable to test this patch; I don't have the hardware in front of me.

See also: https://reviews.freebsd.org/D12019

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

Is there any practical benefit from adding this knowledge about the subdivision of the package into the nodes?
Is there any planned follow-up work?
I would prefer not to rush this change, especially given that the hardware is not very widespread right now and there are some real problems with it.

Also, please note that there is already some form of node detection that is used for [hopefully] correctly assigning L3 caches.

Please do not consider my note below an objection to the patch; rather, it is a generic question. As a side effect, it might affect the patch.

As I understand the current situation, only the topology reported by the corresponding CPUID leaf forms the hierarchy of the cpu_groups. SLIT does not affect the created topology in any way, but in theory it might cause additional splits in the hierarchy. Also, looking at ACPI 6.2, there is the PPTT table, which describes the same topology data in a possibly easier-to-parse format.

Should the main effort go into integration with the ACPI-provided data, or do we want the CPUID/APIC topology to remain the main source of truth?

In D11801#244745, @avg wrote:

Is there any practical benefit from adding this knowledge about the subdivision of the package into the nodes?

Yes — the scheduler can place threads on the same node (die) if a core on the same cache (ccx) is unavailable. This migration is still cheaper than placing the thread on a remote die.

Is there any planned follow-up work?

Yeah, assuming I can get my hands on some hardware.

I would prefer not to rush this change, especially given that the hardware is not very widespread right now and there are some real problems with it.

There's no hurry.

Also, please note that there is already some form of node detection that is used for [hopefully] correctly assigning L3 caches.

Yeah, although that is a different mechanism for an older model family. On family 17h, each CCX has its own L3 cache, but latencies are still lower within the same die than to remote dies.

In D11801#244793, @kib wrote:

As I understand the current situation, only the topology reported by the corresponding CPUID leaf forms the hierarchy of the cpu_groups. SLIT does not affect the created topology in any way, but in theory it might cause additional splits in the hierarchy. Also, looking at ACPI 6.2, there is the PPTT table, which describes the same topology data in a possibly easier-to-parse format.

Should the main effort go into integration with the ACPI-provided data, or do we want the CPUID/APIC topology to remain the main source of truth?

I don't really have any opinion on that. :-)

I don't know how reliable the ACPI-provided data is; it seems like we might have to keep this as a fallback either way?

Makes sense... I didn't realize that there is a two-level split: multiple CCXes/L3s within the same node plus multiple nodes/dies within the same package.
So, we will need to add another entity to the scheduler topology elements as well.

In D11801#244813, @avg wrote:

Makes sense... I didn't realize that there is a two-level split: multiple CCXes/L3s within the same node plus multiple nodes/dies within the same package.
So, we will need to add another entity to the scheduler topology elements as well.

Yep.
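
As an aside, here is a hedged sketch of what such an additional scheduler topology entity could look like as a cpu_group level: a die shares no cache across all of its CPUs, so CG_SHARE_NONE fits, with the CCX/L3 groups as children. The helper below is purely illustrative and is not the code under review.

#include <sys/param.h>
#include <sys/cpuset.h>
#include <sys/smp.h>

/*
 * Illustrative only: fill in a node (die) level cpu_group.  The die has
 * no cache shared by all of its CPUs, so CG_SHARE_NONE is used; the
 * children are the per-CCX groups that do share an L3.
 */
static void
node_group_init(struct cpu_group *cg, struct cpu_group *ccx_groups,
    int nccx, cpuset_t *mask, int ncpus)
{
        CPU_COPY(mask, &cg->cg_mask);
        cg->cg_count = ncpus;
        cg->cg_level = CG_SHARE_NONE;
        cg->cg_children = nccx;
        cg->cg_child = ccx_groups;
}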

jeff added a subscriber: jeff.

To create the correct topology on my 16-core, 2-die system, I had to add the following:

@@ -720,7 +745,8 @@ x86topo_add_sched_group(struct topo_node *root, st
        int ncores;
        int i;
 
-       KASSERT(root->type == TOPO_TYPE_SYSTEM || root->type == TOPO_TYPE_CACHE,
+       KASSERT(root->type == TOPO_TYPE_SYSTEM || root->type == TOPO_TYPE_CACHE ||
+           root->type == TOPO_TYPE_GROUP,
            ("x86topo_add_sched_group: bad type: %u", root->type));
        CPU_COPY(&root->cpuset, &cg_root->cg_mask);
        cg_root->cg_count = root->cpu_count;
@@ -760,7 +786,8 @@ x86topo_add_sched_group(struct topo_node *root, st
        nchildren = 0;
        node = root;
        while (node != NULL) {
-               if (node->type != TOPO_TYPE_CACHE ||
+               if ((node->type != TOPO_TYPE_GROUP &&
+                   node->type != TOPO_TYPE_CACHE) ||
                    (root->type != TOPO_TYPE_SYSTEM &&
                    CPU_CMP(&node->cpuset, &root->cpuset) == 0)) {
                        node = topo_next_node(root, node);
@@ -780,7 +807,8 @@ x86topo_add_sched_group(struct topo_node *root, st
        node = root;
        i = 0;
        while (node != NULL) {
-               if (node->type != TOPO_TYPE_CACHE ||
+               if ((node->type != TOPO_TYPE_GROUP &&
+                   node->type != TOPO_TYPE_CACHE) ||
                    (root->type != TOPO_TYPE_SYSTEM &&
                    CPU_CMP(&node->cpuset, &root->cpuset) == 0)) {
                        node = topo_next_node(root, node);

Here's part of the resulting topo:

<group level="1" cache-level="0">
 <cpu count="32" mask="ffffffff,0,0,0">0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31</cpu>
 <children>
  <group level="2" cache-level="0">
   <cpu count="16" mask="ffff,0,0,0">0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15</cpu>
   <children>
    <group level="3" cache-level="3">
     <cpu count="8" mask="ff,0,0,0">0, 1, 2, 3, 4, 5, 6, 7</cpu>
     <children>
      <group level="4" cache-level="2">
       <cpu count="2" mask="3,0,0,0">0, 1</cpu>
       <flags><flag name="THREAD">THREAD group</flag><flag name="SMT">SMT group</flag></flags>

It would be nice if the standard announcement string added 'die' as a parameter like so:
FreeBSD/SMP: 1 package(s) x 2 die x 8 core(s) x 2 hardware threads

It wouldn't make sense to print it on single-die machines, though. I guess if we went that far, you might ask if we should do:
FreeBSD/SMP: 1 package(s) x 2 die x 2 complex x 4 core(s) x 2 hardware threads

as that would be more accurate for Zen. I'm not sure that much detail is necessary, though. Also, the various latencies involved here are significant enough that an accurate topology will provide real benefit. I'm not sure what we gain by waiting to commit a finalized version of this patch.
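
For what it's worth, a rough, hypothetical sketch of how such an announcement line could be assembled, skipping the die term on single-die machines; this only mimics the message format and is not the actual cpu_mp_announce() code.

#include <stdio.h>

/*
 * Hypothetical: print the SMP announcement, naming dies only when a
 * package contains more than one.
 */
static void
announce_smp_topology(int packages, int dies_per_package, int cores_per_die,
    int threads_per_core)
{
        printf("FreeBSD/SMP: %d package(s)", packages);
        if (dies_per_package > 1)
                printf(" x %d die", dies_per_package);
        printf(" x %d core(s)", cores_per_die);
        if (threads_per_core > 1)
                printf(" x %d hardware threads", threads_per_core);
        printf("\n");
}

int
main(void)
{
        announce_smp_topology(1, 2, 8, 2);      /* 1 x 2 x 8 x 2 = 32 CPUs */
        return (0);
}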

Incorporate Jeff's fixes

Do not create a sub-package group when the node shift wasn't detected (non-Zen systems).
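
A tiny sketch of the kind of check this update describes, using hypothetical shift variables in the spirit of the topology probe; it is not the literal code in the diff.

#include <stdbool.h>
#include <sys/types.h>

/*
 * Hypothetical: build the sub-package (node/die) group only when node ID
 * bits were actually detected; on non-Zen systems the shift stays zero
 * and the topology is created exactly as before.
 */
static bool
want_node_group(u_int node_id_shift, u_int pkg_id_shift)
{
        return (node_id_shift != 0 && node_id_shift < pkg_id_shift);
}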

Does anyone object strongly to this going in? If so, why? Thanks.

This revision is now accepted and ready to land. Aug 17 2017, 10:45 AM
This revision was automatically updated to reflect the committed changes.