
Discover CPU topology on multi-die AMD Zen systems
ClosedPublic

Authored by cem on Jul 31 2017, 9:49 PM.
Tags
None
Subscribers

Details

Summary

The Nodes per Processor topology information determines how many bits of
the APIC ID represent the Node (Zeppelin die, on Zen systems) ID.
Documented in the Ryzen and Epyc Processor Programming Reference (PPR).
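
For illustration only, here is a minimal sketch of how the node (die) ID could be pulled out of the APIC ID once that shift is known. The helper and its parameter names (node_id_shift, pkg_id_shift) are hypothetical and merely follow the naming style of the existing x86 topology code; this is not part of the patch.

#include <sys/types.h>

/*
 * Hypothetical illustration: the node (Zeppelin die) ID occupies the
 * APIC ID bits between node_id_shift and pkg_id_shift.
 */
static u_int
apic_id_to_node(u_int apic_id, u_int node_id_shift, u_int pkg_id_shift)
{
        u_int node_mask;

        if (pkg_id_shift <= node_id_shift)
                return (0);             /* no node bits: single-die package */
        node_mask = (1u << (pkg_id_shift - node_id_shift)) - 1;
        return ((apic_id >> node_id_shift) & node_mask);
}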

Test Plan

I am currently unable to test this patch; I don't have the hardware in front of me.

See also: https://reviews.freebsd.org/D12019

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

Is there any practical benefit from adding this knowledge about the subdivision of the package into the nodes?
Is there any planned follow-up work?
I would prefer not to rush this change, especially given that the hardware is not very widespread right now and there are some real problems with it.

Also, please note that there is already some form of node detection that is used for [hopefully] correctly assigning L3 caches.

Please do not consider my note below an objection to the patch; rather, it is a generic question. As a side effect, it might affect the patch.

As I understand the current situation, only the topology reported by the corresponding CPUID leaf forms the hierarchy of the cpu_groups. SLIT does not affect the created topology in any way, but in theory it might cause additional splits in the hierarchy. Also, looking at ACPI 6.2, there is the PPTT table, which describes the same topology data in a possibly easier-to-parse format.

Should the main effort go into integration with the ACPI-provided data, or do we want the CPUID/APIC topology to remain the main source of truth?

In D11801#244745, @avg wrote:

Is there any practical benefit from adding this knowledge about the subdivision of the package into the nodes?

Yes — the scheduler can place threads on the same node (die) if a core on the same cache (ccx) is unavailable. This migration is still cheaper than placing the thread on a remote die.

Is there any planned follow-up work?

Yeah, assuming I can get my hands on some hardware.

I would prefer not to rush this change, especially given that the hardware is not very widespread right now and there are some real problems with it.

There's no hurry.

Also, please note that there is already some form of node detection that is used for [hopefully] correctly assigning L3 caches.

Yeah, although that is a different mechanism for an older model family. On family 17h, each CCX has its own L3 cache, but latencies are still lower within the same die than to remote dies.

In D11801#244793, @kib wrote:

As I understand the current situation, only the topology reported by the corresponding CPUID leaf forms the hierarchy of the cpu_groups. SLIT does not affect the created topology in any way, but in theory it might cause additional splits in the hierarchy. Also, looking at ACPI 6.2, there is the PPTT table, which describes the same topology data in a possibly easier-to-parse format.

Should the main effort go into integration with the ACPI-provided data, or do we want the CPUID/APIC topology to remain the main source of truth?

I don't really have any opinion on that. :-)

I don't know how reliable the ACPI-provided data is; it seems like we might have to keep this as a fallback either way?

Makes sense... I didn't realize that there is a two-level split: multiple CCXes/L3s within the same node plus multiple nodes/dies within the same package.
So, we will need to add another entity to the scheduler topology elements as well.

In D11801#244813, @avg wrote:

Makes sense... I didn't realize that there is a two-level split: multiple CCXes/L3s within the same node plus multiple nodes/dies within the same package.
So, we will need to add another entity to the scheduler topology elements as well.

Yep.
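
As an aside, here is a hedged sketch of what such an additional scheduler topology entity could look like as a cpu_group level: a die shares no cache across all of its CPUs, so CG_SHARE_NONE fits, with the CCX/L3 groups as children. The helper below is purely illustrative and is not the code under review.

#include <sys/param.h>
#include <sys/cpuset.h>
#include <sys/smp.h>

/*
 * Illustrative only: fill in a node (die) level cpu_group.  The die has
 * no cache shared by all of its CPUs, so CG_SHARE_NONE is used; the
 * children are the per-CCX groups that do share an L3.
 */
static void
node_group_init(struct cpu_group *cg, struct cpu_group *ccx_groups,
    int nccx, cpuset_t *mask, int ncpus)
{
        CPU_COPY(mask, &cg->cg_mask);
        cg->cg_count = ncpus;
        cg->cg_level = CG_SHARE_NONE;
        cg->cg_children = nccx;
        cg->cg_child = ccx_groups;
}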

jeff added a subscriber: jeff.

To create the correct topology on my 16-core, 2-die system, I had to add the following:

@@ -720,7 +745,8 @@ x86topo_add_sched_group(struct topo_node *root, st
        int ncores;
        int i;
 
-       KASSERT(root->type == TOPO_TYPE_SYSTEM || root->type == TOPO_TYPE_CACHE,
+       KASSERT(root->type == TOPO_TYPE_SYSTEM || root->type == TOPO_TYPE_CACHE ||
+           root->type == TOPO_TYPE_GROUP,
            ("x86topo_add_sched_group: bad type: %u", root->type));
        CPU_COPY(&root->cpuset, &cg_root->cg_mask);
        cg_root->cg_count = root->cpu_count;
@@ -760,7 +786,8 @@ x86topo_add_sched_group(struct topo_node *root, st
        nchildren = 0;
        node = root;
        while (node != NULL) {
-               if (node->type != TOPO_TYPE_CACHE ||
+               if ((node->type != TOPO_TYPE_GROUP &&
+                   node->type != TOPO_TYPE_CACHE) ||
                    (root->type != TOPO_TYPE_SYSTEM &&
                    CPU_CMP(&node->cpuset, &root->cpuset) == 0)) {
                        node = topo_next_node(root, node);
@@ -780,7 +807,8 @@ x86topo_add_sched_group(struct topo_node *root, st
        node = root;
        i = 0;
        while (node != NULL) {
-               if (node->type != TOPO_TYPE_CACHE ||
+               if ((node->type != TOPO_TYPE_GROUP &&
+                   node->type != TOPO_TYPE_CACHE) ||
                    (root->type != TOPO_TYPE_SYSTEM &&
                    CPU_CMP(&node->cpuset, &root->cpuset) == 0)) {
                        node = topo_next_node(root, node);

Here's part of the resulting topo:

<group level="1" cache-level="0">
 <cpu count="32" mask="ffffffff,0,0,0">0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31</cpu>
 <children>
  <group level="2" cache-level="0">
   <cpu count="16" mask="ffff,0,0,0">0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15</cpu>
   <children>
    <group level="3" cache-level="3">
     <cpu count="8" mask="ff,0,0,0">0, 1, 2, 3, 4, 5, 6, 7</cpu>
     <children>
      <group level="4" cache-level="2">
       <cpu count="2" mask="3,0,0,0">0, 1</cpu>
       <flags><flag name="THREAD">THREAD group</flag><flag name="SMT">SMT group</flag></flags>

It would be nice if the standard announcement string added 'die' as a parameter like so:
FreeBSD/SMP: 1 package(s) x 2 die x 8 core(s) x 2 hardware threads

It wouldn't make sense to print it on single-die machines, though. I guess if we went that far, you might ask if we should do:
FreeBSD/SMP: 1 package(s) x 2 die x 2 complex x 4 core(s) x 2 hardware threads

as that would be more accurate for Zen. I'm not sure that much detail is necessary, though. Also, the various latencies involved here are significant enough that an accurate topology will provide real benefit. I'm not sure what we gain by waiting to commit a finalized version of this patch.
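
For what it's worth, a rough, hypothetical sketch of how such an announcement line could be assembled, skipping the die term on single-die machines; this only mimics the message format and is not the actual cpu_mp_announce() code.

#include <stdio.h>

/*
 * Hypothetical: print the SMP announcement, naming dies only when a
 * package contains more than one.
 */
static void
announce_smp_topology(int packages, int dies_per_package, int cores_per_die,
    int threads_per_core)
{
        printf("FreeBSD/SMP: %d package(s)", packages);
        if (dies_per_package > 1)
                printf(" x %d die", dies_per_package);
        printf(" x %d core(s)", cores_per_die);
        if (threads_per_core > 1)
                printf(" x %d hardware threads", threads_per_core);
        printf("\n");
}

int
main(void)
{
        announce_smp_topology(1, 2, 8, 2);      /* 1 x 2 x 8 x 2 = 32 CPUs */
        return (0);
}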

Incorporate Jeff's fixes

Do not create a sub-package group when the node shift wasn't detected (non-Zen systems).
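
A tiny sketch of the kind of check this update describes, using hypothetical shift variables in the spirit of the topology probe; it is not the literal code in the diff.

#include <stdbool.h>
#include <sys/types.h>

/*
 * Hypothetical: build the sub-package (node/die) group only when node ID
 * bits were actually detected; on non-Zen systems the shift stays zero
 * and the topology is created exactly as before.
 */
static bool
want_node_group(u_int node_id_shift, u_int pkg_id_shift)
{
        return (node_id_shift != 0 && node_id_shift < pkg_id_shift);
}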

Does anyone object strongly to this going in? If so, why? Thanks.

This revision is now accepted and ready to land. Aug 17 2017, 10:45 AM
This revision was automatically updated to reflect the committed changes.