The Nodes per Processor topology information determines how many bits of
the APIC ID represent the Node (Zeppelin die, on Zen systems) ID.
Documented in Ryzen and Epyc Processor Programming Reference (PPR).
Details
I am currently unable to test this patch, don't have hardware in front of me.
See also: https://reviews.freebsd.org/D12019
Diff Detail
- Repository
- rS FreeBSD src repository - subversion
- Lint
Lint Not Applicable - Unit
Tests Not Applicable
Event Timeline
Is there any practical benefit from adding this knowledge about the subdivision of the package into the nodes?
Is there any planned follow-up work?
I would prefer to not rush with this change especially given that the hardware is not quite wide-spread right now and there are some real problems with it.
Also, please note that there is already some form of the node detection that's used for [hopefully] correctly assigning L3 caches.
Please do not consider my note below as an objection against the patch, rather it is a generic question. As a side effect, it might affect the patch.
As I understand the current situation, only the topology reported by the corresponding CPUID leaf forms the hierarchy of the cpu_groups. SLIT does not affect the topology created in any way, but in theory it might cause additional splits in the hierarchy. Also looking at ACPI 6.2, there is the PPTT table which describes the same topology data in possibly easier to parse format.
Should main efforts go into integration with the ACPI-provided data, or do we want to consider CPUID/APIC topology to remain the main source of truth ?
Yes — the scheduler can place threads on the same node (die) if a core on the same cache (ccx) is unavailable. This migration is still cheaper than placing the thread on a remote die.
Is there any planned follow-up work?
Yeah, assuming I can get my hands on some hardware.
I would prefer to not rush with this change especially given that the hardware is not quite wide-spread right now and there are some real problems with it.
There's no hurry.
Also, please note that there is already some form of the node detection that's used for [hopefully] correctly assigning L3 caches.
Yeah, although that is a different mechanism for an older model family. On 17h each CCX has its own L3 cache, but latencies are still lower within the same die than to remote dies.
I don't really have any opinion on that. :-)
I don't know how reliable the ACPI-provided data is; it seems like we might have to keep this as a fallback either way?
Makes sense... I didn't realize that there is a two-level split: multiple CCX/L3-s within the same node plus multiple nodes/dies within the same package.
So, will need to add another entity to the scheduler topology elements as well.
To create the correct topology on my 16 core 2 die system I had to add the following:
@ -720,7 +745,8 @@ x86topo_add_sched_group(struct topo_node *root, st int ncores; int i; - KASSERT(root->type == TOPO_TYPE_SYSTEM || root->type == TOPO_TYPE_CACHE, + KASSERT(root->type == TOPO_TYPE_SYSTEM || root->type == TOPO_TYPE_CACHE || + root->type == TOPO_TYPE_GROUP, ("x86topo_add_sched_group: bad type: %u", root->type)); CPU_COPY(&root->cpuset, &cg_root->cg_mask); cg_root->cg_count = root->cpu_count; @@ -760,7 +786,8 @@ x86topo_add_sched_group(struct topo_node *root, st nchildren = 0; node = root; while (node != NULL) { - if (node->type != TOPO_TYPE_CACHE || + if ((node->type != TOPO_TYPE_GROUP && + node->type != TOPO_TYPE_CACHE) || (root->type != TOPO_TYPE_SYSTEM && CPU_CMP(&node->cpuset, &root->cpuset) == 0)) { node = topo_next_node(root, node); @@ -780,7 +807,8 @@ x86topo_add_sched_group(struct topo_node *root, st node = root; i = 0; while (node != NULL) { - if (node->type != TOPO_TYPE_CACHE || + if ((node->type != TOPO_TYPE_GROUP && + node->type != TOPO_TYPE_CACHE) || (root->type != TOPO_TYPE_SYSTEM && CPU_CMP(&node->cpuset, &root->cpuset) == 0)) { node = topo_next_node(root, node);
Here's part of the resulting topo:
<group level="1" cache-level="0"> <cpu count="32" mask="ffffffff,0,0,0">0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31</cpu> <children> <group level="2" cache-level="0"> <cpu count="16" mask="ffff,0,0,0">0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15</cpu> <children> <group level="3" cache-level="3"> <cpu count="8" mask="ff,0,0,0">0, 1, 2, 3, 4, 5, 6, 7</cpu> <children> <group level="4" cache-level="2"> <cpu count="2" mask="3,0,0,0">0, 1</cpu> <flags><flag name="THREAD">THREAD group</flag><flag name="SMT">SMT group</flag></flags>
It would be nice if the standard announcement string added 'die' as a parameter like so:
FreeBSD/SMP: 1 package(s) x 2 die x 8 core(s) x 2 hardware threads
It wouldn't make sense to print it on single die machines though. I guess if we went that far you might ask if we should do:
FreeBSD/SMP: 1 package(s) x 2 die x 2 complex x 4 core(s) x 2 hardware threads
as that would be more accurate for zen. I'm not sure if that much detail is necessary though. Also, the various latencies involved here are significant that an accurate topology will provide real benefit. I'm not sure what we gain by waiting to commit a finalized version of this patch.