
Socket-level scheduling support for multi-socket systems
Needs Review
Public

Authored by selwinsebastian_gmail.com on Oct 8 2018, 9:21 AM.

Details

Reviewers
mmacy
kib
avg
Summary

This adds kernel support for socket-level scheduling on multi-socket systems.

The attached patch was tested on an AMD EPYC 2P system.

Default:
Running sysctl kern.sched.topology_spec on a 2P EPYC system with 64 threads per socket gives the following topology:

kern.sched.topology_spec: <groups>
<group level="1" cache-level="0">

<cpu count="128" mask="ffffffffffffffff,ffffffffffffffff,0,0">0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127</cpu>
<children>
 <group level="2" cache-level="0">
  <cpu count="16" mask="ffff,0,0,0">0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15</cpu>
  <children>
   <group level="3" cache-level="3">
    <cpu count="8" mask="ff,0,0,0">0, 1, 2, 3, 4, 5, 6, 7</cpu>
    <children>
     <group level="4" cache-level="2">
      <cpu count="2" mask="3,0,0,0">0, 1</cpu>
      <flags><flag name="THREAD">THREAD group</flag><flag name="SMT">SMT group</flag></flags>
  ....

Modified:
With the new patch, the topology is as below:

kern.sched.topology_spec: <groups>
<group level="1" cache-level="0">

<cpu count="128" mask="ffffffffffffffff,ffffffffffffffff,0,0">0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127</cpu>
<children>
 <group level="2" cache-level="0">
  <cpu count="64" mask="ffffffffffffffff,0,0,0">0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63</cpu>
  <children>
   <group level="3" cache-level="0">
    <cpu count="16" mask="ffff,0,0,0">0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15</cpu>
    <children>
     <group level="4" cache-level="3">
      <cpu count="8" mask="ff,0,0,0">0, 1, 2, 3, 4, 5, 6, 7</cpu>
      <children>
       <group level="5" cache-level="2">
        <cpu count="2" mask="3,0,0,0">0, 1</cpu>
        <flags><flag name="THREAD">THREAD group</flag><flag name="SMT">SMT group</flag></flags>

Diff Detail

Lint
Lint OK
Unit
No Unit Test Coverage
Build Status
Buildable 20063
Build 19562: arc lint + arc unit

Event Timeline

selwinsebastian_gmail.com changed the visibility from "Public (No Login Required)" to "No One".Oct 8 2018, 9:21 AM
selwinsebastian_gmail.com changed the edit policy from "All Users" to "No One".
selwinsebastian_gmail.com changed the visibility from "No One" to "Custom Policy".Oct 8 2018, 4:24 PM
selwinsebastian_gmail.com changed the visibility from "Custom Policy" to "All Users".

@mmacy Any review comments on this patch?

selwinsebastian_gmail.com changed the visibility from "All Users" to "Public (No Login Required)".Nov 19 2018, 11:57 AM
kib added a comment.Nov 19 2018, 12:44 PM

Added avg to the reviewers list. I have no idea why you forbade editing of the revision for devs.

selwinsebastian_gmail.com changed the edit policy from "No One" to "All Users".
In D17465#385774, @kib wrote:

Added avg to the reviewers list. I have no idea why you forbade editing of the revision for devs.

Corrected. Thank you.

avg added a comment.Nov 20 2018, 6:58 AM

I see what the change does but not why. Could you please provide some rationale?

AMD EPYC (family 17h) can have up to 4 dies per socket.

Die (Dx) view:


C0  | T0 T1 |    ||    | T0 T1 | C4
    --------|    ||    |--------
C1  | T0 T1 | L3 || L3 | T0 T1 | C5
    --------|    ||    |--------
C2  | T0 T1 | #0 || #1 | T0 T1 | C6
    --------|    ||    |--------
C3  | T0 T1 |    ||    | T0 T1 | C7
    ----------------------------

2P system view (two sockets):


     |     -------------|------
     |     |            |     |
   ------------       ------------
   | D1 -- D0 |       | D7 -- D6 |
   | |  \/ |  |       | |  \/ |  |
   | |  /\ |  |       | |  /\ |  | 
   | D2 -- D3 |       | D4 -- D5 |
   ------------       ------------
     |     |            |     |
     ------|------------|     |
           --------------------
SOCKET0			SOCKET1

With the current code, this results in an incomplete scheduler topology:

group level 1 cache-level 0 : TOPO_TYPE_PKG   /* platform */
--------- missing SOCKET-level group -------- /* socket / NUMA */
group level 2 cache-level 0 : TOPO_TYPE_GROUP /* die / shared memory controller */
group level 3 cache-level 3 : TOPO_TYPE_CACHE /* CCX / shared L3 */
group level 4 cache-level 2 : TOPO_TYPE_CORE  /* CPU core */
group level 5 cache-level 1 : TOPO_TYPE_PU    /* SMT thread */

With the patch, the topology is as below:

group level 1 cache-level 0 : TOPO_TYPE_PKG   /* platform */
group level 2 cache-level 0 : TOPO_TYPE_NODE  /* socket / NUMA */
group level 3 cache-level 0 : TOPO_TYPE_GROUP /* die / shared memory controller */
group level 4 cache-level 3 : TOPO_TYPE_CACHE /* CCX / shared L3 */
group level 5 cache-level 2 : TOPO_TYPE_CORE  /* CPU core */
group level 6 cache-level 1 : TOPO_TYPE_PU    /* SMT thread */

The scheduler should also be aware of the socket topology so that, during migration, it can keep threads within the same socket where possible; a thread migrated to the other socket can see higher memory latency and hence worse performance.

avg added subscribers: jeff, cem, mav.Nov 21 2018, 6:18 AM

+ @mav + @jeff + @cem

The scheduler should also be aware of the socket topology so that, during migration, it can keep threads within the same socket where possible.

I think there is no inherent reason why the scheduler should care about sockets.
The ULE scheduler cares only about data affinity, not physical topology.

a thread migrated to the next socket can result in higher latency and hence bad performance.

So, this is about NUMA, it seems.
Then, in my opinion, it would be best to inject the real NUMA information into the topology (and that should be conditional on whether a kernel is NUMA aware, whether memory allocations were done in NUMA aware fashion, etc).
If this patch is a "first approximation" towards the proper NUMA support, then I do not have any strong objections, but I am not convinced either.

jeff added a comment.Nov 21 2018, 6:42 AM

I have a more complete version of a similar concept sitting in a local repo. Give me a few weeks to get it together and we can discuss.