I do not have any machine where this instruction is implemented. Also, the AMD APM page for CLZERO mentions streaming/non-temporal stores being used to implement this instruction, so it might actually cause the same slowing effect as non-temporal SSE stores. Still, I think it is worth a try.
Diff Detail
- Repository
- rS FreeBSD src repository - subversion
Event Timeline
Most of the slowness stemmed from the lack of NUMA support, meaning that for the "wrong" pages the stores were sent across the interconnect. I redid the tests some time ago on the same hardware and there is next to no difference in real time (just more cache misses).
mmacy@ has a machine which should have this instruction, but it is very heavily bottlenecked by contention on the vm object lock, so it's hard to do a real test (128-way). The Sentex folks have smaller EPYC boxes (32-way if memory serves) and perhaps would be willing to lend them for a few tests. However, I think this will have to wait until known significant bottlenecks get taken care of (most notably, for the kernel build, the struct mount mtx, solved with my per-cpu patches, and vm object contention, which jeff@ promised to take care of in the near future).
I just checked and Linux still does not use the instruction if present, which makes me skeptical here. I did not find any lkml discussion with benchmark results either, so it may be that they did not even try.
It is not necessary to do the full-scale test. For instance, a -j 16 build might show some change, which would already be useful. Use of CLZERO might be beneficial because it puts less load on the execution part of the CPU; I do not believe that FreeBSD's only interest is extreme scale-out.
Let's put it another way: I am interested in a correctness review and testing first, then I will look for some demonstration of benefits in specific scenarios.
For instance, Linux did not use PCID on Intel CPUs for a long time, and started utilizing it only with KPTI.
So I just checked and packet.net has 24-way EPYC boxes. I can get one, no problem (paid for by the FF). Then I can get some data points - at least buildkernel on tmpfs and will-it-scale.
I did not want a full-scale test (in fact I wanted something smaller, like 32); I just mentioned that the one box which was immediately available was rather big and suffering from it even at lower scale. But even at 24 we may find the current bottlenecks to be enough of a problem to distort the result.
Booting on packet.net failed and I did not investigate. cperciva gave me a box to play with in EC2 instead.
The box is:
CPU: AMD AMD64 Processor (2200.02-MHz K8-class CPU)
  Origin="AuthenticAMD"  Id=0x800f12  Family=0x17  Model=0x1  Stepping=2
  Features=0x1783fbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,MMX,FXSR,SSE,SSE2,HTT>
  Features2=0xfed83203<SSE3,PCLMULQDQ,SSSE3,FMA,CX16,SSE4.1,SSE4.2,MOVBE,POPCNT,AESNI,XSAVE,OSXSAVE,AVX,F16C,RDRAND,HV>
  AMD Features=0x2e500800<SYSCALL,NX,MMX+,FFXSR,Page1GB,RDTSCP,LM>
  AMD Features2=0xc001f3<LAHF,CMP,CR8,ABM,SSE4A,MAS,Prefetch,Topology,PCXC>
  Structured Extended Features=0x209c01a9<FSGSBASE,BMI1,AVX2,SMEP,BMI2,RDSEED,ADX,SMAP,CLFLUSHOPT,SHA>
  XSAVE Features=0x7<XSAVEOPT,XSAVEC,XINUSE>
  AMD Extended Feature Extensions ID EBX=0x5<CLZERO,XSaveErPtr>
  TSC: P-state invariant, performance statistics
Hypervisor: Origin = "KVMKVMKVM"
real memory = 274743689216 (262016 MB)
avail memory = 262917218304 (250737 MB)
Event timer "LAPIC" quality 600
ACPI APIC Table: <AMAZON AMZNAPIC>
FreeBSD/SMP: Multiprocessor System Detected: 64 CPUs
FreeBSD/SMP: 1 package(s) x 4 groups x 2 cache groups x 4 core(s) x 2 hardware threads
The kernel is r352430 (the test was performed over 2 weeks ago; I did not have the time to write up the results).
I modified the kernel to have a runtime switch for the zeroing routine to use and ran buildkernel with both 16 and 64 threads.
In short, system time got reduced a little bit, but the difference was added to user time (presumably the cost of handling the incurred cache misses). Unfortunately I don't see a good way to get a decent idea of what perf events got triggered. As noted earlier, this probably will have to be revisited after the vm object problem is sorted out. As it is, I think the change is pessimal in that it increases traffic to RAM.
16-way:
x stosq-user
+ clzero-user
+----------------------------------------------------------------------+
| x + + |
|xx x x x x x x x x x x + ++ + ++ +++ + + +|
| |________A_M_____| |_________AM_______| |
+----------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  14       2854.52       2860.53      2857.925     2857.4679     1.7958207
+  14       2861.99       2869.36       2866.11      2865.815     2.0445227
Difference at 95.0% confidence
        8.34714 +/- 1.49528
        0.292117% +/- 0.0523955%
        (Student's t, pooled s = 1.92419)

x stosq-sys
+ clzero-sys
+----------------------------------------------------------------------+
| + + x x x |
|+ + + +++ + ++++ + x x x xxxx x x x|
| |_________AM________| |______MA________| |
+----------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  14        260.34        268.99        263.46     263.86929     2.4009917
+  14        249.99         259.4        254.72     254.41786     2.7984479
Difference at 95.0% confidence
        -9.45143 +/- 2.02612
        -3.58186% +/- 0.756312%
        (Student's t, pooled s = 2.6073)

x stosq-user-sys
+ clzero-user-sys
+----------------------------------------------------------------------+
| x + |
|+ + +x +x* x*x + x ++ xx * + x+ + x x|
| |____|_________MA____A__________|_____| |
+----------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  14       3117.93       3128.64       3120.17     3121.3371      3.282254
+  14       3114.85       3125.43       3120.02     3120.2329     3.1351811
No difference proven at 95.0% confidence
64-way:
x stosq-user
+ clzero-user
+----------------------------------------------------------------------+
| + + +|
|x xx xx + xx +xx x+x+ +x x+ x+ + + + +|
| |_______________A_M___|________|___AM___________| |
+----------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  14       2905.51       2937.02      2925.685     2923.9879     9.6321978
+  14       2919.33        2948.4      2936.755     2936.0993     7.9271262
Difference at 95.0% confidence
        12.1114 +/- 6.85472
        0.414209% +/- 0.23501%
        (Student's t, pooled s = 8.82096)

x stosq-sys
+ clzero-sys
+----------------------------------------------------------------------+
| +xx+ x |
|+ ++ * + + x+ *+ +xx+ xx + x xx x|
| |___________A__________|__MA______________| |
+----------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  14        453.52        521.35       479.165     480.85357     17.333084
+  14        441.47        488.53       462.985     463.02214     13.330426
Difference at 95.0% confidence
        -17.8314 +/- 12.0153
        -3.70829% +/- 2.44093%
        (Student's t, pooled s = 15.4618)

x stosq-user-sys
+ clzero-user-sys
+----------------------------------------------------------------------+
| x + |
|++ + + x x * ++* xx x +x * + * x + x|
| |__________|______MA___M_A___________|_| |
+----------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  14       3387.84       3434.31      3402.385     3404.8414     12.235422
+  14       3373.85       3427.61       3398.51     3399.1214     16.203147
No difference proven at 95.0% confidence
Am I right that the total wall clock time for buildworld is the same, while the system time is slightly reduced?
System time is reduced, but user time is increased. In the current state there is no win, and it's probably a pessimization due to increased memory traffic (as in, it's probably going to hurt more involved workloads). However, the result may be distorted by current bottlenecks (mostly vm object handling), which is why I think this needs to be reevaluated after that stuff gets fixed.
There are some recommendations from AMD about CLZERO; I think this is as official as we can get. From the presentation 'AMD RYZEN™ PROCESSOR SOFTWARE OPTIMIZATION' presented by Ken Mitchell at GDC19:
CLZERO is
- Intended to recover from some otherwise fatal Machine Check Architecture (MCA) errors caused by uncorrectable corrupt memory. Example: kill a corrupt user process but keep the system running.
- Use memset rather than the CLZERO intrinsic to quickly zero memory.