I do not have any machine where this instruction is implemented. Also, the AMD APM page for CLZERO mentions streaming/non-temporal stores being used to implement this instruction, so it might actually cause the same slowing effect as non-temporal SSE stores. Still, I think it is worth a try.
Diff Detail
- Repository
- rS FreeBSD src repository - subversion
Event Timeline
Most of the slowness stemmed from the lack of NUMA support, meaning that for the "wrong" pages the stores were sent across the interconnect. I redid the tests some time ago on the same hardware and there is next to no difference in real time (just more cache misses).
mmacy@ has a machine which should have this instruction, but it is very heavily bottlenecked by contention on the vm object lock, so it's hard to do a real test (128-way). The Sentex folks have smaller EPYC boxes (32-way if memory serves) and perhaps would be willing to lend them for a few tests. However, I think this will have to wait until known significant bottlenecks get taken care of (most notably, for the kernel build, the struct mount mtx, solved with my per-cpu patches, and vm object contention, which jeff@ promised to take care of in the near future).
I just checked and Linux still does not use the instruction if present, which makes me skeptical here. I did not find any lkml discussion with benchmark results either, so it may be that they did not even try.
It is not necessary to do the full-scale test. For instance, a -j 16 build might show some change, which would already be useful. Use of CLZERO might be beneficial because it puts less load on the execution part of the CPU; I do not believe that FreeBSD's only interest is extreme scale-out.
Let's put it another way: I am interested in a correctness review and testing first, then I will look for some demonstration of benefits in specific scenarios.
For instance, Linux did not use PCID on Intel CPUs for a long time, and started utilizing it only with KPTI.
So I just checked and packet.net has 24-way EPYC boxes. I can get one, no problem (paid for by the FF). Then I can get some data points - at least buildkernel on tmpfs and will-it-scale.
I did not want a full-scale test (in fact I wanted something smaller, like 32); I just mentioned that the one box which was immediately available was rather big and suffering from it even at lower scale. But even at 24 we may find the current bottlenecks to be enough of a problem to distort the result.
Booting on packet.net failed and I did not investigate. cperciva gave me a box to play with in EC2 instead.
The box is:
CPU: AMD AMD64 Processor (2200.02-MHz K8-class CPU)
  Origin="AuthenticAMD"  Id=0x800f12  Family=0x17  Model=0x1  Stepping=2
  Features=0x1783fbff<FPU,VME,DE,PSE,TSC,MSR,PAE,MCE,CX8,APIC,SEP,MTRR,PGE,MCA,CMOV,PAT,PSE36,MMX,FXSR,SSE,SSE2,HTT>
  Features2=0xfed83203<SSE3,PCLMULQDQ,SSSE3,FMA,CX16,SSE4.1,SSE4.2,MOVBE,POPCNT,AESNI,XSAVE,OSXSAVE,AVX,F16C,RDRAND,HV>
  AMD Features=0x2e500800<SYSCALL,NX,MMX+,FFXSR,Page1GB,RDTSCP,LM>
  AMD Features2=0xc001f3<LAHF,CMP,CR8,ABM,SSE4A,MAS,Prefetch,Topology,PCXC>
  Structured Extended Features=0x209c01a9<FSGSBASE,BMI1,AVX2,SMEP,BMI2,RDSEED,ADX,SMAP,CLFLUSHOPT,SHA>
  XSAVE Features=0x7<XSAVEOPT,XSAVEC,XINUSE>
  AMD Extended Feature Extensions ID EBX=0x5<CLZERO,XSaveErPtr>
  TSC: P-state invariant, performance statistics
Hypervisor: Origin = "KVMKVMKVM"
real memory = 274743689216 (262016 MB)
avail memory = 262917218304 (250737 MB)
Event timer "LAPIC" quality 600
ACPI APIC Table: <AMAZON AMZNAPIC>
FreeBSD/SMP: Multiprocessor System Detected: 64 CPUs
FreeBSD/SMP: 1 package(s) x 4 groups x 2 cache groups x 4 core(s) x 2 hardware threads
The kernel is r352430 (the test was performed over 2 weeks ago; I did not have the time to write up the results).
I modified the kernel to have a runtime switch for the zeroing routine to use and ran buildkernel with both 16 and 64 threads.
In short, system time got reduced a little bit, but the difference was added to user time (presumably the cost of handling the incurred cache misses). Unfortunately I don't see a good way to get a decent idea of what perf events got triggered. As noted earlier, this probably will have to be revisited after the vm object problem is sorted out. As it is, I think the change is pessimal in that it increases traffic to RAM.
16-way:
x stosq-user
+ clzero-user
+----------------------------------------------------------------------+
| x + + |
|xx x x x x x x x x x x + ++ + ++ +++ + + +|
| |________A_M_____| |_________AM_______| |
+----------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  14       2854.52       2860.53      2857.925     2857.4679     1.7958207
+  14       2861.99       2869.36       2866.11      2865.815     2.0445227
Difference at 95.0% confidence
        8.34714 +/- 1.49528
        0.292117% +/- 0.0523955%
        (Student's t, pooled s = 1.92419)

x stosq-sys
+ clzero-sys
+----------------------------------------------------------------------+
| + + x x x |
|+ + + +++ + ++++ + x x x xxxx x x x|
| |_________AM________| |______MA________| |
+----------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  14        260.34        268.99        263.46     263.86929     2.4009917
+  14        249.99         259.4        254.72     254.41786     2.7984479
Difference at 95.0% confidence
        -9.45143 +/- 2.02612
        -3.58186% +/- 0.756312%
        (Student's t, pooled s = 2.6073)

x stosq-user-sys
+ clzero-user-sys
+----------------------------------------------------------------------+
| x + |
|+ + +x +x* x*x + x ++ xx * + x+ + x x|
| |____|_________MA____A__________|_____| |
+----------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  14       3117.93       3128.64       3120.17     3121.3371      3.282254
+  14       3114.85       3125.43       3120.02     3120.2329     3.1351811
No difference proven at 95.0% confidence
64-way:
x stosq-user
+ clzero-user
+----------------------------------------------------------------------+
| + + +|
|x xx xx + xx +xx x+x+ +x x+ x+ + + + +|
| |_______________A_M___|________|___AM___________| |
+----------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  14       2905.51       2937.02      2925.685     2923.9879     9.6321978
+  14       2919.33        2948.4      2936.755     2936.0993     7.9271262
Difference at 95.0% confidence
        12.1114 +/- 6.85472
        0.414209% +/- 0.23501%
        (Student's t, pooled s = 8.82096)

x stosq-sys
+ clzero-sys
+----------------------------------------------------------------------+
| +xx+ x |
|+ ++ * + + x+ *+ +xx+ xx + x xx x|
| |___________A__________|__MA______________| |
+----------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  14        453.52        521.35       479.165     480.85357     17.333084
+  14        441.47        488.53       462.985     463.02214     13.330426
Difference at 95.0% confidence
        -17.8314 +/- 12.0153
        -3.70829% +/- 2.44093%
        (Student's t, pooled s = 15.4618)

x stosq-user-sys
+ clzero-user-sys
+----------------------------------------------------------------------+
| x + |
|++ + + x x * ++* xx x +x * + * x + x|
| |__________|______MA___M_A___________|_| |
+----------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  14       3387.84       3434.31      3402.385     3404.8414     12.235422
+  14       3373.85       3427.61       3398.51     3399.1214     16.203147
No difference proven at 95.0% confidence
Am I right that the total wall clock time for buildworld is the same, while the system time is slightly reduced?
System time is reduced, but user time is increased. In the current state there is no win, and it's probably a pessimization due to increased memory traffic (as in, it's probably going to hurt more involved workloads). However, the result may be distorted by current bottlenecks (mostly vm object handling), which is why I think this needs to be reevaluated after that stuff gets fixed.
There are some recommendations from AMD about CLZERO; I think this is as official as we can get. From the presentation 'AMD RYZEN™ PROCESSOR SOFTWARE OPTIMIZATION' presented by Ken Mitchell at GDC19:
CLZERO is
- Intended to recover from some otherwise fatal Machine Check Architecture (MCA) errors caused by uncorrectable corrupt memory. Example: kill a corrupt user process but keep the system running.
- Use memset rather than the CLZERO intrinsic to quickly zero memory.