Currently stats are collected in a MAXCPU-sized array which is not aligned and suffers from enormous false sharing. Fix the problem by using per-cpu allocation.
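To illustrate why the missing alignment matters, here is a minimal userland sketch. It is not kernel code: struct slot, CACHE_LINE and slots_straddling() are hypothetical names, and the 64-byte cache line is an assumption. It counts how many 64-byte slots, laid out back to back from a given base address, span two cache lines and so share a line with a neighbouring cpu's slot.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE 64	/* assumed cache line size in bytes */

/* Hypothetical stand-in for one cpu's 64-byte stats slot. */
struct slot {
	uint64_t counters[8];
};

/*
 * Count how many of the nelem slots, placed contiguously starting at
 * address base, span two cache lines. Every such slot shares a line
 * with an adjacent slot, which is where false sharing comes from.
 */
static size_t
slots_straddling(uintptr_t base, size_t nelem)
{
	size_t i, n = 0;

	for (i = 0; i < nelem; i++) {
		uintptr_t first = base + i * sizeof(struct slot);
		uintptr_t last = first + sizeof(struct slot) - 1;

		if (first / CACHE_LINE != last / CACHE_LINE)
			n++;
	}
	return (n);
}
```

With a cache-line-aligned base no slot straddles a line. With a base that is off by as little as 8 bytes, every slot does, even though each slot is exactly one cache line in size.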
The counter(9) API is not used here as it is too incomplete and does not provide a win over a per-cpu zone sized for the malloc stats struct. In particular, stats are reported for each cpu separately by just copying what is supposed to be the array element for the given cpu.
malloc_type_stats has 3 uint64_t-sized fields for padding (against other cpus: the struct is 64 bytes in size, but the array consisting of these structs is not aligned), which I left in place to simplify stat reporting. I can declare malloc_type_stats_export or similar, populate it appropriately and export that; then the waste can be removed. Note the patch already provides savings: there are mp_maxid + 1 elements, not MAXCPU. Stat collection for uma zones still suffers from the problem and will require a different fix.
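A rough userland sketch of the export idea and of the savings follows. The struct and field names here (stats_full, stats_export, stats_to_export, stats_array_bytes) are made up for illustration and are not the kernel's definitions. The layout assumes five live 64-bit counters plus the three padding words, and the figures in the usage note assume MAXCPU is 256 and a 48-cpu machine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of the 64-byte per-cpu stats struct. */
struct stats_full {
	uint64_t memalloced;
	uint64_t memfreed;
	uint64_t numallocs;
	uint64_t numfrees;
	uint64_t size_mask;
	uint64_t pad[3];	/* padding against other cpus */
};

/* Hypothetical export variant carrying only the live fields. */
struct stats_export {
	uint64_t memalloced;
	uint64_t memfreed;
	uint64_t numallocs;
	uint64_t numfrees;
	uint64_t size_mask;
};

/* Populate the export struct from one cpu's stats. */
static void
stats_to_export(const struct stats_full *src, struct stats_export *dst)
{
	dst->memalloced = src->memalloced;
	dst->memfreed = src->memfreed;
	dst->numallocs = src->numallocs;
	dst->numfrees = src->numfrees;
	dst->size_mask = src->size_mask;
}

/* Bytes of per-cpu stats storage needed per malloc type for ncpus slots. */
static size_t
stats_array_bytes(size_t ncpus)
{
	return (ncpus * sizeof(struct stats_full));
}
```

Under those assumptions the export struct is 40 bytes instead of 64, and sizing by mp_maxid + 1 = 48 rather than MAXCPU = 256 saves (256 - 48) * 64 = 13312 bytes per malloc type.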
The sharing is very visible on Skylake.
lock1 benchmark (48-way) of the will-it-scale suite:
min:164096 max:1070674 total:30436206
min:180144 max:1081428 total:30055726
min:156210 max:1043012 total:29958074
min:126894 max:1080280 total:30590810
min:121228 max:1089982 total:30995994
min:130596 max:1125478 total:31431682

%SAMP IMAGE      FUNCTION             CALLERS
 21.6 kernel     lf_advlockasync      lf_advlock
 21.2 kernel     malloc               lf_advlockasync
  9.5 kernel     lf_free_lock         lf_advlockasync:5.5 lf_activate_lock:4.0
  8.1 kernel     lock_delay           _sx_xlock_hard
  7.2 kernel     uma_zalloc_arg       malloc
  6.1 kernel     uma_zfree_arg        free
  5.7 kernel     _sx_xlock_hard       lf_advlockasync:2.9 lf_free_lock:2.8
  5.0 kernel     free                 lf_free_lock
  1.9 kernel     kern_fcntl           kern_fcntl_freebsd
  1.8 kernel     Xfast_syscall
  1.7 kernel     fget_unlocked        kern_fcntl
  1.4 kernel     copyin_smap_erms     kern_fcntl_freebsd
  1.4 kernel     amd64_syscall
  1.0 kernel     kern_fcntl_freebsd   amd64_syscall
  1.0 kernel     cpu_set_syscall_retv amd64_syscall
  0.8 libc.so.7  0x12fb4a             testcase
  0.5 kernel     sleepq_lock          wakeup
  0.5 kernel     VOP_ADVLOCK_APV      kern_fcntl
patched:
min:189106 max:1738522 total:38471938
min:214540 max:1766486 total:38174320
min:198314 max:1737370 total:38208790
min:187510 max:1721256 total:37609386
min:206284 max:1719908 total:37766580

%SAMP IMAGE      FUNCTION             CALLERS
 25.9 kernel     lf_advlockasync      lf_advlock
 11.1 kernel     lock_delay           _sx_xlock_hard
 10.1 kernel     uma_zalloc_arg       malloc
  9.5 kernel     lf_free_lock         lf_advlockasync:5.0 lf_activate_lock:4.4
  9.1 kernel     uma_zfree_arg        free
  4.5 kernel     _sx_xlock_hard       lf_advlockasync:2.5 lf_free_lock:2.0
  3.3 kernel     Xfast_syscall
  3.2 kernel     kern_fcntl           kern_fcntl_freebsd
  2.9 kernel     fget_unlocked        kern_fcntl
  2.6 kernel     copyin_smap_erms     kern_fcntl_freebsd
  2.3 kernel     amd64_syscall
  2.1 kernel     free                 lf_free_lock
  1.9 kernel     malloc               lf_advlockasync
  1.4 libc.so.7  getdiskbyname        _init
  1.2 kernel     kern_fcntl_freebsd   amd64_syscall
  0.9 kernel     VOP_ADVLOCK_APV      kern_fcntl
  0.8 kernel     cpu_fetch_syscall_ar amd64_syscall
  0.7 kernel     sleepq_lock          wakeup
  0.6 kernel     cpu_set_syscall_retv amd64_syscall
  0.6 liblzma.so lzma_filter_flags_en _init
  0.6 kernel     lf_activate_lock     lf_advlockasync