On non-trivial SMP systems, contention on the pmc_owner mutex causes a substantial fraction of the captured samples to come from the pmc process itself. This change (a) makes the buffers larger to avoid contention on the global list and (b) makes the working sample buffer per-CPU.
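To illustrate the per-CPU idea, here is a minimal userspace sketch in C. It is not the hwpmc code: the names (percpu_buf, sample_record, sample_drain) and the sizes NCPU and NSAMPLES are hypothetical, and it runs single-threaded, so it shows only the data layout, not the kernel's synchronization. The point is that each CPU appends to its own buffer, so the sampling path never takes a lock shared with other CPUs.

```c
#include <stddef.h>
#include <stdint.h>

#define NCPU     4   /* hypothetical CPU count for this sketch */
#define NSAMPLES 8   /* hypothetical per-CPU buffer capacity */

struct sample {
    uint64_t pc;     /* sampled program counter */
};

/* One private ring per CPU; only its owning CPU appends to it. */
struct percpu_buf {
    struct sample ring[NSAMPLES];
    size_t head;     /* next slot to write */
    size_t count;    /* samples currently stored */
};

static struct percpu_buf bufs[NCPU];

/*
 * Record a sample on the given CPU. No shared lock is needed because
 * the buffer is private to that CPU. Returns 0 on success, or -1 if
 * the CPU index is invalid or the buffer is full (sample dropped).
 */
static int
sample_record(unsigned cpu, uint64_t pc)
{
    struct percpu_buf *b;

    if (cpu >= NCPU)
        return (-1);
    b = &bufs[cpu];
    if (b->count == NSAMPLES)
        return (-1);
    b->ring[b->head].pc = pc;
    b->head = (b->head + 1) % NSAMPLES;
    b->count++;
    return (0);
}

/*
 * Copy up to max samples from one CPU's buffer into out, oldest first,
 * and remove them from the buffer. Returns the number copied.
 */
static size_t
sample_drain(unsigned cpu, struct sample *out, size_t max)
{
    struct percpu_buf *b;
    size_t tail, n, i;

    if (cpu >= NCPU)
        return (0);
    b = &bufs[cpu];
    tail = (b->head + NSAMPLES - b->count) % NSAMPLES;
    n = b->count < max ? b->count : max;
    for (i = 0; i < n; i++)
        out[i] = b->ring[(tail + i) % NSAMPLES];
    b->count -= n;
    return (n);
}
```

In a real kernel the drain side runs concurrently with sampling and needs its own ordering guarantees; the sketch leaves that out on purpose.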
will-it-scale/page_fault1_processes -t 96 -s 30 >& /dev/null &
pmcstat -S UNHALTED_CORE_CYCLES -n 21000000 -O pf1.pmcstat sleep 10
pmcstat -R pf1.pmcstat -z100 -G pf1.stacks
before:
https://github.com/mattmacy/profiling/blob/master/2018.04.22/pf1_orig.svg
after:
https://github.com/mattmacy/profiling/blob/master/2018.04.22/pf1.svg
without pmcstat running:
mmacy@anarchy [~/devel/freebsd|0:09|32] time make -j96 buildkernel -s >& /dev/null
make -j96 buildkernel -s >&/dev/null 2508.41s user 1047.14s system 6000% cpu 59.259 total
with pmcstat running:
pmcstat -S UNHALTED_CORE_CYCLES -n 21000000 -O /dev/null sleep 180 &
mmacy@anarchy [~/devel/freebsd|0:14|48] time make -j96 buildkernel -s >& /dev/null
make -j96 buildkernel -s >&/dev/null 2688.13s user 1547.66s system 6053% cpu 1:09.97 total
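For a rough sense of the remaining profiling overhead, the two timing lines above can be compared directly. The helper below is a hypothetical sketch (pct_increase is not part of any tool) that computes the relative increase between a baseline and a measured value. Applied to the figures above, it gives roughly 18% more wall-clock time (59.259s vs 69.97s), roughly 48% more system time (1047.14s vs 1547.66s), and roughly 7% more user time (2508.41s vs 2688.13s). These are single runs, so treat the percentages as indicative only.

```c
/*
 * Relative increase of `measured` over `baseline`, in percent.
 * Hypothetical helper for comparing the timing lines above.
 */
static double
pct_increase(double baseline, double measured)
{
    return ((measured - baseline) / baseline * 100.0);
}
```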