quiesce_cpus schedules the current thread on all available CPUs and is basically broken on contemporary machines: there is no hope of it getting the job done in a reasonable time. In my particular case, on a 104-way box, it had still not completed after 4 minutes. It should be eliminated, but it still has another consumer.
lockprof can get away with a much cheaper requirement: we only need to know that everyone has left the code section. This patch adds an IPI which posts a seq_cst fence, an idea stolen from rmlocks. Paired with a release fence issued before the critical section is exited in the code updating the stats, this provides the synchronization necessary to reliably wait for everyone to leave.
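For illustration only, here is a minimal userspace sketch of the fence pairing in C11 atomics. It is not the patch itself. All names (NSLOTS, prof_enabled, in_section, stats, prof_update, prof_disable_and_collect) are hypothetical, a per-slot flag stands in for the kernel critical section, and since userspace has no IPI, the seq_cst fence the IPI would post on the remote CPU is issued directly on the updater side.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define NSLOTS 4			/* hypothetical: one slot per CPU */

static atomic_bool prof_enabled = true;	/* hypothetical global on/off switch */
static atomic_int in_section[NSLOTS];	/* stand-in for "inside critical section" */
static uint64_t stats[NSLOTS];		/* plain per-slot data, no atomics */

/* Updater side: returns true if the stats were updated. */
bool
prof_update(int slot, uint64_t delta)
{
	bool on;

	assert(slot >= 0 && slot < NSLOTS);
	atomic_store_explicit(&in_section[slot], 1, memory_order_relaxed);
	/* Assumption: in the kernel this fence is posted by the IPI instead. */
	atomic_thread_fence(memory_order_seq_cst);
	on = atomic_load_explicit(&prof_enabled, memory_order_relaxed);
	if (on)
		stats[slot] += delta;
	/* Release fence before leaving, so the waiter sees the stats update. */
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&in_section[slot], 0, memory_order_relaxed);
	return (on);
}

/* Waiter side: disable, wait for everyone to leave, then sum the stats. */
uint64_t
prof_disable_and_collect(void)
{
	uint64_t total = 0;

	atomic_store_explicit(&prof_enabled, false, memory_order_relaxed);
	/* Pairs with the updater-side seq_cst fence. */
	atomic_thread_fence(memory_order_seq_cst);
	for (int i = 0; i < NSLOTS; i++) {
		/* Acquire load pairs with the updater's release fence. */
		while (atomic_load_explicit(&in_section[i],
		    memory_order_acquire) != 0)
			;
		total += stats[i];
	}
	return (total);
}
```

With both seq_cst fences in place, either the updater observes that profiling is disabled and leaves the stats alone, or the waiter observes the slot as occupied and spins until it is released. The point of the IPI in the real patch is to keep that expensive fence out of the updater's fast path.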
With the patch all ops work with almost no delay.
Also note that the approach can be used with vfs_op_thread_enter/exit, but I find it a little iffy without a bitmap of the CPUs which have ever used it, for reasons which I'll explain in a different review.