Before the actual patch: all struct mount counters (i.e. ref, lockref, writeopcount) can and should be split per-cpu. I have a complete patch which does that and which passes my runs of stress2. The patch below is a logical step towards that goal.
Contention on the struct mount mtx is one of the two top bottlenecks during buildkernel on tmpfs.
The basic idea is to provide a "fast path" mode where code which merely manipulates counters can get away without the mnt lock. The fast path is protected with an rmlock. Any time something altering the mount point is to be done (e.g. unmounting, suspending writes), the fast path can be disabled and everyone falls back to the current code.
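A minimal sketch of what a fast-path counter bump could look like; the names mnt_fastpath_rm, mnt_fastpath_disable and mnt_ref_fastpath are hypothetical and not taken from the actual patch:

```
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/mount.h>
#include <sys/rmlock.h>

/* Hypothetical global rmlock protecting the fast path. */
static struct rmlock mnt_fastpath_rm;

static void
mnt_ref_fastpath(struct mount *mp)
{
	struct rm_priotracker tracker;

	rm_rlock(&mnt_fastpath_rm, &tracker);
	if (mp->mnt_fastpath_disable == 0) {
		/* Fast path: bump the ref with an atomic, no mnt mtx. */
		atomic_add_int(&mp->mnt_ref, 1);
		rm_runlock(&mnt_fastpath_rm, &tracker);
		return;
	}
	rm_runlock(&mnt_fastpath_rm, &tracker);

	/*
	 * Slow path: the mutex-protected code. mnt_ref is manipulated
	 * with atomics in both paths; the mutex serializes us against
	 * whoever disabled the fast path.
	 */
	MNT_ILOCK(mp);
	atomic_add_int(&mp->mnt_ref, 1);
	MNT_IUNLOCK(mp);
}
```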
Disabling the fast path takes the rmlock for writing and bumps the disabling counter. Doing this provides an invariant that anyone touching the struct from this point on will see the fast path disabled and will fall back to the slow path (taking the mutex).
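Correspondingly, disabling and re-enabling might look like this (same hypothetical names as above):

```
static void
mnt_fastpath_disable(struct mount *mp)
{

	rm_wlock(&mnt_fastpath_rm);
	mp->mnt_fastpath_disable++;
	rm_wunlock(&mnt_fastpath_rm);
	/*
	 * The write lock drained all read sections; anyone arriving
	 * after this point sees the counter set and takes the slow path.
	 */
}

static void
mnt_fastpath_enable(struct mount *mp)
{

	rm_wlock(&mnt_fastpath_rm);
	MPASS(mp->mnt_fastpath_disable > 0);
	mp->mnt_fastpath_disable--;
	rm_wunlock(&mnt_fastpath_rm);
}
```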
For safety and simplicity, mount points are allocated with the fast path disabled; it has to be enabled later, after the code is done setting everything up.
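For instance, allocation-time code could do (again hypothetical, shown in vfs_mount_alloc()-style code):

```
	/* Born with the fast path off. */
	mp->mnt_fastpath_disable = 1;
	/* ... set up the rest of the mount point ... */
	mnt_fastpath_enable(mp);
```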
This patch contains both the introduction of the rmlock and the switch of mnt_ref to atomics (a further patch will switch the counters to per-cpu). It can be split into two.
I'm not entirely fond of rmlocks here, and the entire thing can be made faster single-threaded at the cost of increased complexity. I think this provides a good enough solution for the time being, with the full SMP win for the common case. I have an experimental patch which uses a hand-rolled barrier instead of an rmlock; I can post it later.
The rmlock can be used for other things as well, e.g. opt-in root vnode caching in a manner similar to what was implemented for zfs in https://reviews.freebsd.org/D17233
Note the patch is generated on top of https://reviews.freebsd.org/D21411
With these patches struct mount mtx almost disappears from profiles.