The previous posting included an rmlock which I considered only a temporary measure from the get-go, included specifically to avoid a discussion about how exactly to handle per-cpu counting. That worked out very differently than intended, so after a discussion with Jeff (who had reservations about even a temporary inclusion of an rmlock) I'm posting an updated review with more rationale.
The main problem to be solved is the handling of struct mount counters. They are very frequently modified and almost never need to be checked, which makes them a natural fit for a per-cpu scheme. The approach implemented here effectively provides a code section during which we are guaranteed that the modifying party will wait for us to finish (and, the other way around, during which we guarantee not to modify anything past a certain point). This automatically lends itself to solving other problems (most notably root vnode caching) and is therefore my first pick.
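As an illustration of the intended fast path, here is a sketch of a reference count bump (my illustration, not the patch itself; zpcpu_get() is the existing per-cpu accessor, while the mnt_ref_pcpu field name is an assumption made up for this example):

    /* Fast path: a plain per-cpu increment, no atomics or locks. */
    if (vfs_op_thread_enter(mp)) {
        /* The section keeps us in a critical section, so the pointer is stable. */
        (*(int *)zpcpu_get(mp->mnt_ref_pcpu))++;
        vfs_op_thread_exit(mp);
        return;
    }
    /* Slow path: per-cpu operation is blocked, fall back to the interlock. */
    MNT_ILOCK(mp);
    MNT_REF(mp);
    MNT_IUNLOCK(mp);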
There are many different ways to do it, of course. One proposed by jeff@ is to cmpset into the counter (see the sketch after this list). I disagree with this method for the following reasons:
- I find it a little cumbersome to use: it has to account for the case where the count would overflow/underflow, which avoidably extends the common case
- busy/unbusy and write suspension always modify two counters at the same time. With the cmpset approach all the provisions have to be made twice, whereas with my approach safety has to be provided only once. In other words, the cmpset approach is slower for common consumers due to 2 atomic ops and more branches
- but most importantly, it does not provide the notion of a code section which can be safely executed in the face of changes to the struct, which I think is an unnecessary loss
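For reference, this is roughly what I understand the cmpset variant to look like (my sketch, not jeff@'s code; vfs_ref_fallback() is a made-up name standing in for whatever the slow path would be). Note the retry loop and the overflow/underflow guard sitting in the common case:

    int old;

    old = atomic_load_int(&mp->mnt_ref);
    for (;;) {
        /* Guard against overflow/underflow before committing the transition. */
        if (__predict_false(old < 0 || old == INT_MAX))
            return (vfs_ref_fallback(mp));
        if (atomic_fcmpset_int(&mp->mnt_ref, &old, old + 1))
            break;
        /* fcmpset refreshed 'old' on failure; retry. */
    }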
The patch below was tested by me with stress2 and poudriere on amd64. On top of that I ran crossmp8 and suj11 from said suite on a powerpc64 box (talos); no problems seen.
Currently the code is rather coarse-grained: either all per-cpu operations are allowed or none are. This can be easily modified later with flags passed by the blocking thread and flags passed by threads entering.
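To make the mechanism concrete, here is a minimal sketch of what the enter/exit pair can look like under the current all-or-nothing policy (the mnt_vfs_ops counter and the mnt_thread_in_ops_pcpu per-cpu flag are names assumed for illustration):

    static inline void
    vfs_op_thread_exit(struct mount *mp)
    {

        atomic_thread_fence_rel();
        *(int *)zpcpu_get(mp->mnt_thread_in_ops_pcpu) = 0;
        critical_exit();
    }

    /* Returns true if the per-cpu section was entered. */
    static inline bool
    vfs_op_thread_enter(struct mount *mp)
    {

        critical_enter();
        /* Advertise ourselves on this CPU... */
        *(int *)zpcpu_get(mp->mnt_thread_in_ops_pcpu) = 1;
        atomic_thread_fence_seq_cst();
        /* ...and bail if someone already blocked per-cpu operation. */
        if (__predict_false(atomic_load_int(&mp->mnt_vfs_ops) > 0)) {
            vfs_op_thread_exit(mp);
            return (false);
        }
        return (true);
    }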
Root vnode caching would look like this:
    if (!vfs_op_thread_enter(mp))
        return (vfs_cache_root_fallback(mp, flags, vpp));
    vp = (struct vnode *)atomic_load_ptr(&mp->mnt_rootvnode);
    if (vp == NULL || (vp->v_iflag & VI_DOOMED)) {
        vfs_op_thread_exit(mp);
        return (vfs_cache_root_fallback(mp, flags, vpp));
    }
    vrefact(vp);
    vfs_op_thread_exit(mp);
    error = vn_lock(vp, flags);
    .....
Should the vnode need to be cleared, it can look like this (assuming only one thread can do it):
    vp = mp->mnt_rootvnode;
    if (vp == NULL)
        return (NULL);
    mp->mnt_rootvnode = NULL;
    atomic_thread_fence_seq_cst();
    vfs_op_barrier_wait(mp);
    vrele(vp);
This provides us with a way to get the vnode without locking anything, and it is easy to reason about.
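For completeness, the barrier only has to wait until no CPU is executing inside the section; any thread that already loaded the old vnode holds a reference by the time it exits, and new arrivals see NULL and fall back. A sketch (again with the assumed per-cpu flag name) could be:

    static void
    vfs_op_barrier_wait(struct mount *mp)
    {
        int cpu;

        CPU_FOREACH(cpu) {
            /* Spin until this CPU is no longer inside the section. */
            while (*(int *)zpcpu_get_cpu(mp->mnt_thread_in_ops_pcpu, cpu) == 1)
                cpu_spinwait();
        }
    }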