The previous posting included an rmlock which I considered only a temporary measure from the get go, included specifically to sidestep discussions on how to handle per-cpu counting. Things turned out differently, so after a discussion with Jeff I'm posting an updated review with more rationale.
The main problem to be solved is handling of struct mount counters. They are very frequently modified and almost never need to be checked, which makes them a natural fit for a per-cpu-based scheme. The approach implemented here effectively provides a code section during which we are guaranteed that the modifying party will wait for us to finish (and, the other way around, we guarantee not to modify anything past a certain point). This automatically lends itself to solving other problems (most notably root vnode caching) and is therefore my first pick.
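To make the intended usage concrete, below is a minimal sketch of a plain reference bump done inside such a section. The enter/exit helpers follow the naming used in the examples later in this description; the per-cpu field, the fallback and the assumption that the enter helper leaves the thread pinned (e.g. in a critical section) are mine, for illustration only.
```
/* Sketch only: mnt_ref_pcpu is an assumed per-cpu counter. */
static void
mnt_ref_sketch(struct mount *mp)
{
        int *refp;

        if (vfs_op_thread_enter(mp)) {
                /*
                 * Nobody will modify the mount point before we exit the
                 * section and we are assumed to be pinned to this CPU,
                 * so a plain per-cpu increment is enough.
                 */
                refp = zpcpu_get(mp->mnt_ref_pcpu);
                (*refp)++;
                vfs_op_thread_exit(mp);
                return;
        }
        /* Per-cpu operation is blocked; fall back to the interlock. */
        MNT_ILOCK(mp);
        mp->mnt_ref++;
        MNT_IUNLOCK(mp);
}
```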
There are many different ways to do it, of course. One proposed by jeff@ is to cmpset into the counter. I disagree with this method for the following reasons:
- I find it a little cumbersome to use - it has to account for the case where the count would overflow/underflow, which avoidably extends the common case (see the sketch after this list)
- busy/unbusy and write suspension always modify two counters at the same time. With the proposed approach all the provisions have to be made twice, while with my approach safety has to be provided only once. In other words it is slower for common consumers due to 2 atomic ops and more branches
- but most importantly it does not provide the notion of a code section which can be safely executed in the face of changes to the struct, which I think is an unnecessary loss
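For contrast, here is a rough sketch of the cmpset-into-the-counter scheme argued against above; the blocked bit, the overflow guard and the helper name are all invented for illustration. Note the two extra checks sitting in the common path:
```
/* Invented constants: a flag bit in the counter marks "counting blocked". */
#define REF_BLOCKED     0x80000000u
#define REF_OVERFLOW    0x7fffff00u

static bool
ref_acquire_cmpset_sketch(volatile uint32_t *countp)
{
        uint32_t old;

        old = atomic_load_32(countp);
        for (;;) {
                if (old & REF_BLOCKED)
                        return (false); /* blocked; caller takes the slow path */
                if (old >= REF_OVERFLOW)
                        return (false); /* would overflow; slow path again */
                /* On failure fcmpset refreshes 'old' and we retry. */
                if (atomic_fcmpset_32(countp, &old, old + 1))
                        return (true);
        }
}
```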
The patch below was tested by me with stress2 and poudriere on amd64. On top of that I ran crossmp8 and suj11 from said suite on a powerpc64 box (talos); no problems seen.
Currently the code is rather coarse-grained: either all per-cpu operation is allowed or none is. This can easily be modified later with flags passed by the blocking thread and flags passed by threads entering.
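For illustration, the blocking side could look roughly like the sketch below. The barrier helper is the one used in the examples that follow; the mnt_vfs_ops counter and the exact fencing are my assumptions about the general shape, not quotes from the patch. A flags-based refinement would presumably pass a mask of blocked operations here and have the per-cpu enter path check it.
```
/* Sketch of blocking all per-cpu operation; mnt_vfs_ops is an assumed field. */
static void
vfs_op_enter_sketch(struct mount *mp)
{
        MNT_ILOCK(mp);
        mp->mnt_vfs_ops++;
        MNT_IUNLOCK(mp);
        /* Pairs with a fence in the per-cpu enter path (assumed). */
        atomic_thread_fence_seq_cst();
        /* Wait for threads which entered before seeing the counter. */
        vfs_op_barrier_wait(mp);
}

static void
vfs_op_exit_sketch(struct mount *mp)
{
        MNT_ILOCK(mp);
        MPASS(mp->mnt_vfs_ops > 0);
        mp->mnt_vfs_ops--;
        MNT_IUNLOCK(mp);
}
```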
Root vnode caching would look like this:
```
/* Fall back to the locked lookup if per-cpu operation is blocked. */
if (!vfs_op_thread_enter(mp))
        return (vfs_cache_root_fallback(mp, flags, vpp));
vp = (struct vnode *)atomic_load_ptr(&mp->mnt_rootvnode);
if (vp == NULL || (vp->v_iflag & VI_DOOMED)) {
        vfs_op_thread_exit(mp);
        return (vfs_cache_root_fallback(mp, flags, vpp));
}
/* Safe: the clearing side waits for our section before it can vrele. */
vrefact(vp);
vfs_op_thread_exit(mp);
error = vn_lock(vp, flags);
.....
```
Should the vnode need to be cleared, it can look like this (assuming only one thread can do it):
```
vp = mp->mnt_rootvnode;
if (vp == NULL)
        return (NULL);
mp->mnt_rootvnode = NULL;
/* Publish the cleared pointer before waiting out in-flight sections. */
atomic_thread_fence_seq_cst();
vfs_op_barrier_wait(mp);
vrele(vp);
```
This provides us with a way to get the vnode without locking anything, and it is easy to reason about.