Paths

Table of Contentst

devfs(5): Convert global mutex to rwlock
AbandonedPublic
Actions

Authored by cem on Aug 17 2020, 4:55 PM.

Details

Reviewers

mjg
kib
markj

Summary

Mostly this is a boring change; most consumers continue to use wlock.
The only relaxed consumer (for now) is vn_isdisk(), which no longer
requires the exclusive global mutex. vn_isdisk -> global devmtx is a
significant point of contention on many-core systems with buf-heavy workloads.

Submitted by: attilio (earlier version)

Test Plan

I've elided the boring and mechanical dev_lock -> dev_wlock and
dev_lock_assert_locked -> dev_lock_assert_wlocked changes from this revision to
make it clearer where real changes are. If you want me to paste those
separately, or with both sets of changes together, let me know.

I've also kept the unr(9) conversion to accept an rwlock (it's not especially
interesting) out as a separate patch. Let me know if you want to review it,
but it's pretty trivial.

This changeset was added to OneFS in 2014, I think in large part due to
contention on dev_refthread/relthread, which used global dev_lock at the time.
Now in FreeBSD they use per-cdev mutexes, so the rwlock devlock doesn't help
us there anymore. vn_isdisk still matters quite a bit, though.

Diff Detail

Lint

Lint Passed

Unit

No Test Coverage

Build Status

Buildable 33029
Build 30415: arc lint + arc unit

Event Timeline

cem requested review of this revision.Aug 17 2020, 4:55 PM

cem created this revision.

Harbormaster completed remote builds in B33029: Diff 75895.Aug 17 2020, 4:55 PM

For the unr(9) changes, did you provide an interface that accepts an arbitrary lock_object, the way _sleep(), _callout_init_lock(), etc. work?

sys/kern/kern_conf.c
147	Why not dev_wunlock()?
sys/kern/kern_mutex.c
1227	Why does devlock need to be initialized so early?

I disagree with the change. If locking at hand is a problem the code should be reworked to not need it in the first place.

I will note most consumers don't care for a specific error code, just whether it is a device or not.

One trivial idea would be to add a flag to ->v_irflag maintained by devfs. For non-VCHR there is already nothing to do. For VCHR I would note devices have a backpointer to all associates vnodes. Should the vnode point to a "disk" it can get a VIRF_ISDISK flag or similar. Should anything go wrong with the device, pointers get NULLified and whatnot, a loop over said vnodes can unset the flag. This would satisify almost everyone.

The remaining cases can call the routine in the current form. Should there be a frequent consumer which also wants a specific error further changes can be made.

To extend the point the fast path would be +/-:

diff --git a/sys/kern/vfs_subr.c b/sys/kern/vfs_subr.c
index 2df609d9118..0bb97e981ae 100644
--- a/sys/kern/vfs_subr.c
+++ b/sys/kern/vfs_subr.c
@@ -4977,8 +4977,15 @@ vn_need_pageq_flush(struct vnode *vp)
 /*
  * Check if vnode represents a disk device
  */
+bool
+vn_isdisk(struct vnode *vp)
+{
+
+       return ((vp->v_irflag & VIRF_ISDISK) != 0);
+}
+
 int
-vn_isdisk(struct vnode *vp, int *errp)
+vn_isdisk_error(struct vnode *vp, int *errp)
 {
        int error;
 
@@ -4994,10 +5001,11 @@ vn_isdisk(struct vnode *vp, int *errp)
                error = ENXIO;
        else if (!(vp->v_rdev->si_devsw->d_flags & D_DISK))
                error = ENOTBLK;
+       if (error == 0)
+               MPASS(vn_isdisk(vp));
        dev_unlock();
 out:
-       if (errp != NULL)
-               *errp = error;
+       *errp = error;
        return (error == 0);
 }

with everyone passing NULL for error sticking to one-argument vn_isdisk (can even become an inline). The few other cases can use vn_isdisk_error.

Of course the real work is making devfs cooperate here, but I think this is the way to go.

I presume the problematic consumer in your workload is vfs_bio.c. It rolls with:

bsize = vn_isdisk(vp, NULL) ? DEV_BSIZE : bo->bo_bsize;

is the call really necessary to begin with? And is DEV_BSIZE really the right thing to use here? Perhaps the right size can already be there in bo_bsize?

In D26086#579112, @markj wrote:

For the unr(9) changes, did you provide an interface that accepts an arbitrary lock_object, the way _sleep(), _callout_init_lock(), etc. work?

Yes, as a static helper inside subr_unr.c; not a published API.

In D26086#579115, @mjg wrote:

I disagree with the change. If locking at hand is a problem the code should be reworked to not need it in the first place.

Sure, that's fine. Setting some flag on the vnode is not too bad; the harder part is invalidation when disks go away (if we move away from the dev_lock synchronization currently used for both destroy_devl and vn_isdisk).

I will note most consumers don't care for a specific error code, just whether it is a device or not.

Agreed, that's all that really matters.

One trivial idea would be to add a flag to ->v_irflag maintained by devfs. For non-VCHR there is already nothing to do. For VCHR I would note devices have a backpointer to all associates vnodes.

What's the backpointer?

sys/kern/kern_conf.c
147	No good reason. I simply forgot to do in my mechanical transformation patch and didn't want to add the churn later. I'd be happy to change it.
sys/kern/kern_mutex.c
1227	I'm not sure if it does, but given devmtx was already initialized around this time, it seems relatively harmless.

In D26086#579135, @mjg wrote:

Of course the real work is making devfs cooperate here

Right.

In D26086#579164, @mjg wrote:

I presume the problematic consumer in your workload is vfs_bio.c. It rolls with:

Yeah, there are about four consumers in vfs_bio and they are all likely hot in our workloads. getblk/brelse paths.

bsize = vn_isdisk(vp, NULL) ? DEV_BSIZE : bo->bo_bsize;
is the call really necessary to begin with? And is DEV_BSIZE really the right thing to use here? Perhaps the right size can already be there in bo_bsize?

DEV_BSIZE is really the right thing, although I don't remember why bo_bsize wasn't set to DEV_BSIZE instead. Maybe we can set bo_bsize to match DEV_BSIZE when initializing bufobjs for disks. I don't know.

Either way, there are a few other consumers that wouldn't be addressed by fixing bo_bsize, so we need a cheap version of this anyway.

Some historical context around mismatched bo_bsize: https://svnweb.freebsd.org/base?view=revision&revision=206097

In D26086#579224, @cem wrote:

In D26086#579115, @mjg wrote:

One trivial idea would be to add a flag to ->v_irflag maintained by devfs. For non-VCHR there is already nothing to do. For VCHR I would note devices have a backpointer to all associates vnodes.

What's the backpointer?

Is that de->de_vnode / vp->v_data? Yeah, that works.

We already similarly grab VV_ISTTY out of d_flags & D_TTY.

See devfs_revoke. All vnodes from all mount points need to be updated. Some refactoring may be needed to make it feasible.

Alternatively, if that runs into too many woes, as a half measure the flag could be implemented in struct cdev. Then vn_isdisk would get away with taking vnode interlock.

I looked at all vn_isdisk() consumers in vfs_bio.c, ufs/ffs, and vm/. As was correctly noted, vn_isdisk() provides one of three results: disk/not disk/device destroyed. I believe that 'device destroyed' answer is accidental to the way vn_isdisk() currently works and in fact is badly handled for vfs_bio.c and vm/vnode_pager.c. The only place where 'disk destroyed' (AKA ENXIO) result could have some relevance is ffs mount code, where it short-circuits whole mount.

For vfs_bio.c, in fact taking the ENXIO case into account, as it is done now, causes consistency issues. Similarly, for vnode_pager.c, vn_isdisk() is used to gate creation of the devvp v_object where I do not see a reason to deny it after ENXIO.

So I think that mjg' idea of VIRF_ISDISK flag is worth trying, and the complex synchronization issues could be ignored for hot path because hot path really better served by consistent answer than by the notice that devvp was destroy_dev'ed.

Happily obviated by D26119

In D26086#579112, @markj wrote:

For the unr(9) changes, did you provide an interface that accepts an arbitrary lock_object, the way _sleep(), _callout_init_lock(), etc. work?

I don't have any use for the unr changes, but in case they are interesting to someone in the future:

https://people.freebsd.org/~cem/unr.lock.abstraction.1of2.patch
https://people.freebsd.org/~cem/unr.lock.abstraction.2of2.patch