
mnt_vnode_next_active: use conventional lock order when trylock fails
ClosedPublic

Authored by rlibby on May 12 2017, 8:30 AM.

Details

Summary

Previously, when VI_TRYLOCK failed, we would spin under the mutex
that protects the vnode active list until we either succeeded or
noticed that we had hogged the CPU. Since we were violating the lock
order, any deadlock condition (e.g. a race with vdrop(9) on the same
vnode) guaranteed that we would become a hog. In the presence of many
concurrent threads in sync(2), vdrop, etc., the victim could hang for
a long time.

Now, avoid spinning by dropping and reacquiring the locks in the
conventional lock order when the trylock fails. This requires a dance
with the vnode hold count.
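A sketch of the new shape, assuming a list_mtx name for the mutex
protecting the active list and a restart label for the retry path;
vfs_refcount_acquire_if_not_zero() is the static helper discussed in
the review below. Illustrative only, not the committed code:

    if (!VI_TRYLOCK(vp)) {
            /*
             * Take a hold reference without touching the vnode
             * interlock, so vp cannot be freed once list_mtx is
             * dropped.  A hold count of zero means the vnode is
             * already on its way out; restart from the marker.
             */
            if (!vfs_refcount_acquire_if_not_zero(&vp->v_holdcnt))
                    goto restart;
            /* Reacquire in conventional order: interlock, then list. */
            mtx_unlock(list_mtx);
            VI_LOCK(vp);
            mtx_lock(list_mtx);
            /*
             * Revalidate vp's position relative to our marker and
             * drop the hold back to neutral (both discussed below),
             * aborting if the vnode moved or was reclaimed.
             */
    }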

Test Plan
rm -rf /var/foo
fsstress -d /var/foo -l 0 -n 10000 -p 50 &
for i in {1..15}; do sleep 60; time sync; done

fsstress issues some sync(2) calls, and these will sometimes become
ensnared in the VI_TRYLOCK spin. This is not a serious performance
test, and timing sync is not a great measurement, but it does show
occasional delays. Here a/b are before/after the patch:

$ ministat a b
x a
+ b
+------------------------------------------------------------------------------+
|    ++                                                                        |
| ++ ++  x                                                                     |
| +**+*x+** x +    x    x  x    x      x             x                        x|
||_|_MA__|__M_________A____________________|                                   |
+------------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  15         0.075          1.38         0.239    0.40873333     0.3664479
+  15         0.057          0.26         0.113    0.12513333   0.055064724

Diff Detail

Repository
rS FreeBSD src repository - subversion

Event Timeline

sys/kern/vfs_subr.c
5399 ↗(On Diff #28267)

After this moment, the vp vnode could be reclaimed by another thread and no longer belong to the specified mount.

Also, why do you use acquire_if_not_zero? In principle, we try to keep only vnodes with v_holdcnt > 0 on the active list, but this is not a strong invariant. IMO a normal vhold() should be used there.

sys/kern/vfs_subr.c
5399 ↗(On Diff #28267)

> After this moment, the vp vnode could be reclaimed by another thread and no longer belong to the specified mount.

It can be vgone'd and removed from the mount, but our hold should prevent it from being freed. I check for that case after taking the VI; it is the VNASSERT and the if ((vp->v_iflag & VI_ACTIVE) == 0) check below.

> Also, why do you use acquire_if_not_zero? In principle, we try to keep only vnodes with v_holdcnt > 0 on the active list, but this is not a strong invariant. IMO a normal vhold() should be used there.

Because in the case where v_holdcnt == 0, it would try to take the VI, which is a LOR and could deadlock.

Besides that, to drop our hold back to neutral below, we still need vfs_refcount_release_if_not_last, since we are not willing to give up our VI. If we take that as given, then the paired vfs_acquire_if_not_zero doesn't seem unnatural to me.
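Under the same assumptions, the release side might look like this
sketch (vfs_refcount_release_if_not_last() as named above; the
restart label is hypothetical):

    /*
     * Drop the hold back to neutral while keeping the vnode
     * interlock.  If ours would be the last reference we cannot
     * release it here: that requires the full vdropl() path, which
     * frees or requeues the vnode and unlocks the interlock.
     */
    if (!vfs_refcount_release_if_not_last(&vp->v_holdcnt)) {
            vdropl(vp);     /* last ref: dispose; drops the interlock */
            goto restart;
    }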

sys/kern/vfs_subr.c
5399 ↗(On Diff #28267)

I mean, explicitly check that vp->v_mount == mp after the locks are obtained, and abort if not.

After your note, I agree with the use of acq_if_not().

sys/kern/vfs_subr.c
5399 ↗(On Diff #28267)

Okay, I can add v_mount == mp as an abort condition.

I am a little confused though: is that not redundant? My understanding is that VI_ACTIVE implies v_mount == mp, but v_mount == mp doesn't imply active (i.e. the vnode may be VI_FREE). So the active check seems to be both necessary (because this is an iterator for active vnodes and for active list traversal) and sufficient (in terms of v_mount). I think that v_mount == mp is covered by the VNASSERT, but as that is INVARIANTS-only, I could also add a hard assert: if (active && vp->v_mount != mp) panic(). Or maybe just further clarify the comments?

What's your preference here?
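For concreteness, a hedged sketch of the abort condition being
discussed (the active local and abort label are hypothetical, and
this variant was later superseded by the position recheck agreed on
below):

    /*
     * Abort unless vp is still an active vnode.  VI_ACTIVE should
     * imply v_mount == mp; the VNASSERT documents that under
     * INVARIANTS.
     */
    active = (vp->v_iflag & VI_ACTIVE) != 0;
    VNASSERT(!active || vp->v_mount == mp, vp,
        ("active vnode not on the expected mount"));
    if (!active)
            goto abort;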

sys/kern/vfs_subr.c
5399 ↗(On Diff #28267)

Your comment makes me suggest a different check: instead of the VI_ACTIVE and mp checks, abort if TAILQ_NEXT(tmp) != vp. This will cover reclamation, inactivation, and even a theoretical reordering of the active list. I do not think that it is important to skip markers, as in the INVARIANTS block below.

And the INVARIANTS check is not needed after that.

sys/kern/vfs_subr.c
5399 ↗(On Diff #28267)

I'm happy to arrange it that way if that's how you prefer. I like that this method is more robust because it relies on much simpler logic. I had opted for the VI_ACTIVE check because it doesn't require acquiring another lock just to determine that you need to abort. (On the other hand, that lock is acquired anyway in the success path, so it's probably not a big deal.)

I do think that the "same position" check should still skip markers, though. The reason is that markers can be expected to pile up in the active list in front of a vnode whose VI lock is held for a long time (consider the case of a long-held VI and "many" active vnode iterators).

Incidentally, I think that reordering of our vnode in the active list should not currently happen (markers excepted), since our hold count prevents our vnode from being dropped and re-added. So I think both approaches are correct and safe.

sys/kern/vfs_subr.c
5399 ↗(On Diff #28267)

I am fine with adding the skip to the check of the same position.

kib feedback: just recheck position relative to marker in active list
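A minimal sketch of the agreed recheck, assuming the v_actfreelist
linkage and the VMARKER vnode type, with mvp as our marker
(illustrative, not the committed code):

    /*
     * vp must still immediately follow our marker on the active
     * list, skipping any other iterators' markers that piled up
     * while the interlock was contended.
     */
    tmp = TAILQ_NEXT(mvp, v_actfreelist);
    while (tmp != NULL && tmp->v_type == VMARKER)
            tmp = TAILQ_NEXT(tmp, v_actfreelist);
    if (tmp != vp) {
            /* vp was reclaimed, inactivated, or moved: abort. */
    }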

This revision is now accepted and ready to land. May 12 2017, 9:30 PM

This change is a nice piece of work. Thanks for figuring it out. Going through your discussion with kib on the subtleties of the flags and hold-count semantics makes me wish it were simpler, but I do not see a way to do so and still keep the (very useful) semantics of vgone().

Peter, could you please test this change? The interesting load is the same as the one used for testing r244240, and the UP config is probably the most important, but I do not expect surprises.

In D10692#221813, @kib wrote:

Peter, could you please test this change? The interesting load is the same as the one used for testing r244240, and the UP config is probably the most important, but I do not expect surprises.

OK, I'll do that.

Ran all of the stress2 tests on MP + a buildworld.
For the UP test I ran a buildworld + the tests used for testing r244240.
No problems seen.

This revision was automatically updated to reflect the committed changes.