
mnt_vnode_next_active: use conventional lock order when trylock fails
ClosedPublic

Authored by rlibby on May 12 2017, 8:30 AM.

Details

Summary

Previously, when VI_TRYLOCK failed, we would spin under the mutex
that protects the vnode active list until we either succeeded or
noticed that we had hogged the CPU. Since we were violating the lock
order, any deadlock condition (e.g. a race with vdrop(9) on the same
vnode) guaranteed that we would become a hog. In the presence of many
concurrent threads in sync(2), vdrop, etc., the victim could hang for
a long time.

Now, avoid spinning by dropping and reacquiring the locks in the
conventional lock order when the trylock fails. This requires a dance
with the vnode hold count.
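A sketch of the new shape, assuming a list_mtx name for the mutex
protecting the active list and a restart label for the retry path;
vfs_refcount_acquire_if_not_zero() is the static helper discussed in
the review below. Illustrative only, not the committed code:

    if (!VI_TRYLOCK(vp)) {
            /*
             * Take a hold reference without touching the vnode
             * interlock, so vp cannot be freed once list_mtx is
             * dropped.  A hold count of zero means the vnode is
             * already on its way out; restart from the marker.
             */
            if (!vfs_refcount_acquire_if_not_zero(&vp->v_holdcnt))
                    goto restart;
            /* Reacquire in conventional order: interlock, then list. */
            mtx_unlock(list_mtx);
            VI_LOCK(vp);
            mtx_lock(list_mtx);
            /*
             * Revalidate vp's position relative to our marker and
             * drop the hold back to neutral (both discussed below),
             * aborting if the vnode moved or was reclaimed.
             */
    }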

Test Plan
rm -rf /var/foo
fsstress -d /var/foo -l 0 -n 10000 -p 50 &
for i in {1..15}; do sleep 60; time sync; done

fsstress issues some sync(2) calls, and these will sometimes become
ensnared in the VI_TRYLOCK spin. This is not a serious performance
test, and timing sync is not a great measurement, but it does show
occasional delays. Here a/b are before/after the patch:

$ ministat a b
x a
+ b
+------------------------------------------------------------------------------+
|    ++                                                                        |
| ++ ++  x                                                                     |
| +**+*x+** x +    x    x  x    x      x             x                        x|
||_|_MA__|__M_________A____________________|                                   |
+------------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x  15         0.075          1.38         0.239    0.40873333     0.3664479
+  15         0.057          0.26         0.113    0.12513333   0.055064724

Diff Detail

Repository
rS FreeBSD src repository - subversion

Event Timeline

sys/kern/vfs_subr.c
5399 ↗(On Diff #28267)

After this moment, the vp vnode could be reclaimed by another thread and no longer belong to the specified mount.

Also, why do you use acquire_if_not_zero? In principle, we try to keep only vnodes with v_holdcnt > 0 on the active list, but this is not a strong invariant. IMO a normal vhold() should be used there.

sys/kern/vfs_subr.c
5399 ↗(On Diff #28267)

> After this moment, the vp vnode could be reclaimed by another thread and no longer belong to the specified mount.

It can be vgone'd and removed from the mount, but our hold should prevent it from being freed. I check for that case after taking the VI; it is the VNASSERT and the if ((vp->v_iflag & VI_ACTIVE) == 0) check below.

> Also, why do you use acquire_if_not_zero? In principle, we try to keep only vnodes with v_holdcnt > 0 on the active list, but this is not a strong invariant. IMO a normal vhold() should be used there.

Because in the case where v_holdcnt == 0, it would try to take the VI, which is a LOR and could deadlock.

Besides that, to drop our hold back to neutral below, we still need vfs_refcount_release_if_not_last, since we are not willing to give up our VI. If we take that as given, then the paired vfs_acquire_if_not_zero doesn't seem unnatural to me.
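Under the same assumptions, the release side might look like this
sketch (vfs_refcount_release_if_not_last() as named above; the
restart label is hypothetical):

    /*
     * Drop the hold back to neutral while keeping the vnode
     * interlock.  If ours would be the last reference we cannot
     * release it here: that requires the full vdropl() path, which
     * frees or requeues the vnode and unlocks the interlock.
     */
    if (!vfs_refcount_release_if_not_last(&vp->v_holdcnt)) {
            vdropl(vp);     /* last ref: dispose; drops the interlock */
            goto restart;
    }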

sys/kern/vfs_subr.c
5399 ↗(On Diff #28267)

I mean, explicitly check that vp->v_mount == mp after the locks are obtained, and abort if not.

After your note, I agree with the use of acq_if_not().

sys/kern/vfs_subr.c
5399 ↗(On Diff #28267)

Okay, I can add v_mount == mp as an abort condition.

I am a little confused though: is that not redundant? My understanding is that VI_ACTIVE implies v_mount == mp, but v_mount == mp doesn't imply active (i.e. the vnode may be VI_FREE). So the active check seems to be both necessary (because this is an iterator for active vnodes and for active list traversal) and sufficient (in terms of v_mount). I think that v_mount == mp is covered by the VNASSERT, but as that is INVARIANTS-only, I could also add a hard assert: if (active && vp->v_mount != mp) panic(). Or maybe just further clarify the comments?

What's your preference here?
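For concreteness, a hedged sketch of the abort condition being
discussed (the active local and abort label are hypothetical, and
this variant was later superseded by the position recheck agreed on
below):

    /*
     * Abort unless vp is still an active vnode.  VI_ACTIVE should
     * imply v_mount == mp; the VNASSERT documents that under
     * INVARIANTS.
     */
    active = (vp->v_iflag & VI_ACTIVE) != 0;
    VNASSERT(!active || vp->v_mount == mp, vp,
        ("active vnode not on the expected mount"));
    if (!active)
            goto abort;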

sys/kern/vfs_subr.c
5399 ↗(On Diff #28267)

Your comment makes me suggest a different check: instead of the VI_ACTIVE and mp checks, abort if TAILQ_NEXT(tmp) != vp. This will cover reclamation, inactivation, and even a theoretical reordering of the active list. I do not think that it is important to skip markers, as in the INVARIANTS block below.

And the INVARIANTS check is not needed after that.

sys/kern/vfs_subr.c
5399 ↗(On Diff #28267)

I'm happy to arrange it that way if that's how you prefer. I like that this method is more robust because it relies on much simpler logic. I had opted for the VI_ACTIVE check because it doesn't require acquiring another lock just to determine that you need to abort. (On the other hand, that lock is acquired anyway in the success path, so it's probably not a big deal.)

I do think that the "same position" check should still skip markers, though. The reason is that markers can be expected to pile up in the active list in front of a vnode whose VI lock is held for a long time (consider the case of a long-held VI and "many" active vnode iterators).

Incidentally, I think that reordering of our vnode in the active list should not currently happen (markers excepted), since our hold count prevents our vnode from being dropped and re-added. So I think both approaches are correct and safe.

sys/kern/vfs_subr.c
5399 ↗(On Diff #28267)

I am fine with adding the skip to the check of the same position.

kib feedback: just recheck position relative to marker in active list
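A minimal sketch of the agreed recheck, assuming the v_actfreelist
linkage and the VMARKER vnode type, with mvp as our marker
(illustrative, not the committed code):

    /*
     * vp must still immediately follow our marker on the active
     * list, skipping any other iterators' markers that piled up
     * while the interlock was contended.
     */
    tmp = TAILQ_NEXT(mvp, v_actfreelist);
    while (tmp != NULL && tmp->v_type == VMARKER)
            tmp = TAILQ_NEXT(tmp, v_actfreelist);
    if (tmp != vp) {
            /* vp was reclaimed, inactivated, or moved: abort. */
    }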

This revision is now accepted and ready to land. May 12 2017, 9:30 PM

This change is a nice piece of work. Thanks for figuring it out. Going through your discussion with kib on the subtleties of the flags and hold-count semantics makes me wish it were simpler, but I do not see a way to do so and still keep the (very useful) semantics of vgone().

Peter, could you please test this change? The interesting load is the same as the one used for testing r244240, and the UP config is probably the most important, but I do not expect surprises.

In D10692#221813, @kib wrote:

Peter, could you please test this change? The interesting load is the same as the one used for testing r244240, and the UP config is probably the most important, but I do not expect surprises.

OK, I'll do that.

Ran all of the stress2 tests on MP + a buildworld.
For the UP test I ran a buildworld + the tests used for testing r244240.
No problems seen.

This revision was automatically updated to reflect the committed changes.