Page MenuHomeFreeBSD

vfs: work around the race between vget() and vnlru
ClosedPublic

Authored by kib on Thu, May 28, 5:19 PM.
Tags
None
Referenced Files
Unknown Object (File)
Thu, Jun 25, 2:56 AM
Unknown Object (File)
Mon, Jun 15, 6:34 PM
Unknown Object (File)
Sun, Jun 7, 12:17 AM
Unknown Object (File)
Fri, Jun 5, 8:28 PM
Unknown Object (File)
Fri, Jun 5, 7:18 PM
Unknown Object (File)
Fri, Jun 5, 6:26 PM
Unknown Object (File)
Fri, Jun 5, 3:00 PM
Unknown Object (File)
Fri, Jun 5, 12:20 AM
Subscribers

Details

Summary
Specifically, do not let vtryrecycle() to recycle a used vnode. It is
possible for a vnode to be vref-ed or vuse-ed lockless after it is held
by vhold_recycle_free(). Then, since vtryrecycle() does not recheck the
hold count, we might end up freeing vused vnode.

Since vget_finish() increments v_usecount after obtaining the vnode
lock, we would observe the hold reference anyway when the parallel
vget() is blocked waiting on the vnode lock.

PR:     281749
Reported and tested by: Steve Peurifoy <ssw01@mathistry.net>, vova@zote.me

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

kib requested review of this revision.Thu, May 28, 5:19 PM
kib added a reviewer: kevans.
sys/kern/vfs_subr.c
1943

I don't really understand "unneccesary" here.

1945

Is there any reason to acquire the interlock first?

1946

Use refcount_load()?

kib marked 3 inline comments as done.Thu, May 28, 8:17 PM
kib added inline comments.
sys/kern/vfs_subr.c
1943

The intent of vtryrecycle() seems to be that only usecount should prevent vnlru reclaim. If we have other thread only calling vhold(), the vnode is still eligible for vnlru processing. Since I do not see a way to distinguish half-done vget() vs plain vhold(), I added the word 'unneccessary' there.

1945

It might be it is possible to move interlock later, but this is quite delicate IMO. The reason why it might be correct is that the current implementation of vdrop() performs the 1->0 decrement under the interlock, but not any n->(n-1) for n > 1. Since vnlru processing owns one hold count, we are guaranteed that a parallel vdrop() does not interfere with the test.

But this is not something I want to do in this change. It also adds very obtuse dependency on the vdrop() implementation detail. I think a single memory load under the interlock is fine, after all. Most of the time the reclamation would consume much more resources.

1946

Why? Neither v_usecount nor v_holdcnt are refcounts.

sys/kern/vfs_subr.c
1939–1943

I'd suggest reformulating along these lines to clear a possible confusion.

1945

Is there any reason to acquire the interlock first?

It seems it could be acquired after unlocking the vnode. But I tend to agree with kib@, if touching it at all, that should be a separate revision/commit for safety.

1946

This improves the situation. However, it simply cannot fix the problem in all generality. There's always the problem that some thread can try to get an active reference (vref(), vget()) while the check above did not pass and we are heading to vgonel() below (as getting an active reference is essentially lockless as the first action is to vhold() (in vget_prep())).

It's high time the whole vnode reference machinery is revised (I'd love to but am and will be unavailable for this task for a long time).

1946

Checking for the use count is redundant.

This greatly reduces the race window, but it does not fix the problem.

Since vhold is fully lockless, lookup can bump it just after vtryrecycle determined vp->v_holdcnt == 1 giving you the same race.

The fix would be to patch the namecache to undo whatever it did and pretend there was no vnode found to begin with, resulting in a fallback to the filesystem-specific lookup. Then the fs can resurrect the inode.

Note this is a race inherently existing versus lookup, it was always there.

As detailed in an inline comment, this patch lowers the probability of the race, but does not make it disappear. The real problem is probably more along the lines of some code doing a vget() and assuming the obtained vnode is not doomed, which cannot be guaranteed in general.

Committing the current patch (ideally with suggested changes) is still fine (improves the situation).

This revision is now accepted and ready to land.Thu, May 28, 8:47 PM
kib marked 3 inline comments as done.Thu, May 28, 8:57 PM

Note that UFS does not have the issue, because vfs hash provides the safe way to either get the reference on the free vnode or observe it marked doomed.
So I do not think that either namecache or generic vnode lifecycle there is to blame.

ZFS should utilize some mechanism to ensure that.

In D57305#1313259, @kib wrote:

So I do not think (...) generic vnode lifecycle there is to blame.

Yes, I finally came to that too, in the previous herald comment. That a vnode can be doomed in parallel is unavoidable when there's no use count initially.

ZFS should utilize some mechanism to ensure that.

Yes. But where it does happen is what to search for (didn't follow the bug itself).