vfs: work around the race between vget() and vnlru
ClosedPublic
Actions

Authored by kib on May 28 2026, 5:19 PM.

Details

Reviewers

markj
jah
kevans
olce

Commits

rG36b155a2b3ba: vfs: work around the race between vget() and vnlru

Summary

Specifically, do not let vtryrecycle() to recycle a used vnode. It is
possible for a vnode to be vref-ed or vuse-ed lockless after it is held
by vhold_recycle_free(). Then, since vtryrecycle() does not recheck the
hold count, we might end up freeing vused vnode.

Since vget_finish() increments v_usecount after obtaining the vnode
lock, we would observe the hold reference anyway when the parallel
vget() is blocked waiting on the vnode lock.

PR:     281749
Reported and tested by: Steve Peurifoy <ssw01@mathistry.net>, vova@zote.me

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

kib created this revision.May 28 2026, 5:19 PM

Herald added subscribers: olce, imp. · View Herald TranscriptMay 28 2026, 5:19 PM

kib requested review of this revision.May 28 2026, 5:19 PM

kib added a reviewer: kevans.

olce added a reviewer: olce.May 28 2026, 7:18 PM

markj added inline comments.May 28 2026, 7:42 PM

sys/kern/vfs_subr.c
1943	I don't really understand "unneccesary" here.
1945	Is there any reason to acquire the interlock first?
1946	Use refcount_load()?

kib marked 3 inline comments as done.May 28 2026, 8:17 PM

kib added inline comments.

sys/kern/vfs_subr.c
1943	The intent of vtryrecycle() seems to be that only usecount should prevent vnlru reclaim. If we have other thread only calling vhold(), the vnode is still eligible for vnlru processing. Since I do not see a way to distinguish half-done vget() vs plain vhold(), I added the word 'unneccessary' there.
1945	It might be it is possible to move interlock later, but this is quite delicate IMO. The reason why it might be correct is that the current implementation of vdrop() performs the 1->0 decrement under the interlock, but not any n->(n-1) for n > 1. Since vnlru processing owns one hold count, we are guaranteed that a parallel vdrop() does not interfere with the test. But this is not something I want to do in this change. It also adds very obtuse dependency on the vdrop() implementation detail. I think a single memory load under the interlock is fine, after all. Most of the time the reclamation would consume much more resources.
1946	Why? Neither v_usecount nor v_holdcnt are refcounts.

olce added inline comments.May 28 2026, 8:28 PM

sys/kern/vfs_subr.c
1939–1943	I'd suggest reformulating along these lines to clear a possible confusion.
1945	Is there any reason to acquire the interlock first? It seems it could be acquired after unlocking the vnode. But I tend to agree with `kib@`, if touching it at all, that should be a separate revision/commit for safety.
1946	This improves the situation. However, it simply cannot fix the problem in all generality. There's always the problem that some thread can try to get an active reference (`vref()`, `vget()`) while the check above did not pass and we are heading to `vgonel()` below (as getting an active reference is essentially lockless as the first action is to vhold() (in vget_prep())). It's high time the whole vnode reference machinery is revised (I'd love to but am and will be unavailable for this task for a long time).
1946	Checking for the use count is redundant.

This greatly reduces the race window, but it does not fix the problem.

Since vhold is fully lockless, lookup can bump it just after vtryrecycle determined vp->v_holdcnt == 1 giving you the same race.

The fix would be to patch the namecache to undo whatever it did and pretend there was no vnode found to begin with, resulting in a fallback to the filesystem-specific lookup. Then the fs can resurrect the inode.

Note this is a race inherently existing versus lookup, it was always there.

As detailed in an inline comment, this patch lowers the probability of the race, but does not make it disappear. The real problem is probably more along the lines of some code doing a vget() and assuming the obtained vnode is not doomed, which cannot be guaranteed in general.

Committing the current patch (ideally with suggested changes) is still fine (improves the situation).

This revision is now accepted and ready to land.May 28 2026, 8:47 PM

Note that UFS does not have the issue, because vfs hash provides the safe way to either get the reference on the free vnode or observe it marked doomed.
So I do not think that either namecache or generic vnode lifecycle there is to blame.

ZFS should utilize some mechanism to ensure that.

In D57305#1313259, @kib wrote:

So I do not think (...) generic vnode lifecycle there is to blame.

Yes, I finally came to that too, in the previous herald comment. That a vnode can be doomed in parallel is unavoidable when there's no use count initially.