Specifically, do not let vtryrecycle() to recycle a used vnode. It is possible for a vnode to be vref-ed or vuse-ed lockless after it is held by vhold_recycle_free(). Then, since vtryrecycle() does not recheck the hold count, we might end up freeing vused vnode. Since vget_finish() increments v_usecount after obtaining the vnode lock, we would observe the hold reference anyway when the parallel vget() is blocked waiting on the vnode lock. PR: 281749 Reported and tested by: Steve Peurifoy <ssw01@mathistry.net>, vova@zote.me
Details
Diff Detail
- Repository
- rG FreeBSD src repository
- Lint
Lint Not Applicable - Unit
Tests Not Applicable
Event Timeline
| sys/kern/vfs_subr.c | ||
|---|---|---|
| 1943 | The intent of vtryrecycle() seems to be that only usecount should prevent vnlru reclaim. If we have other thread only calling vhold(), the vnode is still eligible for vnlru processing. Since I do not see a way to distinguish half-done vget() vs plain vhold(), I added the word 'unneccessary' there. | |
| 1945 | It might be it is possible to move interlock later, but this is quite delicate IMO. The reason why it might be correct is that the current implementation of vdrop() performs the 1->0 decrement under the interlock, but not any n->(n-1) for n > 1. Since vnlru processing owns one hold count, we are guaranteed that a parallel vdrop() does not interfere with the test. But this is not something I want to do in this change. It also adds very obtuse dependency on the vdrop() implementation detail. I think a single memory load under the interlock is fine, after all. Most of the time the reclamation would consume much more resources. | |
| 1946 | Why? Neither v_usecount nor v_holdcnt are refcounts. | |
| sys/kern/vfs_subr.c | ||
|---|---|---|
| 1939–1943 | I'd suggest reformulating along these lines to clear a possible confusion. | |
| 1945 |
It seems it could be acquired after unlocking the vnode. But I tend to agree with kib@, if touching it at all, that should be a separate revision/commit for safety. | |
| 1946 | This improves the situation. However, it simply cannot fix the problem in all generality. There's always the problem that some thread can try to get an active reference (vref(), vget()) while the check above did not pass and we are heading to vgonel() below (as getting an active reference is essentially lockless as the first action is to vhold() (in vget_prep())). It's high time the whole vnode reference machinery is revised (I'd love to but am and will be unavailable for this task for a long time). | |
| 1946 | Checking for the use count is redundant. | |
This greatly reduces the race window, but it does not fix the problem.
Since vhold is fully lockless, lookup can bump it just after vtryrecycle determined vp->v_holdcnt == 1 giving you the same race.
The fix would be to patch the namecache to undo whatever it did and pretend there was no vnode found to begin with, resulting in a fallback to the filesystem-specific lookup. Then the fs can resurrect the inode.
Note this is a race inherently existing versus lookup, it was always there.
As detailed in an inline comment, this patch lowers the probability of the race, but does not make it disappear. The real problem is probably more along the lines of some code doing a vget() and assuming the obtained vnode is not doomed, which cannot be guaranteed in general.
Committing the current patch (ideally with suggested changes) is still fine (improves the situation).
Note that UFS does not have the issue, because vfs hash provides the safe way to either get the reference on the free vnode or observe it marked doomed.
So I do not think that either namecache or generic vnode lifecycle there is to blame.
ZFS should utilize some mechanism to ensure that.
Yes, I finally came to that too, in the previous herald comment. That a vnode can be doomed in parallel is unavoidable when there's no use count initially.
ZFS should utilize some mechanism to ensure that.
Yes. But where it does happen is what to search for (didn't follow the bug itself).