commit 0978cc1c2b53ba1cce573e62b6807ca6a5efb9c3 (HEAD -> vdrop_def3)
Author: Mateusz Guzik <mjg@FreeBSD.org>
Date: Fri Sep 19 18:22:12 2025 +0000

    vfs: make sure vnodes are not getting freed when they are locked

    The layer has an entrenched expectation that it is fine to unlock a
    vnode while no longer holding a reference on it. While this happens to
    work most of the time, it can race against other threads releasing
    references. For example in vput:

        if (!refcount_release(&vp->v_usecount)) {
                VOP_UNLOCK(vp);
                return;
        }

    Another thread can race the unlock to call vput_final and eventually
    free the vnode before VOP_UNLOCK manages to finish. See the commit for
    a more detailed explanation.

    Fix the problem by waiting for the lock to drain.

commit 416a19154df649ababf8bae04ed741e165df137d
Author: Mateusz Guzik <mjg@FreeBSD.org>
Date: Mon Sep 22 14:09:51 2025 +0000

    vfs: reduce indentation in vput_final

    nfc
Diff Detail
- Repository: rG FreeBSD src repository
- Lint: Skipped
- Unit Tests: Skipped

Event Timeline
Too late here to review this properly now. But see my last comment in D52608, which I posted not knowing about this new revision. I don't think we actually need to guarantee that vdropl() cannot free a vnode that is locked, as this case already should not happen in our tree. It seems that guaranteeing that while executing vput_final() is enough, which a priori doesn't require this more involved implementation (but I didn't do the exercise).
If your idea boils down to adding some form of lock drain to vput_final, it's not going to work.
- there are legal ways for new users to show up
- not everyone calling into vput_final holds the lock; adding a wait on such a lock can result in deadlocks
My patch only checks for the lock state when it is no longer legal for anyone to get a *new* lock. At worst, there may be pre-existing holders to wait out.
Perhaps one could check if this is the last v_holdcnt and use the current deferred inactive machinery to cover it, but it sounds too brittle to me.
True. But we could use the already existing deferred inactive mechanism to avoid that problem.
- there are legal ways for new users to show up
Which ones?
This patch seems to fix the panics I referred to in my revision. Will you finish this patch?
This is finished; it just has extra parts to trigger a deferred drop, plus a printf that it happened. I'll upload a cleaned up variant shortly.
Did you verify you get the printfs from vdrop_doomed_process in your testcase?
Yeah, while running the test, every so often I see a burst of
vdrop_doomed_process: processing 0xfffffe005f834528
vdrop_doomed_process: processing 0xfffffe005fbc9898
vdrop_doomed_process: processing 0xfffffe005f8e8000
vdrop_doomed_process: processing 0xfffffe005f81ac08
vdrop_doomed_process: processing 0xfffffe005f7736e0
vdrop_doomed_process: processing 0xfffffe005f818370
vdrop_doomed_process: processing 0xfffffe005f81aa50
vdrop_doomed_process: processing 0xfffffe005f7741b8
vdrop_doomed_process: processing 0xfffffe005f807000
vdrop_doomed_process: processing 0xfffffe005f837dc0
vdrop_doomed_process: processing 0xfffffe005f7731b8
sys/kern/vfs_subr.c
  327: These can be static.
Why not put the defer action into vput_final() (and possibly use the already existing deferred inactive mechanism)?
By doing the defer in vdrop(), the extra thing you're doing is guarding against dropping a "preservation" reference (v_holdcnt) while a vnode is still locked and needs to be unlocked. Are we really having such cases in the tree? I've just checked vdrop() and vdropl() calls in tree and I see none.
For a non-doomed vnode new users can pop up and disappear. If the vnode is doomed, the mechanism is not operational as ->v_mount is cleared.
By doing the defer in vdrop(), the extra thing you're doing is guarding against dropping a "preservation" reference (v_holdcnt) while a vnode is still locked and needs to be unlocked. Are we really having such cases in the tree? I've just checked vdrop() and vdropl() calls in tree and I see none.
This is guarding against the race as described in the commit message and the commentary in the patch. vput itself is an immediate example and there are presumably vunref consumers which also suffer the problem.
You're repeating yourself, and I repeat: Not in a way that is relevant. If I'm wrong, fine, please show me a case that you think is relevant and is not covered by handling this in vput_final(). And, anyway, deferral does not need to occur at all for non-doomed vnode.
If the vnode is doomed, the mechanism is not operational as ->v_mount is cleared.
It's not too difficult to amend it to behave differently on VI_DEFDROP, is it?
This is guarding against the race as described in the commit message and the commentary in the patch. vput itself is an immediate example and there are presumably vunref consumers which also suffer the problem.
Yes. What I'm after here is that the patch here does more than that, by also guarding against the race I've just described, which is non-existent. This adds a VOP_ISLOCKED() to every vdropl_final() whereas this is not necessary, that's my point.
It is legal for a vnode to gain hold count while doomed and it is legal to lock afterwards.
For a specific case see vnlru_free_impl which ultimately calls vhold_recycle_free. Note patching up this specific consumer does not automatically resolve the problem, the entire kernel would have to be audited.
If the vnode is doomed, the mechanism is not operational as ->v_mount is cleared.
It's not too difficult to amend it to behave differently on VI_DEFDROP, is it?
Doing so would require implementing a custom deferral processing routine, which constitutes the vast majority of this patch. It was my impression you were protesting the patch on this ground, assuming the current deferral can be used or that vput_final can wait for the lock to become free.
This is guarding against the race as described in the commit message and the commentary in the patch. vput itself is an immediate example and there are presumably vunref consumers which also suffer the problem.
Yes. What I'm after here is that the patch here does more than that, by also guarding against the race I've just described, which is non-existent.
Per my description above, while the problem is trivially visible with vput (and thus was used as an example), it does not start there. It starts with the layer allowing vhold to go up on doomed vnodes and doomed vnodes to get locked.
This adds a VOP_ISLOCKED() to every vdropl_final() whereas this is not necessary, that's my point.
What exactly are you planning to add to vput_final (which executes more often) which would be less expensive? Note vdropl_final only executes for *doomed* vnodes, which is again a minority of calls to vdrop.
Huh, it so happens that vtryrecycle unlocks and then vdrops, so it does not run into the problem after all. The general remark stands, though.
Certainly, but I don't see how that's relevant here. Once you have a hold count, nothing is going to free the vnode underneath you. And, again, there is nothing in the tree that unlocks a vnode after dropping a hold count, by contrast to what happens with vrele()/vput() due to the cascading 2-stage decrementation on the last v_usecount reference drop.
It's not too difficult to amend it to behave differently on VI_DEFDROP, is it?
It was my impression you were protesting the patch on this ground, assuming the current deferral can be used or the vput_final can wait for the lock to become free.
No, it's just that initially I hadn't done the mental exercise enough to see that vput_final() itself could not wait for the lock.
I do not insist too much on reusing the same deferral mechanism. It's just that, once vput_final() itself defers vnodes for dropping the last hold count, that's quite natural (and it's quite natural you didn't reuse it as you do so in vdropl_final()). My main point was doing the deferral in vput_final() instead of vdropl_final(), this was just a consecutive remark.
It starts with the layer allowing vhold to go up on doomed vnodes and doomed vnodes to get locked.
I don't see how a check in vput_final() wouldn't handle that (again, if you have an independent hold reference, nothing bad can happen; the problem is unlock after vput()/vrele(), inside the implementation or in consumers; unlock after vdrop() never happens).
This adds a VOP_ISLOCKED() to every vdropl_final() whereas this is not necessary, that's my point.
What exactly are you planning to add to vput_final (which executes more often) which would be less expensive? Note vdropl_final only executes for *doomed* vnodes, which is again a minority of calls to vdrop.
Nothing less expensive per se: The check would occur after the test about the doomed vnode in vput_final(). The only saving is that not all vdropl_final() would do it. If vput_final() is called much more, that's perhaps not a big saving.
Anyway, I think continuing this discussion for now is unproductive:
- Only showing code would clear misunderstandings, and unfortunately I cannot dedicate time to this now, as I'm heading to EuroBSDCon and presenting there.
- I agree that in theory your approach works (even if I think it might not be optimal).
- Tests done by markj@ seem to indicate the problem is fixed in practice.
So I don't want to stand in the way, and this can be revisited later with more time as needed.
sys/kern/vfs_subr.c
  3661–3680: That's great, but please commit this reformatting separately if possible.
  4123–4124: Confusing comment (wrong?). You don't want to bump freevnodes, as the vnode at hand is still not effectively freed; this will be done in vdrop_doomed_defer(), which will call vdropl_final() again, and we will pass in the other branch above, actually calling vfs_freevnodes_inc().
vput et al are the primary consumers of vdrop.
Anyway, I just ran the following for a few minutes while running poudriere:

    dtrace -n 'fbt::vdrop:entry,fbt::vput_final:entry { @[probefunc, ((struct vnode *)arg0)->v_irflag & 0x1] = count(); }'

    vdrop            1      53770
    vdrop            0    7686767
    vput_final       0   25784570
As in, during tracing there were significantly more vput_finals than vdrops, and even then only a few vdrops encountered a doomed vnode. Or to put it differently, VOP_ISLOCKED is in the least frequently occurring spot.
sys/kern/vfs_subr.c
  3661–3680: Per the commentary in the opening description, this is a separate commit.
  4123–4124: The intent is to spell out that we transitioned v_holdcnt to 0 before coming here, meaning we "owe" a freevnodes bump, but we are undoing that operation in this atomic. I want to stress the former; I don't care about the specific wording.
sys/kern/vfs_subr.c
  4123–4124: Does something like this seem clearer?
Looks mostly ok to me, thank you.
sys/kern/vfs_subr.c
  4117: Hypothetically, the thread holding the lock could make other modifications to the vnode after releasing the reference and before releasing the lock. In the !VOP_ISLOCKED(vp) case here, don't we still need an acquire barrier to synchronize with that unlock? The refcount itself (refcount_release() issues a release fence) won't provide enough synchronization. Or maybe we don't care, but that deserves some justification.
  4119: Just use vn_printf()? (I guess we should really have a vn_panic().)
  4123–4124: Both versions read fine to me.
sys/kern/vfs_subr.c
  4117: I presume you meant VOP_UNLOCK, as we are already fully synchronized against mere refcount play. But yes, good spotting. I'm going to sleep on this change and update tomorrow.
  4119: I want this to panic on production kernels. On debug kernels extra info would be nice, so it goes the usual debug panic route.
sys/kern/vfs_subr.c
  4119: Yeah, but vn_printf() is available in production kernels, so I don't see a reason not to use it is all. That is: vn_printf(...); panic(...);
sys/kern/vfs_subr.c
  4116: I am not sure I fully understand the comment. If you are saying that the CAS op on line 4119 must be sequentially consistent, then it is not. E.g. I am sure that plain atomic_cmpset_int() is fully relaxed on aarch64.
sys/kern/vfs_subr.c
  4116: The equivalent on Linux is guaranteed to provide a full barrier; I think it's a useful property. Now that you mention it, indeed the FreeBSD code does not guarantee this behavior, and I don't see a seq_cst variant either (only rel or acq). So at bare minimum a _seq_cst variant should be added, or even better, just guarantee the fence from atomic_cmpset. Chances are there is code in the tree which already depends on it.
sys/kern/vfs_subr.c
  4108:
  4116: I believe atomic(9)s used to issue full barriers, and a long time ago there was a tree-wide sweep to relax them since many were unneeded. I don't really understand why we need a fence in the ISLOCKED case. In that case we will defer the drop and lock and unlock the vnode. That provides an acq_rel fence, but why isn't that enough?
sys/kern/vfs_subr.c
  4116: It's probably true that many were unneeded, but:
    1. one should err on the safe side and include them when in doubt
    2. there should be a routine which *does* provide the fence if needed
  So I would argue the pragmatic approach is to implement _relaxed and _seq_cst, with a tree-wide sweep to default to _seq_cst.
  The comment about cmpset vs full fence was about the cmpset which sets the VHOLD_NO_SMR flag.
sys/kern/vfs_subr.c
  4116: Right, but why exactly does that cmpset need a fence?
sys/kern/vfs_subr.c
  4116: Ok, I see what you are getting at.

  If we are merely racing against someone who has not unlocked yet, our refcount_release already posted an acquire fence, so we can just VOP_ISLOCKED. If new consumers showed up in the meantime bumping the hold count, the last one trying to do the 1->0 transition will block on the interlock which we are holding. Once our cmpset fails due to v_holdcnt being 1 and we leave, *that* thread will handle the lock. In this case, indeed, the full fence is not necessary.

  Mentally I was in a space where I optimize this to instead transition 1->VHOLD_NO_SMR directly, saving an atomic for doomed vnodes. But in that case the fence *is* necessary so that we only observe VOP_ISLOCKED after everyone was told to bugger off. This was the next step, but I did not want to optimize in a bugfix.

  So "funnily" enough, vhold_smr already depends on at least acq when using atomic_fcmpset, which is currently absent. In the current patch the cmpset setting VHOLD_NO_SMR should have rel semantics. So... as I see it, this may be easier to reason about if I *do* the optimization with the patch.
So where are we WRT this change? If this is not going to be finished soon, maybe Mark's change should go in, to be cherry-picked for 15.0.