
vfs: fix races between vdrop and VOP_UNLOCK
Changes Planned · Public

Authored by mjg on Fri, Sep 19, 7:58 PM.

Details

Reviewers
kib
markj
olce
Summary
commit 0978cc1c2b53ba1cce573e62b6807ca6a5efb9c3 (HEAD -> vdrop_def3)
Author: Mateusz Guzik <mjg@FreeBSD.org>
Date:   Fri Sep 19 18:22:12 2025 +0000

    vfs: make sure vnodes are not getting freed when they are locked

    The layer has an entrenched expectation that it is fine to unlock a
    vnode while no longer holding a reference on it. While this happens to
    work most of the time, it can race against other threads releasing
    references.

    For example in vput:
            if (!refcount_release(&vp->v_usecount)) {
                    VOP_UNLOCK(vp);
                    return;
            }

    Another thread can race the unlock to call vput_final and eventually
    free the vnode, before VOP_UNLOCK manages to finish. See the commit
    for a more detailed explanation.

    Fix the problem by waiting for the locked operation to drain.
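To make the window concrete, here is a hedged userspace model of the race and of the "defer while locked" fix. All names here (struct obj, put_model, drop_final_model, defer_free) are invented for illustration and are not the kernel API; this is a sketch of the pattern, not the actual patch.

```c
/*
 * Userspace model -- a sketch, not the kernel code. "struct obj",
 * put_model() and drop_final_model() are made-up stand-ins for the
 * vnode, vput() and vdropl_final(); "defer_free" stands in for the
 * deferred-drop processing added by the patch.
 */
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

struct obj {
	_Atomic int usecount;	/* models v_usecount */
	_Atomic int holdcnt;	/* models v_holdcnt */
	_Atomic bool locked;	/* models the vnode lock */
	_Atomic bool freed;
	_Atomic bool defer_free;
};

/*
 * vput-style release: the use reference is given up first, and only
 * then is the object unlocked -- that window is the race described in
 * the commit message.
 */
static void
put_model(struct obj *o)
{
	if (atomic_fetch_sub(&o->usecount, 1) > 1) {
		/* Window: we hold no reference but are still locked. */
		atomic_store(&o->locked, false);	/* VOP_UNLOCK */
		return;
	}
	/* last-use-reference path (vput_final) elided */
}

/*
 * vdrop-style final drop with the fix: if the object is still locked
 * by a thread racing as above, defer the free instead of performing it.
 */
static void
drop_final_model(struct obj *o)
{
	if (atomic_fetch_sub(&o->holdcnt, 1) > 1)
		return;
	if (atomic_load(&o->locked)) {
		atomic_store(&o->defer_free, true);
		return;
	}
	atomic_store(&o->freed, true);
}
```

The point of the model: the free is gated on the lock state, so a racing unlocker never has the object freed out from under its VOP_UNLOCK.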

commit 416a19154df649ababf8bae04ed741e165df137d
Author: Mateusz Guzik <mjg@FreeBSD.org>
Date:   Mon Sep 22 14:09:51 2025 +0000

    vfs: reduce indentation in vput_final

    nfc

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

mjg requested review of this revision. Fri, Sep 19, 7:58 PM

Too late here to review this properly now. But see my last comment in D52608, which I posted not knowing about this new revision. I don't think we actually need to guarantee that vdropl() cannot free a vnode that is locked, as this case already should not happen in our tree. It seems that guaranteeing that while executing vput_final() is enough, which a priori doesn't require this more involved implementation (but I didn't do the exercise).

If your idea boils down to adding some form of lock drain to vput_final, it's not going to work.

  1. there are legal ways for new users to show up
  2. not everyone calling into vput_final has the lock held; adding a wait on such a lock can result in deadlocks

My patch only checks for the lock state when it is no longer legal for anyone to get a *new* lock. At worst, there may be pre-existing holders to wait out.

Perhaps one could check if this is the last v_holdcnt and use the current deferred inactive machinery to cover it, but it sounds too brittle to me.

In D52628#1202190, @mjg wrote:
  2. not everyone calling into vput_final has the lock held; adding a wait on such a lock can result in deadlocks

True. But we could use the already existing deferred inactive mechanism to avoid that problem.

  1. there are legal ways for new users to show up

Which ones?

This patch seems to fix the panics I referred to in my revision. Will you finish this patch?

This is finished, it just has extra parts to trigger a deferred drop + a printf that it happened. I'll upload a cleaned-up variant shortly.

Did you verify you get the printfs from vdrop_doomed_process in your testcase?

mjg edited the summary of this revision.
In D52628#1203111, @mjg wrote:

This is finished, it just has extra parts to trigger a deferred drop + a printf that it happened. I'll upload a cleaned-up variant shortly.

Did you verify you get the printfs from vdrop_doomed_process in your testcase?

Yeah, while running the test, every so often I see a burst of

vdrop_doomed_process: processing 0xfffffe005f834528                            
vdrop_doomed_process: processing 0xfffffe005fbc9898                            
vdrop_doomed_process: processing 0xfffffe005f8e8000                            
vdrop_doomed_process: processing 0xfffffe005f81ac08                            
vdrop_doomed_process: processing 0xfffffe005f7736e0                            
vdrop_doomed_process: processing 0xfffffe005f818370                            
vdrop_doomed_process: processing 0xfffffe005f81aa50                            
vdrop_doomed_process: processing 0xfffffe005f7741b8                            
vdrop_doomed_process: processing 0xfffffe005f807000                            
vdrop_doomed_process: processing 0xfffffe005f837dc0                            
vdrop_doomed_process: processing 0xfffffe005f7731b8
sys/kern/vfs_subr.c
327

These can be static.

  • make new fields static
  • expand commentary
mjg marked an inline comment as done. Mon, Sep 22, 2:37 PM

Why not put the defer action into vput_final() (and possibly use the already existing deferred inactive mechanism)?

By doing the defer in vdrop(), the extra thing you're doing is guarding against dropping a "preservation" reference (v_holdcnt) while a vnode is still locked and needs to be unlocked. Are we really having such cases in the tree? I've just checked vdrop() and vdropl() calls in tree and I see none.

Why not put the defer action into vput_final() (and possibly use the already existing deferred inactive mechanism)?

For a non-doomed vnode new users can pop up and disappear. If the vnode is doomed, the mechanism is not operational as ->v_mount is cleared.

By doing the defer in vdrop(), the extra thing you're doing is guarding against dropping a "preservation" reference (v_holdcnt) while a vnode is still locked and needs to be unlocked. Are we really having such cases in the tree? I've just checked vdrop() and vdropl() calls in tree and I see none.

This is guarding against the race as described in the commit message and the commentary in the patch. vput itself is an immediate example and there are presumably vunref consumers which also suffer the problem.

In D52628#1203128, @mjg wrote:

For a non-doomed vnode new users can pop up and disappear.

You're repeating yourself, and I repeat: Not in a way that is relevant. If I'm wrong, fine, please show me a case that you think is relevant and is not covered by handling this in vput_final(). And, anyway, deferral does not need to occur at all for non-doomed vnode.

If the vnode is doomed, the mechanism is not operational as ->v_mount is cleared.

It's not too difficult to amend it to behave differently on VI_DEFDROP, is it?

This is guarding against the race as described in the commit message and the commentary in the patch. vput itself is an immediate example and there are presumably vunref consumers which also suffer the problem.

Yes. What I'm after here is that the patch here does more than that, by also guarding against the race I've just described, which is non-existent. This adds a VOP_ISLOCKED() to every vdropl_final() whereas this is not necessary, that's my point.

In D52628#1203128, @mjg wrote:

For a non-doomed vnode new users can pop up and disappear.

You're repeating yourself, and I repeat: Not in a way that is relevant. If I'm wrong, fine, please show me a case that you think is relevant and is not covered by handling this in vput_final(). And, anyway, deferral does not need to occur at all for non-doomed vnode.

It is legal for a vnode to gain hold count while doomed and it is legal to lock afterwards.

For a specific case see vnlru_free_impl which ultimately calls vhold_recycle_free. Note patching up this specific consumer does not automatically resolve the problem, the entire kernel would have to be audited.

If the vnode is doomed, the mechanism is not operational as ->v_mount is cleared.

It's not too difficult to amend it to behave differently on VI_DEFDROP, is it?

Doing so would require implementing a custom deferral processing routine, which constitutes the vast majority of this patch. It was my impression you were protesting the patch on this ground, assuming the current deferral could be used or that vput_final could wait for the lock to become free.

This is guarding against the race as described in the commit message and the commentary in the patch. vput itself is an immediate example and there are presumably vunref consumers which also suffer the problem.

Yes. What I'm after here is that the patch here does more than that, by also guarding against the race I've just described, which is non-existent.

Per my description above, while the problem is trivially visible with vput (and thus was used as an example), it does not start there. It starts with the layer allowing vhold to go up on doomed vnodes and doomed vnodes to get locked.

This adds a VOP_ISLOCKED() to every vdropl_final() whereas this is not necessary, that's my point.

What exactly are you planning to add to vput_final (which executes more often) which would be less expensive? Note vdropl_final only executes for *doomed* vnodes, which is again a minority of calls to vdrop.

Huh, it so happens vtryrecycle unlocks and then vdrops, so it does not run into the problem after all. The general remark stands though.

In D52628#1203136, @mjg wrote:

It is legal for a vnode to gain hold count while doomed and it is legal to lock afterwards.

Certainly, but I don't see how that's relevant here. Once you have a hold count, nothing is going to free the vnode underneath you. And, again, there is nothing in the tree that unlocks a vnode after dropping a hold count, by contrast to what happens with vrele()/vput() due to the cascading 2-stage decrementation on the last v_usecount reference drop.

It's not too difficult to amend it to behave differently on VI_DEFDROP, is it?

It was my impression you were protesting the patch on this ground, assuming the current deferral can be used or the vput_final can wait for the lock to become free.

No, it's just that initially I hadn't done the mental exercise enough to see that vput_final() itself could not wait for the lock.

I do not insist too much on reusing the same deferral mechanism. It's just that, once vput_final() itself defers vnodes for dropping the last hold count, that's quite natural (and it's quite natural you didn't reuse it as you do so in vdropl_final()). My main point was doing the deferral in vput_final() instead of vdropl_final(), this was just a consecutive remark.

It starts with the layer allowing vhold to go up on doomed vnodes and doomed vnodes to get locked.

I don't see how a check in vput_final() wouldn't handle that (again, if you have an independent hold reference, nothing bad can happen; the problem is unlock after vput()/vrele(), inside the implementation or in consumers; unlock after vdrop() never happens).

This adds a VOP_ISLOCKED() to every vdropl_final() whereas this is not necessary, that's my point.

What exactly are you planning to add to vput_final (which executes more often) which would be less expensive? Note vdropl_final only executes for *doomed* vnodes, which is again a minority of calls to vdrop.

Nothing less expensive per se: The check would occur after the test about the doomed vnode in vput_final(). The only saving is that not all vdropl_final() would do it. If vput_final() is called much more, that's perhaps not a big saving.

Anyway, I think continuing this discussion for now is unproductive:

  1. Only showing code would clear misunderstandings, and unfortunately I cannot dedicate time to this now, as I'm going to head to EuroBSDCon and will be presenting there.
  2. I agree that in theory, your approach works (even if I think it might not be optimal).
  3. Tests done by markj@ seem to indicate the problem is fixed in practice.

So I don't want to stand in the way, and this can be revisited later with more time as needed.

sys/kern/vfs_subr.c
3661–3680

That's great, but please commit this reformatting separately if possible.

4123–4124

Confusing comment (wrong?). You don't want to bump freevnodes as the vnode at hand is still not effectively freed; that will be done in vdrop_doomed_defer(), which will call vdropl_final() again, and we will then take the other branch above that actually calls vfs_freevnodes_inc().

This revision is now accepted and ready to land. Mon, Sep 22, 4:57 PM

vput et al are the primary consumers of vdrop.

Anyway I just ran the following for few minutes: dtrace -n 'fbt::vdrop:entry,fbt::vput_final:entry { @[probefunc, ((struct vnode *)arg0)->v_irflag & 0x1] = count(); }'

while running poudriere

vdrop                                                     1            53770
vdrop                                                     0          7686767
vput_final                                                0         25784570

As in, during tracing there were significantly more vput_final's than vdrops, and even then only a few vdrops had encountered a doomed vnode. Or to put it differently, VOP_ISLOCKED is in the least frequently occurring spot.

mjg marked an inline comment as done. Mon, Sep 22, 5:08 PM
mjg added inline comments.
sys/kern/vfs_subr.c
3661–3680

Per the commentary in the opening description this is a separate commit.

4123–4124

The intent is to spell out that we transitioned v_holdcnt to 0 before coming here, meaning we "owe" a bump of freevnodes. But we are undoing that operation in this atomic.

I want to stress the former, I don't care about specific wording for it.

sys/kern/vfs_subr.c
4123–4124

Does something like this seem clearer?

mjg marked an inline comment as done. Mon, Sep 22, 5:23 PM
mjg added inline comments.
sys/kern/vfs_subr.c
4123–4124

I'm going to add for Mark to wordsmith as he is a native speaker. Modulo possible grammar fixes this reads fine to me.

4123–4124

I meant *wait* for Mark.

Looks mostly ok to me, thank you.

sys/kern/vfs_subr.c
4117

Hypothetically, the thread holding the lock could make other modifications to the vnode after releasing the reference and before releasing the lock. In the !VOP_ISLOCKED(vp) case here, don't we still need an acquire barrier to synchronize with that unlock? The refcount itself (refcount_release() issues a release fence) won't provide enough synchronization. Or maybe we don't care, but that deserves some justification.

4119

Just use vn_printf()? (I guess we should really have a vn_panic().)

4123–4124

Both versions read fine to me.

sys/kern/vfs_subr.c
4117

I presume you meant VOP_UNLOCK, as we are already fully synchronized against mere refcount play.

But yes, good spotting.

I'm going to sleep on this change and update tomorrow.

4119

I want this to panic on production kernels.

On debug kernels extra info would be nice, so it goes the usual debug panic route.

sys/kern/vfs_subr.c
4119

Yeah, but vn_printf() is available in production kernels, so I don't see a reason not to use it, is all. That is,

vn_printf(...);
panic(...);
  • patch up some comments
  • add the missing acq fence
This revision now requires review to proceed. Tue, Sep 23, 8:41 PM
sys/kern/vfs_subr.c
4116

I am not sure I fully understand the comment. If you are saying that the CAS op on line 4119 must be sequentially consistent, then it is not. E.g. I am sure that plain atomic_cmpset_int() is fully relaxed on aarch64.

sys/kern/vfs_subr.c
4116

The equivalent on Linux is guaranteed to provide a full barrier, I think it's a useful property.

Now that you mention it, indeed the FreeBSD code does not guarantee this behavior and I don't see a seq_cst variant either (only rel or acq). So at a bare minimum a _seq_cst variant should be added, or even better, just guarantee the fence from atomic_cmpset. Chances are there is code in the tree which already depends on it.

sys/kern/vfs_subr.c
4108
4116

I believe atomic(9)s used to issue full barriers and a long time ago there was a tree-wide sweep to relax them since many were unneeded.

I don't really understand why we need a fence in the ISLOCKED case. In that case we will defer the drop and lock and unlock the vnode. That provides an acq_rel fence, but why isn't that enough?

sys/kern/vfs_subr.c
4116

I believe atomic(9)s used to issue full barriers and a long time ago there was a tree-wide sweep to relax them since many were unneeded.

It's probably true that many were unneeded, but 1. one should err on the safe side and include them when in doubt 2. there should be a routine which *does* provide the fence if needed

So I would argue the pragmatic approach is to implement _relaxed and _seq_cst, with a tree-wide sweep to default to _seq_cst.

I don't really understand why we need a fence in the ISLOCKED case. In that case we will defer the drop and lock and unlock the vnode. That provides an acq_rel fence, but why isn't that enough?

The comment about cmpset vs full fence was about the cmpset which set the VHOLD_NO_SMR flag.

sys/kern/vfs_subr.c
4116

The comment about cmpset vs full fence was about the cmpset which set the VHOLD_NO_SMR flag.

Right, but why exactly does that cmpset need a fence?

mjg planned changes to this revision. Wed, Sep 24, 9:49 AM
mjg added inline comments.
sys/kern/vfs_subr.c
4116

Ok, I see what you are getting at.

If we are merely racing against someone who has not unlocked yet, our refcount_release already posted an acquire fence so we can just VOP_ISLOCKED. If new consumers showed up in the meantime bumping the hold count, the last one trying to do the 1->0 transition will block on the interlock which we are holding. Once our cmpset fails due to v_holdcnt being 1 and we leave, *that* thread will handle the lock. In this case indeed the full fence is not necessary.

Mentally I was in a space where I optimize this to instead transition 1->VHOLD_NO_SMR directly, saving an atomic for doomed vnodes. But in that case the fence *is* necessary so that we only observe VOP_ISLOCKED after everyone was told to bugger off. This was the next step but I did not want to optimize in a bugfix.

So "funnily" enough vhold_smr already depends on at least acq when using atomic_fcmpset, which is currently absent.

In the current patch the cmpset setting VHOLD_NO_SMR should have rel semantics.

So... as I see it this may be easier to reason about if I *do* the optimization with the patch.

So where are we WRT this change? If this is not going to be finished soon, maybe Mark's change should go in to be cherry-picked for 15.0.

I'll be posting a different variant later today.