
vnode: Rework vput() to avoid holding the vnode lock after decrementing
AcceptedPublic

Authored by markj on Thu, Sep 18, 11:28 PM.
Details

Reviewers
kib
mjg
olce
Summary

It is not safe to modify the vnode structure after releasing one's
reference. Modify vput() to avoid this. Use refcount_release_if_last()
to opportunistically call vput_final() with the vnode lock held since we
need the vnode lock in order to deactivate the vnode, and it's silly to
drop the vnode lock and immediately reacquire it in this common case.

Note that vunref() has a similar flaw.

Reported by: syzbot+6676b3ff282d590b0fb3@syzkaller.appspotmail.com
Reported by: syzbot+38e26cf6f959e886f110@syzkaller.appspotmail.com

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

This revision is now accepted and ready to land. Fri, Sep 19, 12:56 AM

Note vnodes were not freeable at all, and that holding the lock to maintain liveness is probably highly entrenched in the layer.

I'm going to sleep on it; I suspect a guarantee can be introduced that the unlock is safe.

vnode: Rework vput() to avoid holding the vnode lock after decrementing

I think this part is an overall improvement. The vrele() you added will decrement the reference without the vnode lock held (nor the interlock), and in the non-contended case will not execute vput_final(), which would need to reacquire the lock. Holding locks for a shorter amount of time is of course better in general, as it decreases the occurrence of contention. The risk of this change is to pessimize the contended case, where vrele() will want to reacquire the vnode lock if vnode inactivation is needed, but it seems reasonable to hope there will be no contention on this lock, as this is the last active reference.

It is not safe to modify the vnode structure after releasing one's
reference. Modify vput() to avoid this.

This part I don't understand, as I don't see how it matches the change. If you're the last one dropping a reference, it's safe to modify the vnode, as vput_final() will do, and both versions do not differ in this respect (the hold count is still non-zero, and is decremented at the end of vput_final()). In the original code, you even had the additional "guarantee" of holding the vnode lock. What do you mean exactly by "modifying the vnode structure"?

Use refcount_release_if_last() to opportunistically call vput_final() with the vnode lock held since we
need the vnode lock in order to deactivate the vnode, and it's silly to
drop the vnode lock and immediately reacquire it in this common case.

This description I think is wrong, because this change actually does the opposite. vput_final() in the original version is always called without prior dropping of the vnode, whereas here it can be (by the last added vrele()).

Which exact bug are you trying to fix with this? Are the syzkaller reports public?

Use refcount_release_if_last() to opportunistically call vput_final() with the vnode lock held since we
need the vnode lock in order to deactivate the vnode, and it's silly to
drop the vnode lock and immediately reacquire it in this common case.

This description I think is wrong, because this change actually does the opposite. vput_final() in the original version is always called without prior dropping of the vnode, whereas here it can be (by the last added vrele()).

Well, it is not wrong in describing the new state of affairs, but feels weird given what I've just written.

It is not safe to modify the vnode structure after releasing one's
reference. Modify vput() to avoid this.

This part I don't understand, as I don't see how it matches the change. If you're the last one dropping a reference, it's safe to modify the vnode, as vput_final() will do, and both versions do not differ in this respect (the hold count is still non-zero, and is decremented at the end of vput_final()). In the original code, you even had the additional "guarantee" of holding the vnode lock. What do you mean exactly by "modifying the vnode structure"?

The problem is when a vrele() call is _not_ the last one to drop a reference. Once the thread has invoked refcount_release() it's unsafe to modify the protected structure. VOP_UNLOCK does exactly that, by modifying the lock state word. So we must release the vnode lock first unless we're releasing the last reference.

Use refcount_release_if_last() to opportunistically call vput_final() with the vnode lock held since we
need the vnode lock in order to deactivate the vnode, and it's silly to
drop the vnode lock and immediately reacquire it in this common case.

This description I think is wrong, because this change actually does the opposite. vput_final() in the original version is always called without prior dropping of the vnode, whereas here it can be (by the last added vrele()).

Which exact bug are you trying to fix with this?

See https://reviews.freebsd.org/D52245#1201024

Are the syzkaller reports public?

https://syzkaller.appspot.com/bug?extid=6676b3ff282d590b0fb3
https://syzkaller.appspot.com/bug?extid=38e26cf6f959e886f110

The reproducers do not trigger the bug for me unless I add lots of load to the VM host.

So here is the situation as I understand it:

  1. vput
if (!refcount_release(&vp->v_usecount)) {
        VOP_UNLOCK(vp);
        return;
}

In principle, in the time window between dropping the reference and VOP_UNLOCK no longer touching the vnode, any number of other routines could have been called and ultimately freed said vnode. Note that VOP_UNLOCK was written in a way to not touch the vnode after the underlying locking primitive actually unlocks it. Similarly, lockmgr does not touch it past the store that unlocks it either.

Or to put it differently, for this to be safe, freeing has to be gated on making sure nobody has the vnode locked.

  2. vunref

vunref consumers explicitly operate on a vnode which possibly no longer has any references and want it to remain usable until unlock.

The first piece of the solution to the puzzle is that in order to legally lock a vnode you need to have v_holdcnt on it. Then vdrop of the last ref would only need to wait for the vnode to become unlocked. But it may be called in a spot where waiting is not safe, so the operation should be deferred.

So that's it: deferred vdrop support and a lock trip from there if needed.

sketch:

diff --git a/sys/kern/vfs_subr.c b/sys/kern/vfs_subr.c
index f86bda2aa6f0..ca014fb4533e 100644
--- a/sys/kern/vfs_subr.c
+++ b/sys/kern/vfs_subr.c
@@ -4013,16 +4013,66 @@ vdrop(struct vnode *vp)
        vdropl(vp);
 }

+void
+vdropl_deferred(struct vnode *vp)
+{
+
+       ASSERT_VI_LOCKED(vp, __func__);
+       if ((vp->v_iflag & VI_DEFDROP) != 0) {
+               /*
+                * someone already deferred vdrop and then someone else
+                * vhold'ed in the meantime. we can just whack this.
+                */
+               refcount_release(&vp->v_holdcnt);
+               VI_UNLOCK(vp);
+               return;
+       }
+       vlazy(vp);
+       vp->v_iflag |= VI_DEFDROP;
+       VI_UNLOCK(vp);
+       atomic_add_long(&deferred_drop, 1);
+}
+
+void
+vdrop_deferred(struct vnode *vp)
+{
+
+       ASSERT_VI_UNLOCKED(vp, __func__);
+       CTR2(KTR_VFS, "%s: vp %p", __func__, vp);
+       if (refcount_release_if_not_last(&vp->v_holdcnt))
+               return;
+       VI_LOCK(vp);
+       vdropl_deferred(vp);
+}
+
 static __always_inline void
 vdropl_impl(struct vnode *vp, bool enqueue)
 {

        ASSERT_VI_LOCKED(vp, __func__);
        CTR2(KTR_VFS, "%s: vp %p", __func__, vp);
-       if (!refcount_release(&vp->v_holdcnt)) {
+       if (refcount_release_if_not_last(&vp->v_holdcnt)) {
                VI_UNLOCK(vp);
                return;
        }
+       /*
+        * Deal with the last reference
+        */
+       if (VOP_ISLOCKED(vp)) {
+               /*
+                * donated to the vlazy list traversal
+                */
+               vdropl_deferred(vp);
+               return;
+       }
+
+       /*
+        * Last reference and the vnode is unlocked: whack the ref
+        * (the actual refcount_release() of v_holdcnt) and continue.
+        */
+
        VNPASS((vp->v_iflag & VI_OWEINACT) == 0, vp);
        VNPASS((vp->v_iflag & VI_DEFINACT) == 0, vp);
        if (VN_IS_DOOMED(vp)) {

Not included in the sketch is adjustment of the vlazy traversal code to also check for DEFDROP. It would cycle through the lock if needed and be done with it.

Note how with something like this in place a vhold coming in from vunref will *never* be the last one.

I don't have time right now to dig this in depth, but will later today hopefully. I have some immediate remarks that could help, though.

The problem is when a vrele() call is _not_ the last one to drop a reference. Once the thread has invoked refcount_release() it's unsafe to modify the protected structure. VOP_UNLOCK does exactly that, by modifying the lock state word. So we must release the vnode lock first unless we're releasing the last reference.

In D52608#1201919, @mjg wrote:

So here is the situation as I understand it:

  1. vput
if (!refcount_release(&vp->v_usecount)) {
        VOP_UNLOCK(vp);
        return;
}

In principle, in the time window between dropping the reference and VOP_UNLOCK no longer touching the vnode, any number of other routines could have been called and ultimately freed said vnode. Note that VOP_UNLOCK was written in a way to not touch the vnode after the underlying locking primitive actually unlocks it. Similarly, lockmgr does not touch it past the store that unlocks it either.

Or to put it differently, for this to be safe, freeing has to be gated on making sure nobody has the vnode locked.

Unless I'm mistaken, no and no.

VOP_UNLOCK() modifies the lock state, yes, and I don't see how this can be a problem (even for unionfs, which does specific trickery with locks). We have a non-zero hold count *even* after decrementing the use count to 0, and that hold count by construction cannot drop to 0 before our own vput_final()/vrele(), since we were the holder of the last use count and are thus now the holder of the corresponding one unit of hold count. Why would VOP_UNLOCK() be unsafe in spite of that?

Suppose the vnode is doomed, v_usecount is 2, v_holdcnt is 1 and both usecount holders are issuing vput.
Suppose both are read-locking it.

thread1                                 thread2
refcount_release usecount
// v_usecount is now 1
// proceeding to unlock
// .. getting preempted or
// stalled for whatever other reason
                                        refcount_release usecount
                                        // v_usecount is now 0, calling vput_final
                                        // vnode is doomed, so just unlock and vdrop
                                        VOP_UNLOCK
                                        // v_holdcnt is now 1
                                        vdrop
                                        // dropping v_holdcnt to 0 and seeing a doomed
                                        // vnode proceeds to free it
// vnode is now freed
VOP_UNLOCK
// uaf above

I don't have time right now to dig this in depth, but will later today hopefully. I have some immediate remarks that could help, though.

The problem is when a vrele() call is _not_ the last one to drop a reference. Once the thread has invoked refcount_release() it's unsafe to modify the protected structure. VOP_UNLOCK does exactly that, by modifying the lock state word. So we must release the vnode lock first unless we're releasing the last reference.

In D52608#1201919, @mjg wrote:

So here is the situation as I understand it:

  1. vput
if (!refcount_release(&vp->v_usecount)) {
        VOP_UNLOCK(vp);
        return;
}

In principle, in the time window between dropping the reference and VOP_UNLOCK no longer touching the vnode, any number of other routines could have been called and ultimately freed said vnode. Note that VOP_UNLOCK was written in a way to not touch the vnode after the underlying locking primitive actually unlocks it. Similarly, lockmgr does not touch it past the store that unlocks it either.

Or to put it differently, for this to be safe, freeing has to be gated on making sure nobody has the vnode locked.

Unless I'm mistaken, no and no.

VOP_UNLOCK() modifies the lock state, yes, and I don't see how this can be a problem (even for unionfs, which does specific trickery with locks). We have a non-zero hold count *even* after decrementing the use count to 0, and that hold count by construction cannot drop to 0 before our own vput_final()/vrele(), since we were the holder of the last use count and are thus now the holder of the corresponding one unit of hold count. Why would VOP_UNLOCK() be unsafe in spite of that?

We (the thread that did the successful refcount_release()) do not own the hold count. The hold count is handled by the last drop of the use count. In other words, if another thread drops the last use count 'between' our release and our VOP_UNLOCK(), then vp might already be freed.

So I think the patch under review is just a minor loss in perf and only fixes some of the problem.

I'll try to hack up my proposal over the weekend. Best case scenario it can use the current traversal over the lazy list, worst another one will have to be added.

In D52608#1202070, @kib wrote:

We (the thread that did the successful refcount_release()) do not own the hold count. The hold count is handled by the last drop of the use count.

Yes, I was mistaken on this, as I forgot to consider the point of view of the thread not doing the last decrement of v_usecount.

In other words, if another thread drops the last use count 'between' our release and our VOP_UNLOCK(), then vp might already be freed.

The vnode cannot be freed before being vgone(), which needs to acquire the vnode lock exclusively, and this is impossible before the thread which didn't do the last decrement actually calls VOP_UNLOCK(). But the vnode could have been vgone()'d prior to all that. I think this is exactly what happens in the stack Mark reported here:

(thanks Mark for the references)

The change proposed here however does not solve the problem. The exact same race happens in vrele(), except that there is no VOP_UNLOCK() in vrele(). However, *callers* of vrele() themselves can call VOP_UNLOCK() afterwards, as vrele() doesn't impose a lock state on entry and just leaves back the vnode in the same state (modulo a LK_SHARED => LK_EXCLUSIVE upgrade).

So I think that, at least, vput_final() should not simply vdropl() when the vnode is doomed: If the vnode is locked by someone else, it should wait for it to be unlocked (e.g., using LK_SLEEPFAIL). Same problem in vdefer_inactive(). More review is needed, but I'd try something along these lines. Basically, the idea is to re-establish the invariant that a vnode cannot be freed when the vnode lock is held.

See the patch I posted in D52628.

If the vnode is in any danger of being freed (as in, it is doomed), there is a VOP_ISLOCKED check performed and dropping the ref is deferred.

This also means a consumer which vunrefs the last hold count from under itself *remains* safe because it will defer based on the above. If the vnode is not doomed, anyone who tries to doom it has to take the lock first, again waiting out the current consumer.

Once vnlru gets to that vnode it will take the lock to wait.

More review is needed, but I'd try something along these lines. Basically, the idea is to re-establish the invariant that a vnode cannot be freed when the vnode lock is held.

that's what my patch is doing