Details
Diff Detail
- Repository
- rG FreeBSD src repository
- Lint
Lint Not Applicable - Unit
Tests Not Applicable
Event Timeline
Did you observed this in real world?
If doing this, I believe it would be also useful to assert that the vnode is not locked on the freevnode() entry.
I think so. I saw this report, but there is no reproducer and I am not sure how the trylock can fail: https://syzkaller.appspot.com/bug?extid=38e26cf6f959e886f110
If doing this, I believe it would be also useful to assert that the vnode is not locked on the freevnode() entry.
That seems reasonable.
This assumption is wrong for the call to vdropl() under VI_DEFINACT in vgonel(). But it seems true for all other calls.
More generally, we should probably drop the LK_NOWAIT instead (see inline comment) and deal with the possible WITNESS annoyance.
sys/kern/vfs_subr.c | ||
---|---|---|
2234–2236 |
That path won't reach freevnode(). Note that in the VI_DEFINACT case, we have v_holdcnt > 1.
More generally, we should probably drop the LK_NOWAIT instead (see inline comment) and deal with the possible WITNESS annoyance.
But then who is holding the vnode lock? In the stack trace, inotify_reap() is releasing the final reference and preparing to free the vnode, so I would expect it to be unlocked. Maybe blocking for the vnode lock is the right thing to do, but why is it necessary?
If we are in freevnode(), there _must_ be no other users for its pointer. Hopefully, if trylock fails, we should see the lock owner in lk_lock, and then see what it does.
Sorry, I read the code too quickly, you're of course both right that nobody should be having a reference to the vnode inside vdropl_final(), and in particular it shouldn't be locked there. This hints to a vnode use without reference problem.
I finally managed to reproduce this locally. The panicking thread:
#15 kdb_enter (why=0xffffffff827dd8c0 <str> "panic", msg=0xffffffff827dd8c0 <str> "panic") at /home/markj/sb/main/src/sys/kern/subr_kdb.c:556 #16 0xffffffff81551a19 in vpanic (fmt=<optimized out>, fmt@entry=0xffffffff8283e320 <str> "aieee", ap=ap@entry=0xfffffe004be0f8e0) at /home/markj/sb/main/src/sys/kern/kern_shutdown.c:960 #17 0xffffffff81551735 in panic (fmt=0xffffffff8283e320 <str> "aieee") at /home/markj/sb/main/src/sys/kern/kern_shutdown.c:887 #18 0xffffffff817c0277 in freevnode (vp=0xfffffe005f619898) at /home/markj/sb/main/src/sys/kern/vfs_subr.c:2238 #19 0xffffffff817bf37c in vdropl_final (vp=0xffffffff830004e8 <panicstr>) at /home/markj/sb/main/src/sys/kern/vfs_subr.c:4006 #20 0xffffffff817ab43e in vdropl_impl (vp=0xffffffff830004e8 <panicstr>, enqueue=true) at /home/markj/sb/main/src/sys/kern/vfs_subr.c:4034 #21 0xffffffff817aa259 in vput_final (vp=0xfffffe005f619898, func=func@entry=VRELE) at /home/markj/sb/main/src/sys/kern/vfs_subr.c:3679 #22 0xffffffff817a9299 in vrele (vp=0xffffffff830004e8 <panicstr>) at /home/markj/sb/main/src/sys/kern/vfs_subr.c:3704 #23 0xffffffff817813e1 in inotify_free_watch (watch=0xfffffe0003a78140) at /home/markj/sb/main/src/sys/kern/vfs_inotify.c:383 #24 inotify_reap (arg=<optimized out>, pending=<optimized out>) at /home/markj/sb/main/src/sys/kern/vfs_inotify.c:416 #25 0xffffffff8165d80d in taskqueue_run_locked (queue=queue@entry=0xfffffe0003b6ad00) at /home/markj/sb/main/src/sys/kern/subr_taskqueue.c:517 #26 0xffffffff8165f668 in taskqueue_thread_loop (arg=<optimized out>) at /home/markj/sb/main/src/sys/kern/subr_taskqueue.c:829 #27 0xffffffff814a79fc in fork_exit (callout=0xffffffff8165f550 <taskqueue_thread_loop>, arg=<optimized out>, frame=0xfffffe004be0ff40) at /home/markj/sb/main/src/sys/kern/kern_fork.c:1153
the lock holder:
#12 0xffffffff822f7fff in lapic_handle_timer (frame=0xfffffe004bfdd660) at /home/markj/sb/main/src/sys/x86/x86/local_apic.c:1324 #13 <signal handler called> #14 __sanitizer_cov_trace_pc () at /home/markj/sb/main/src/sys/kern/subr_coverage.c:98 #15 0xffffffff817b223f in vop_unlock_debugpre (ap=ap@entry=0xfffffe004bfdd7a0) at /home/markj/sb/main/src/sys/kern/vfs_subr.c:6030 #16 0xffffffff82305607 in VOP_UNLOCK_APV (vop=0xffffffff833b7fc0 <dead_vnodeops>, a=a@entry=0xfffffe004bfdd7a0) at vnode_if.c:2157 #17 0xffffffff817aab57 in VOP_UNLOCK (vp=0xfffffe005f619898) at ./vnode_if.h:1146 #18 vput (vp=vp@entry=0xfffffe005f619898) at /home/markj/sb/main/src/sys/kern/vfs_subr.c:3718 #19 0xffffffff817d10cb in kern_funlinkat (td=td@entry=0xfffffe0003dc7780, dfd=dfd@entry=-100, path=0x200000000040 "/dev/input/eventN", fd=fd@entry=-200, pathseg=pathseg@entry=UIO_USERSPACE, flag=flag@entry=0, oldinum=0) at /home/markj/sb/main/src/sys/kern/vfs_syscalls.c:2103 #20 0xffffffff817d0631 in sys_unlink (td=0xfffffe0003dc7780, uap=0xfffffe0003dc7ba8) at /home/markj/sb/main/src/sys/kern/vfs_syscalls.c:1973 #21 0xffffffff82140562 in syscallenter (td=0xfffffe0003dc7780) at /home/markj/sb/main/src/sys/amd64/amd64/../../kern/subr_syscall.c:193 #22 amd64_syscall (td=0xfffffe0003dc7780, traced=0) at /home/markj/sb/main/src/sys/amd64/amd64/trap.c:1208
So the vnode is being removed, had vgone() called on it. Then kern_funlinkat() calls vput() to release the second-last reference and unlock the vnode. Before the unlock, inotify_reap() releases the last vnode reference. Because the vnode is already doomed, vput_final() will not lock the vnode before it calls freevnode(), so the trylock in freevnode() can legitimately fail.
It is not clear to me that it is safe to block on the vnode lock here, i.e., the witness warning mentioned in the comment is perhaps legitimate.
I could make inotify_reap() acquire the vnode lock first, but that is just a workaround IMO.
Maybe vinactive() should be responsible for cleaning v_pollinfo?
IMO the bug is there:
vput()
if (!refcount_release(&vp->v_usecount)) { VOP_UNLOCK(vp); return; }
vput() released the reference, but still operates on the vnode, unlocks it.
Probably yes. I wrote a patch to rework vput() to avoid this, but discovered commit 74c4b7cc607c4, so we cannot easily drop the vnode lock first.
I do not see any correct way to fix it other than to ref v_holdcount around vrele/vput/vunref and then do final vdrop().
We might nano-optimize it by using double-CAS to drop v_usecount and simultaneously take v_holdcount, but it is the best I see.
The code can go back to doing unlock followed by unref. The crux of the race was a half-constructed vnode hanging out in the hash and being unexpectedly used by someone else.
ufs got patched by me in 6c44a3e019affbf4f22e8b9abedc249c00b28afb and by kib in 2c7ada9917383b6135568e2c81c0116d124bd4a2 to always vgone bad vnodes before unlocking.
Since then I introduced vnode state tracking in 829f0bcb5fe24bb523c5a9e7bd3bb79412e06906 which asserts that the filesystem either marked the vnode as fully constructed or called vgone on it before the unlock, validating the race is no longer possible.
Iow the revert at hand can be reverted.
This boots: https://people.freebsd.org/~mjg/vnode_unlock_unref.diff
I presume there will be extra patches and pho is going to get asked to test, so I'm going to hold off from committing myself and instead ask someone folds it into their patchset if this is going to go in.
The issue of dropping the last ref and still holding the lock does not disappear though because of vunref.
Also note the vput caller might already hold the lock the right way while this is the last reference, so this can add single-threaded work.
It is not clear to me that it is safe to block on the vnode lock here, i.e., the witness warning mentioned in the comment is perhaps legitimate.
It probably is safe in the sense nothing would deadlock from it in the current kernel, but it also looks like a thing which would start getting in the way at some point.
Maybe vinactive() should be responsible for cleaning v_pollinfo?
I presume you are fishing for a spot which holds the vnode lock. vgonel should be able to do it. A good spot would be after advisory locks are sorted out. As is I can't tell if any extra handling might be necessary apart from mere destroy_vpollinfo (maybe it will need to become a dummy value indicating "don't look at it"?).
Note I proposed a fix to the panic which is orthogonal to the vput change.
So I slept on it. Per my previous remark about possible extra locking *and* not completely taking care of the issue anyway due to vunref, I don't think there is any benefit of this going in (and there is a visible loss) and consequently I'm not going to sign-off on it., but I'm not going to try to stand in the way either.
If someone was committed to eradicating holding the vnode lock while there are no references held that would be a different story. All that said, given the triviality in terms of how the diff looks like, I would have no comments if someone wants this in (or an equivalent) and just folds it into their own change without credit. Note vput_final would use some refactoring, could be done while messing with vput.
What about something like the following (ignoring vunref() for a moment)?
if (refcount_release_if_last(&vp->v_usecount)) vput_final(vp, VPUT); else { VOP_UNLOCK(vp); vrele(vp); }
and consequently I'm not going to sign-off on it., but I'm not going to try to stand in the way either.
If someone was committed to eradicating holding the vnode lock while there are no references held that would be a different story. All that said, given the triviality in terms of how the diff looks like, I would have no comments if someone wants this in (or an equivalent) and just folds it into their own change without credit. Note vput_final would use some refactoring, could be done while messing with vput.
Looking at vunref() for a bit and taking fdesc_pathconf() as an example:
vref(vp); VOP_UNLOCK(vp); error = kern_fpathconf(...); vn_lock(vp, LK_SHARED | LK_RETRY); vunref(vp); return (error);
The vnode can be vgone()d while the lock is dropped. Suppose usecount == 2 and the vunref() decrements it to 1. Then the VOP_UNLOCK in the VOP_PATHCONF caller can unlock a freed vnode. I don't see a way to fix this without either acquiring the vnode lock in freevnode() in order to provide a barrier (and this sounds deadlock-prone), or changing the VOP contract to permit returning with the vnode unlocked if it is doomed.
Whatever, this should be fixed. Since the original author attitude is that he does not care, please go ahead.
and consequently I'm not going to sign-off on it., but I'm not going to try to stand in the way either.
If someone was committed to eradicating holding the vnode lock while there are no references held that would be a different story. All that said, given the triviality in terms of how the diff looks like, I would have no comments if someone wants this in (or an equivalent) and just folds it into their own change without credit. Note vput_final would use some refactoring, could be done while messing with vput.
Looking at vunref() for a bit and taking fdesc_pathconf() as an example:
vref(vp); VOP_UNLOCK(vp); error = kern_fpathconf(...); vn_lock(vp, LK_SHARED | LK_RETRY); vunref(vp); return (error);The vnode can be vgone()d while the lock is dropped. Suppose usecount == 2 and the vunref() decrements it to 1. Then the VOP_UNLOCK in the VOP_PATHCONF caller can unlock a freed vnode. I don't see a way to fix this without either acquiring the vnode lock in freevnode() in order to provide a barrier (and this sounds deadlock-prone), or changing the VOP contract to permit returning with the vnode unlocked if it is doomed.
I do not understand what you are saying there. Which specific VOP_PATHCONF() would access the freed vnode?
Regarding vunref(), it was added to be used in vnode_pager_dealloc(). There the vnode is locked, and we need to drop the use reference held by the dying vm object. This is normally happens from the reclamation path, and desire was to do it without allowing other thread to see half-reclaimed vnode.
This would reduce scalability as it would replace fetchadd with a fcmpset loop.
vput_final(vp, VPUT);else {
VOP_UNLOCK(vp); vrele(vp);}
> and consequently I'm not going to sign-off on it., but I'm not going to try to stand in the way either. > > If someone was committed to eradicating holding the vnode lock while there are no references held that would be a different story. All that said, given the triviality in terms of how the diff looks like, I would have no comments if someone wants this in (or an equivalent) and just folds it into their own change without credit. Note vput_final would use some refactoring, could be done while messing with vput. Looking at vunref() for a bit and taking fdesc_pathconf() as an example:vref(vp);
VOP_UNLOCK(vp);
error = kern_fpathconf(...);
vn_lock(vp, LK_SHARED | LK_RETRY);
vunref(vp);
return (error);The vnode can be vgone()d while the lock is dropped. Suppose usecount == 2 and the vunref() decrements it to 1. Then the VOP_UNLOCK in the VOP_PATHCONF caller can unlock a freed vnode. I don't see a way to fix this without either acquiring the vnode lock in freevnode() in order to provide a barrier (and this sounds deadlock-prone), or changing the VOP contract to permit returning with the vnode unlocked if it is doomed.
Right, I don't know if this vunref business is tracktable. Holding the vnode lock is pretty entrenched.
At the same time I noted for the issue at hand you can probably just move the pollinfo destruction into vgone, which should avoid the entire problem.
This case is only the demonstration. The vnode cannot be operated upon after thread drops the reference its own.
I understand, but it fixes the race for vrele() consumers without requiring an extra vnode unlock+lock when transitioning 1->0 in the common case.
vput_final(vp, VPUT);else {
VOP_UNLOCK(vp); vrele(vp);}
> and consequently I'm not going to sign-off on it., but I'm not going to try to stand in the way either. > > If someone was committed to eradicating holding the vnode lock while there are no references held that would be a different story. All that said, given the triviality in terms of how the diff looks like, I would have no comments if someone wants this in (or an equivalent) and just folds it into their own change without credit. Note vput_final would use some refactoring, could be done while messing with vput. Looking at vunref() for a bit and taking fdesc_pathconf() as an example:vref(vp);
VOP_UNLOCK(vp);
error = kern_fpathconf(...);
vn_lock(vp, LK_SHARED | LK_RETRY);
vunref(vp);
return (error);The vnode can be vgone()d while the lock is dropped. Suppose usecount == 2 and the vunref() decrements it to 1. Then the VOP_UNLOCK in the VOP_PATHCONF caller can unlock a freed vnode. I don't see a way to fix this without either acquiring the vnode lock in freevnode() in order to provide a barrier (and this sounds deadlock-prone), or changing the VOP contract to permit returning with the vnode unlocked if it is doomed.Right, I don't know if this vunref business is tracktable. Holding the vnode lock is pretty entrenched.
At the same time I noted for the issue at hand you can probably just move the pollinfo destruction into vgone, which should avoid the entire problem.
It fixes the syzbot reports, but there is still a bug. A thread may unlock a vnode after it has been freed to the UMA zone. This can cause memory corruption if the slab is subsequently freed.
How?
Maybe I missed something in the previous comments.
The lock is supposed to be preventing the free to uma.
It used to be you had to have a ref of some sort to lock in the first place.
Whatever it is, things can be arranged so that a held vnode lock effectively gates the actual freeing of the object.
No, it is not that for long time. LK_DRAIN was removed and I doubt that it will ever return.
It used to be you had to have a ref of some sort to lock in the first place.
Whatever it is, things can be arranged so that a held vnode lock effectively gates the actual freeing of the object.
Only hold reference keep vnode from freeing.
Suppose a thread is executing the code fragment above, and the vnode is vgone()ed while the vnode lock is dropped. Then the thread calls vunref(), and the VOP_PATHCONF caller eventually unlocks the vnode.
It is possible that the vunref() call freed the vnode, no?
vunref() is paired with the vref() in the cited code fragment. But, when we enter the VOP, there must be already a reference on the vnode, so the vref()/vunref() pair seems to be redundant.