
ufs: handle LoR between snap lock and buffer lock
AbandonedPublic

Authored by kib on Jan 18 2022, 10:53 AM.

Details

Summary
A read of or write to the snapshot inode first takes the snap lock, which is
the same as the vnode lock, and then locks a buffer from the operational
range.  The copy-on-write path, on the other hand, starts with the buffer
locked and then needs to take the snap lock.  If the same buffer is
simultaneously read and written, this is a lock-order reversal (LoR).

Break the reversal by locking buffers of the snapshot inode in ffs VOPs
with LK_NOWAIT, and temporarily unlocking the snap lock when a possible
deadlock is detected.  This is similar to how ffs_update() handles the
situation.

Also

ufs: be more persistent with finishing some operations

when the vnode is doomed after a relock.  The mere fact that the vnode is
doomed does not prevent us from doing UFS operations on it while it still
belongs to UFS, which is determined by a non-NULL v_data.  Not finishing
some operations, e.g. not syncing the inode block only because the vnode
started reclamation, is not correct.

Add a macro IS_UFS() which encapsulates the v_data != NULL check, and use
it instead of VN_IS_DOOMED() in places where completing the operation is
important.

Also

buf_alloc(): lock the buffer with LK_NOWAIT

The buffer is freshly allocated and must not be accessible to any other
thread.  As such, LK_NOWAIT should be a nop, but it also prevents recording
an order between the buffer lock and any other locks we might own in the
call to getnewbuf().  In particular, if we own the FFS snap lock, it should
avoid triggering a false-positive warning.

Per-commit view is available at https://kib.kiev.ua/git/gitweb.cgi?p=deviant3.git;a=shortlog;h=refs/heads/ufs

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

kib requested review of this revision. Jan 18 2022, 10:53 AM

Also handle buffer locking in ffs_balloc2().
Add ffs_snap_relock() helper.

This patch eliminates my bufwait <-> snaplk LOR warning when doing background fsck.

This revision is now accepted and ready to land. Jan 19 2022, 12:03 AM

The buffer must not be accessed by any other thread, it is freshly allocated.

I wouldn't be surprised if something relies on type-stability of bufs and indeed violates this assumption. I still agree with the change though.

sys/ufs/ffs/ffs_balloc.c
747

Still needs to be handled?

1193

Why did it become a printf()?

sys/ufs/ffs/ffs_snapshot.c
2777

I don't quite follow: the vnode lock is held when setting VI_DOOMED and is held across the VOP_RECLAIM call, so when is there a difference between checking vp->v_data and (vp->v_iflags & VI_DOOMED) != 0? Does the vnode lock get dropped somewhere?

kib marked an inline comment as done. Jan 19 2022, 8:54 PM
kib added inline comments.
sys/ufs/ffs/ffs_balloc.c
747

In principle yes, but I am still not sure what the best action there is. Since the ffs_balloc() patch is relatively complicated, I postponed solving it. Will do it after Peter finishes testing.

1193

I believe it is right to abort (or ignore) the operation instead of panicking if external conditions prevent its completion. The bread() there could fail for many different reasons, e.g. a device error, device removal, or vnode reclamation. The possible effect is a leak of some blocks on the filesystem. Isn't that better than panicking?

sys/ufs/ffs/ffs_snapshot.c
2777

VIRF_DOOMED is set at the beginning of vgonel(). Then it calls at least vinvalbuf()->bufobj_invalbuf()->BO_SYNC() == bufsync()->VOP_FSYNC(), which for SU might trigger the deadlock avoidance, e.g. in softdep_sync_buf() returning ERELOOKUP and indicating an internal vnode relock.

The attempt to do the last inactivation descends into VOP_INACTIVE(), which might do ffs_truncate(), which also might trigger deadlock avoidance, a relock, and ERELOOKUP.

So indeed there is a difference. The intent is that if we are called from the vgonel() context, we must be able to finish the syncs. Checks for VN_IS_DOOMED() abort the operation, while checks for IS_UFS() allow it to run.

Generally looks good. Definitely want to see if Peter turns up anything. If Peter clears it, freeing newb will be necessary. Unfortunately freeing newb is going to require writing a bunch of code to unwind a lot of soft update stuff. It will be a lot easier if we reorder the code to get the buffer first and then only allocate the block once we know we will have a buffer for it. Since the buffer is hashed on lbn rather than bn we do not need to have a block allocated before we get its buffer.

sys/ufs/ffs/ffs_balloc.c
595

I suggest putting these two functions at the end of the file rather than between the two ffs_balloc routines.

990–996

Why do you not check for error after this call?

kib marked 3 inline comments as done. Jan 19 2022, 11:02 PM

Generally looks good. [...] It will be a lot easier if we reorder the code to get the buffer first and then only allocate the block once we know we will have a buffer for it. [...]

I changed the order as you suggested. In principle, the cg block buffer lock should come after any other buffer locks and after the snap lock, but we will see.

Handle buffer locking in SU getdirtybuf().
Reorder buffer and block allocation in some places in ffs_balloc_ufs2() to avoid the need to free the block.

This revision now requires review to proceed. Jan 19 2022, 11:04 PM

Changes to avoid needing to free newb look good. Really glad to not have to figure out how to do that. Other changes look right. Awaiting Peter's tests to see if you have found all the needed changes.

sys/ufs/ffs/ffs_snapshot.c
2777

I think this should be explained in a comment.

sys/ufs/ffs/ffs_softdep.c
6765
13125

What is the significance of vp here? Isn't it some arbitrary vnode that was passed by process_truncates()?

Is it actually possible for vp to refer to a snapshot here? If so, shouldn't getdirtybuf() take a vnode rather than an inode? The new parameter is only used to relock the vnode.

14770

It looks like it should be called locked_ip or something like that, since as far as I understand there is no particular relationship between bp and ip's vnode.

sys/ufs/ufs/inode.h
250

BTW, there are many inline vp->v_data != NULL checks in UFS, e.g., in ffs_softdep.c.

kib marked 6 inline comments as done. Jan 24 2022, 9:53 PM
kib added inline comments.
sys/ufs/ffs/ffs_softdep.c
13125

I used vp because it is really a locked vnode that should be accounted for when sleeping on the buffer lock. On the other hand, for most callers of getdirtybuf(), ip is readily available after looking up the inode from the inodedep, while vp would have to be reached through an extra pointer dereference.

sys/ufs/ufs/inode.h
250

I was able to find only one more unconverted case.

sys/ufs/ffs/ffs_softdep.c
13125

Ok.

sys/ufs/ufs/inode.h
250

My mistake, I think I was looking at unpatched sources by accident.

kib marked an inline comment as done. Edited Jan 27 2022, 1:49 AM

The more I try to fix the consequences of the snap relock following Peter's reports, the more I doubt the original issue.

Let's consider the ffs_read() report: we have a locked UFS vnode, from which we read some buffer, which is locked for the read. There are two cases: the vnode is a snapshot, or it is not. In the second case, the order is right from the start.

In the first case, I claim that all of the snapshot vnode's buffers come after snaplk: in both ffs_copyonwrite() and ffs_snapblkfree(), we lock snaplk and then iterate over the snapshots, taking the corresponding buffer locks as needed, and this is IMO the natural order. So in the first case, a read from the snapshot vnode, the order is right.

On the other hand, all data blocks come before snaplk. In other words, the order of the locks is (UFS vnode)->(data bufwait)->(snaplk)->(snapshot block bufwait). We could get a LoR only if a data buffer became snapshot data during one of the snapshot ops (copyonwrite or blkfree). But we never do such a rename; we always allocate a new block for the snapshot and copy the data as needed.

Let me summarize: I now think that the original report about the LoR in ffs_read() is a false positive from witness, since it cannot distinguish between snapshot and normal data buffers.

P.S. Even if I am wrong above, I do not think that the approach of relocking the snaplk is workable. For instance, in ffs_balloc_ufs2(), unlocking snaplk allows another thread to create a softupdates tracker for allocation of the same logical block, which is not yet allocated.

In D33921#770058, @kib wrote:

So more I am trying to fix consequences of the snap relock following Peter reports, more I doubt the original issue. [...]

I generally agree that it is a false positive. The specific back trace that started this discussion is this:

lock order bufwait -> snaplk attempted at:
#0 0xffffffff80c71a4d at witness_checkorder+0xbdd
#1 0xffffffff80bd0071 at lockmgr_xlock_hard+0x71
#2 0xffffffff80bd09aa at lockmgr_args+0x1fa
#3 0xffffffff80f12694 at ffs_copyonwrite+0x1a4
#4 0xffffffff80f3630f at ffs_geom_strategy+0xcf
#5 0xffffffff80cc20ec at bufwrite+0x24c
#6 0xffffffff80f0b0fd at ffs_update+0x2dd
#7 0xffffffff80f078b0 at sysctl_ffs_fsck+0x1a0
#8 0xffffffff80c11e6c at sysctl_root_handler_locked+0x9c
#9 0xffffffff80c11203 at sysctl_root+0x213
#10 0xffffffff80c118c7 at userland_sysctl+0x187
#11 0xffffffff80c116fc at sys___sysctl+0x5c
#12 0xffffffff810cb3be at amd64_syscall+0x12e
#13 0xffffffff8109c1db at fast_syscall_common+0xf8

Here we are trying to update the superblock, so we hold a lock on the devvp vnode, get the buffer for the block on devvp that holds the superblock, and then need to check whether we must copy it to a snapshot, so we lock snaplk to check.

Two cases to consider. Can we get a vnode deadlock and can we get a buffer deadlock.

If we assume that the natural order is first to lock some vnode and second to lock snaplk, then to get a vnode deadlock we need a scenario in which a process holding the snaplk vnode lock goes on to lock some other vnode.

The read path will never need to call ffs_copyonwrite(), so this problem cannot arise. The read path never locks any vnode other than the one it is reading, so we cannot deadlock. User applications cannot open the snapshot for writing (even root processes), so there are no possible deadlocks from writing.

Turning to buffer deadlocks, the only way that the snapshot can be updated is when it internally decides to make a copy. Here we need another buffer into which to grab the old contents. As you noted, this will always be a new buffer, so it cannot conflict with any existing buffer; that is not a problem.

Thus I do not see any possible deadlocks here. So the question is how we avoid the false positive in this report. Once the witness code has reported an LOR, it remembers that it has made that report and will not do so again. I am not sure how it marks that, but if there is a way to say "do not report bufwait -> snaplk", that would solve the problem.

I believe that the combination of D34072 and D34073 should be enough to properly fix the warnings. Note that D34073 is a WIP by its nature; I will ask Peter to test it after we agree on the approach and D34072 is landed.