Page MenuHomeFreeBSD

Handle LoR in flush_pagedep_deps().
ClosedPublic

Authored by kib on Aug 20 2020, 5:35 PM.

Details

Summary

When operating in SU or SU+J mode, ffs_syncvnode() might need to instantiate other vnode by inode number while owning syncing vnode lock. Typically this other vnode is the parent of our vnode, but due to renames occuring right before fsync (or during fsync when we drop the syncing vnode lock, see below) it might be no longer parent.

More, the called function flush_pagedep_deps() needs to lock other vnode while owning the lock for vnode which owns the buffer, for which the dependencies are flushed. This creates another instance of the same LoR as was fixed in softdep_sync().

Put the generic code for safe relocking into new SU helper get_parent_vp() and use it in flush_pagedep_deps(). The case for safe relocking of two vnodes with undefined lock order was extracted into vn helper vn_lock_pair().

Due to call sequence ffs_syncvnode()->softdep_sync_buf()->flush_pagedep_deps(), ffs_syncvnode() indicates with ERELOOKUP that passed vnode was unlocked in process, and can return ENOENT if the passed vnode reclaimed. All callers of the function were inspected.

Because UFS namei lookups store auxiliary information about directory entry in in-memory directory inode, and this information is then used by UFS code that creates/removed directory entry in the actual mutating VOPs, it is critical that directory vnode lock is not dropped between lookup and VOP.
For softdep_prelink(), which ensures that later link/unlink operation can proceed without overflowing the journal, calls were moved to the place where it is safe to drop processing VOP because mutations are not yet applied. Then, ERELOOKUP causes restart of the whole VFS operation (typically VFS syscall) at top level, including the re-lookup of the involved pathes.
[Note that we already do the same restart for failing calls to vn_start_write(), so formally this patch does not introduce new behavior]

Similarly, unsafe calls to fsync in snapshot creation code were plugged. A possible view on these failures is that it does not make sense to continue creating snapshot if the snapshot vnode was reclaimed due to forced unmount.

Patch adds a framework that for DIAGNOSTICS builds tracks exclusive vnode lock generation count. This count is memoized together with the lookup metadata in directory inode, and we assert that accesses to lookup metadata are done under the same lock generation as they were stored.

In collaboration with: pho
Reported by: syzkaller (through markj)

This diff contains the following technically independent parts:

  1. vn_lock_pair()
  2. ERELOOKUP handling at top level of VFS.
  3. Move of softdep_prelink() to places in the UFS VOPs flow where it is safe to abort VOP execution still.
  4. Code for safe instantiation of vnodes by inode number, while owning other vnodes and buffers locks, mostly introduction and use of get_parent_vp().
  5. DIAGNOSTICS framework to track and check UFS vnode exclusive lock generation and corresponding lookup auxiliary data.
  6. Some local fixes for VOP_FSYNC()/ffs_syncvnode calls where they cannot be safely done, mostly because we own more locks than ffs_syncvnode() knows about.

Diff Detail

Repository
rS FreeBSD src repository
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
sys/ufs/ffs/ffs_softdep.c
3178 ↗(On Diff #76043)

These three lines are the reason why the current patch does not work in +J case. Really, we cannot call ffs_syncvnode() this way.

There are more issues in -J case as well, but I did not analyzed it completely.

sys/ufs/ffs/ffs_snapshot.c
398 ↗(On Diff #76043)

I mean, before we'd call ffs_syncvnode() even if cgaccount() returned an error. I can't tell if it is intentional or not.

sys/ufs/ffs/ffs_snapshot.c
398 ↗(On Diff #76043)

It is safe to call ffs_syncvnode() in the sense that it would not destroy any state. What I wanted to avoid is to obliterate error from cgaccount(). If we get an error for any reason, the snapshot creation is aborted and the vnode is vput(). Syncing it before the abort is mostly a nop from the functional PoV.

The get_parent_vp() function seems to be handling two different problems.
In its first instance it is dealing with locking its parent. Here we already have the function vn_vget_ino() which presumably could be used to handle this situation.
In its second and third instances it is avoiding a LOR of acquiring an inode lock while holding a buffer lock. Here we need a function like the proposed get_parent_vp(). However, we will no longer need the first (vp) argument as we are acquiring a child inode so do not have to release the already held parent inode before acquiring the child inode.

sys/ufs/ffs/ffs_snapshot.c
372 ↗(On Diff #76043)

Unrelated to this patch, but the third argument should be DATA_ONLY.
The point here is to flush dirty data, not to get it coherent on disk (that happens below).
Indeed we are doing unnecessary extra writes because we are flushing an indirect block that we are still filling with block pointers.

399 ↗(On Diff #76043)

Unrelated to this patch, but the third argument should be DATA_ONLY.
The point here is to flush dirty data, not to get it coherent on disk (that happens below).
Indeed we are doing unnecessary extra writes because we are flushing an indirect block that we are still filling with block pointers.

sys/ufs/ffs/ffs_softdep.c
1391 ↗(On Diff #76043)

This function needs a comment. Here is a first cut assuming I understand its purpose:

/*
 * This function fetches inode inum on mount point mp. As we
 * already hold a locked buffer, we must not block on acquiring
 * the new inode lock as we will get into a lock-order reversal
 * with the buffer lock and possibly get a deadlock. Thus we
 * must unlock the buffer before doing a blocking lock for the
 * inode. We return ERESTART if we have had to unlock the buffer
 * so that the function can reassess its state.
 */
1406 ↗(On Diff #76043)

Shouldn't this parameter be pvp (pvp is the parameter whose mode we are checking)?

12686 ↗(On Diff #76043)

Why can't we just use vn_vget_ino() here?

The get_parent_vp() function seems to be handling two different problems.
In its first instance it is dealing with locking its parent. Here we already have the function vn_vget_ino() which presumably could be used to handle this situation.
In its second and third instances it is avoiding a LOR of acquiring an inode lock while holding a buffer lock. Here we need a function like the proposed get_parent_vp(). However, we will no longer need the first (vp) argument as we are acquiring a child inode so do not have to release the already held parent inode before acquiring the child inode.

It is basically two vnodes + one buffer, and all of them needs to be locked. Since we cannot safely relock the buffer, the current loop which unlocks buffer after first non-blocking attempt failed, avoids almost identical copy of the function that would deal just with two unrelated vnodes.

There are at least two very serious problems with the patch as is, I probably should have removed reviewers from it for now. But since people read the review, let me explain.

First issue is that softdep_prelink(), in +J case, calls ffs_syncvnode(dvp) and ffs_syncvnode(vp), while both dvp and vp are locked. This is a guaranteed LoR, which cannot be solved inside ffs_syncvnode() because it does not know that some other vnode is locked.

Second issue is that get_parent_vp() relocks dvp, which allows other threads to invalidate dp->i_offset, causing havoc for ufs directory entries manipulation, and to ufs dirhash.

I do not have a good plan how to handle either of the issues.

sys/ufs/ffs/ffs_softdep.c
12686 ↗(On Diff #76043)

No, unfortunately not. Issue is that the recorded ino might be no longer the parent of the vp. The example which started this patch actually flips parent and child around. We have to assume that two nodes are unrelated and engage into the algorithm which never sleeps with either of vnode locked.

Current snap of the patch.

Added a couple of inline comments.

sys/ufs/ffs/ffs_softdep.c
3222 ↗(On Diff #78375)

It would be useful to have a comment describing what this function does.

3242 ↗(On Diff #78375)

It would be useful to have a comment describing what this function does. Presumably similar to the one for the next function below (softdep_prelink()).

kib marked 2 inline comments as done.Oct 19 2020, 11:30 AM
kib added inline comments.
sys/ufs/ffs/ffs_softdep.c
3222 ↗(On Diff #78375)

This is just a helper for softdep_prerename().

3242 ↗(On Diff #78375)

I did it, and also updated comment for softdep_prelink() which now follows different protocol and must be called earlier in VOPs, before any mutation is done.

But it is premature to document this work IMO, it is still not finished from the code PoV. Peter reported at least one more unfixed bug.

kib marked 2 inline comments as done.

Document softdep_pre*().

Rebase.
Fix build for !DIAGNOSTICS.
Latest batch of the fixes for bugs reported by Peter.

Last batch of fixes.
Passed full stress2 run.

kib marked an inline comment as done.

Provide extensive comment about get_parent_vp(), which in essence overviews the whole patch with its ERELOOKUP mechanism.

This is an impressive piece of work, what a lot of effort to fix this LOR. Took a couple of hours to review, but overall looks good. A couple of minor inline comments.

sys/ufs/ffs/ffs_softdep.c
624 ↗(On Diff #78863)

This is missing the third (int) parameter.

sys/ufs/ufs/ufs_vnops.c
1147 ↗(On Diff #78863)

What is this?

kib marked 2 inline comments as done.Sat, Oct 31, 11:14 PM

This is an impressive piece of work, what a lot of effort to fix this LOR. Took a couple of hours to review, but overall looks good. A couple of minor inline comments.

It is not just a fix for this specific LoR. I would consider this patch as a pass over all uses of ffs_syncvnode(), which ensures that we only use the function in safe context, where other vnodes and buffers locks are not taken.

sys/ufs/ufs/ufs_vnops.c
1147 ↗(On Diff #78863)

Right, this should call softdep_prelink() as well. This leaked from earlier prototype but probably was not tested enough because whiteouts are too obscure.

kib marked an inline comment as done.

Do softdep_prelink() from ufs_whiteout.
Make !SU stab for softdep_prelink() compilable.

This revision is now accepted and ready to land.Sun, Nov 1, 4:59 PM
sys/kern/vfs_vnops.c
279 ↗(On Diff #79022)

After r367130 I believe you need a NDREINIT here.

kib marked an inline comment as done.

Add missed RDREINIT in vn_open_cred().

This revision now requires review to proceed.Sun, Nov 1, 8:40 PM
sys/kern/vfs_syscalls.c
3727 ↗(On Diff #79045)

While you're here, could this use of -1 be converted to use a proper errno value?

sys/kern/vfs_vnops.c
3331 ↗(On Diff #79045)

This sentence is a bit confusing since "vp1" and "vnode" refer to the same thing. Maybe, "vp1_locked indicates whether vp1 is exclusively locked; if not, vp1 must be unlocked."

3334 ↗(On Diff #79045)

The function

3379 ↗(On Diff #79045)

So we may sleep up to 100ms, assuming hz=1000? That seems like a long time. I thought typical vnode lock hold times would be much smaller than that, unlike buf locks.

It might be nicer to write 100 as a function of hz. hz=100 is a common setting in guest VMs.

sys/sys/vnode.h
773 ↗(On Diff #79045)

Group it with vn_lock() above?

sys/ufs/ffs/ffs_softdep.c
1419 ↗(On Diff #79045)

I think it's supposed to be "Thus if"

1421 ↗(On Diff #79045)

"before blocking on the vnode lock"

1423 ↗(On Diff #79045)

s/function/caller/ perhaps

3254 ↗(On Diff #79045)

"prehandle" is a pretty generic name. If it's supposed to be a softdep_prerename() helper, maybe softdep_prerename_vnode()?

3354 ↗(On Diff #79045)

Is it int on purpose? Some callers are using bool literals.

kib marked 17 inline comments as done.Tue, Nov 10, 9:04 PM
kib added inline comments.
sys/kern/vfs_vnops.c
3379 ↗(On Diff #79045)

We should own the vnode lock while we own buffer lock. There are exceptions, like we do not lock devvp for io, and async io 'gets out of vnode lock' when buffer lock owner is reassigned LK_KERNPROC. But otherwise, we keep vnode lock for the duration of io.

As consequence, vnode lock could be held for quite long time. For instance, on busy HDD 100 ms is completely normal, and I can regularly see peaks up to several seconds.

On the other hand, devices with deep queues and low latency like good nvme provide very different vnode lock hold times, sure.

sys/ufs/ffs/ffs_softdep.c
3254 ↗(On Diff #79045)

I considered to use it in softdep_prelink() as well, but not now.

3354 ↗(On Diff #79045)

K&R definition + bool look strange to me.

kib marked 3 inline comments as done.

Handled Mark' notes.

markj added inline comments.
sys/kern/vfs_vnops.c
3379 ↗(On Diff #79406)

Missing ws around /

3392 ↗(On Diff #79406)

Same here.

3379 ↗(On Diff #79045)

Should we perhaps add a debug counter for pause() calls?

This revision is now accepted and ready to land.Thu, Nov 12, 2:13 PM
kib marked 3 inline comments as done.

Unify code to sleep random interval, add counter.

This revision now requires review to proceed.Thu, Nov 12, 6:25 PM
This revision is now accepted and ready to land.Thu, Nov 12, 6:55 PM
This revision was automatically updated to reflect the committed changes.