
unionfs: rework locking scheme to only lock a single vnode
Needs Review · Public

Authored by jah on Wed, May 29, 4:51 AM.
Tags
None

Details

Reviewers
kib
markj
olce
pho
Summary

Instead of locking both the lower and upper vnodes, which is both
complex and deadlock-prone, only lock the upper vnode, or the lower
vnode if no upper vnode is present.

In most cases this is all that is needed; for the cases in which
both vnodes do need to be locked, this change also employs deadlock-
avoiding techniques such as LK_NOWAIT and vn_lock_pair().

There are still some corner cases in which the current implementation
ends up taking multiple vnode locks across different filesystems
without taking special steps to avoid deadlock; those cases have
been noted in the comments.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 58215
Build 55103: arc lint + arc unit

Event Timeline

jah requested review of this revision. Wed, May 29, 4:51 AM

These changes have proven stable in the testing I've done so far, but I still see this as a somewhat rough patch.
I expect there will be plenty of review feedback, but my hope is that what I have here can at least be used as a starting point to fix unionfs locking.

Some things to note:
--There are quite a few places in which various ideas put forth in Olivier's design doc (per-mount caching, use of vn_generic_copy_file_range(), just to name two) would clearly improve things.
--Where feasible, I've taken extra steps to use deadlock-avoiding tactics such as vn_lock_pair(), LK_NOWAIT, and exchanging locks whenever the need arises to hold multiple vnode locks in different filesystems. I've also added comments in places where this isn't so feasible to do. However, it's also quite possible that some of these extra steps are unnecessarily paranoid; with things such as the udvp->lvp transition for example, it's unclear to me under what circumstances an LOR deadlock would become a practical concern. Therefore I may end up being able to re-simplify some of this code.
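To illustrate, the general shape of the LK_NOWAIT backoff referred to above is roughly the following. This is an illustrative sketch only; the helper name and structure are hypothetical and not code from the patch, which uses vn_lock_pair() where possible.

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/vnode.h>

/*
 * Illustrative sketch only, not code from the patch: acquire a second
 * vnode's lock while already holding 'held', without ever sleeping on
 * one lock while the other is held.  Callers must still revalidate
 * both vnodes afterwards, since 'held' may have been doomed while it
 * was temporarily dropped.
 */
static void
lock_second_vnode(struct vnode *held, struct vnode *wanted)
{
        ASSERT_VOP_ELOCKED(held, __func__);
        for (;;) {
                /* Opportunistic attempt; never blocks while 'held' is held. */
                if (vn_lock(wanted, LK_EXCLUSIVE | LK_NOWAIT) == 0)
                        return;
                /* Back off completely and take the contended lock first. */
                VOP_UNLOCK(held);
                vn_lock(wanted, LK_EXCLUSIVE | LK_RETRY);
                if (vn_lock(held, LK_EXCLUSIVE | LK_NOWAIT) == 0)
                        return;
                /* Still contended the other way around; restart. */
                VOP_UNLOCK(wanted);
                vn_lock(held, LK_EXCLUSIVE | LK_RETRY);
        }
}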

Sorry, I've lagged on this one. I'll review it on Monday.

Thanks for this important cleanup on the road to the new unionfs!

The changes are quite extensive and I couldn't review all of them; I'm going to continue tomorrow. In the meantime, I'm posting my current notes.

Comments are mostly inline.

Here's a more general comment prompted by the code comment at lines 130-138. It doesn't seem realistic to me that unionfs can generally rely on something like a cache of IDs. That would imply that all filesystem types used as lower layers can actually provide unique IDs in all circumstances, which I don't think is or will be true, especially considering unionfs itself as a lower filesystem. Moreover, as we grow support for concurrent lower-layer modifications, caching IDs from other filesystems will lead to very hard consistency/lifecycle problems.

The best approach, and perhaps the only sensible one, I have seen so far is that unionfs has to rely completely on the lower-layer filesystems for all operations, and for this it must get rid of any cache it maintains of data on them. Roughly, for lookups we have to always rely on names and their resolution by the underlying filesystems, and when we need persistent handles on underlying files, we have to keep references on the corresponding vnodes.

The copy-up operation is indeed serializing, in that it cannot be started multiple times without the risk of corrupting the result, so there must be some flag indicating the operation is in progress (which can be seen as a transient cache of operations initiated by unionfs, by contrast with the other caches mentioned above; it's not enough for fully concurrent operations, but it's a start). For shadow directories, I don't think we really need that, as we can instead rely on the underlying filesystems performing the directory creation and atomically reporting whether the directory already exists (or fix them if they don't), but it doesn't hurt.

sys/fs/unionfs/union_subr.c
279–284

A minor point, but I'd rather have only two separate cases depending on whether un_uppervp is NULL, using vput() in the lower-only case to avoid unlocking un_lowervp and then calling vrele() on it (which may relock it); see the sketch below.
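Roughly, the two-case structure would be (a sketch only, using the node's existing un_uppervp/un_lowervp fields, not the actual patch code):

/* Sketch of the suggested structure, not the actual patch code. */
if (unp->un_uppervp != NULL) {
        vput(unp->un_uppervp);          /* unlocks and releases in one step */
        if (unp->un_lowervp != NULL)
                vrele(unp->un_lowervp); /* not locked in this case */
} else {
        /*
         * Lower-only node: the lower vnode holds the lock, so vput() it
         * instead of VOP_UNLOCK() followed by vrele(), which may relock.
         */
        vput(unp->un_lowervp);
}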

392–414

I agree with most of your reasoning above.

As unionfs currently always creates shadow directories, we know that when dvp is locked the corresponding upper-layer directory's vnode is locked as well, so locking uppervp always respects the "parent, then child" order.

And it is true that cross-FS locking can happen in other circumstances. The only thing is, I'm not sure why you think this case is particularly difficult to handle here. Given the previous paragraph, only the lowervp case (when uppervp is NULL) is of concern, and I think the simple suggested change attached to this comment is enough. I'm just asking in case there is something I missed, but not requesting any actual change.

418–432

You're actually changing local behavior here. If lowervp is doomed, and uppervp is not NULL, lowervp is kept in the new unionfs node, whereas it was forcibly set to NULL before. Also, if uppervp is doomed, we now bail out, whereas before we would continue with lowervp if not NULL.

Both changes make sense and were in fact in my plans. The first one leads me to wonder whether keeping the doomed lowervp, instead of setting it to NULL, can have any unwanted effect given the original code and your changes here. The answer may come as I progress through this review.

sys/fs/unionfs/union_vnops.c
130–138

I'd suggest removing the end of the comment, as I don't think this is where we are going to head. Please see my general comment as to why.

I ran a longer test with this patch and found this:

20240612 02:19:29 all (18/18): unionfs9.sh
ps -lp55088
UID   PID  PPID C PRI NI   VSZ  RSS MWCHAN  STAT TT     TIME COMMAND
  0 55088 37241 6  68  0 13088 2324 unioncp I+    0  0:00.01 find /mnt12 -type f -maxdepth 2 -ls
You have new mail.
root@mercat1:~ # procstat -k 55088
  PID    TID COMM                TDNAME              KSTACK                       
55088 100476 find                -                   mi_switch sleepq_switch sleepq_catch_signals sleepq_wait_sig _sleep unionfs_set_in_progress_flag unionfs_lookup VOP_CACHEDLOOKUP_APV vfs_cache_lookup VOP_LOOKUP_APV vfs_lookup namei kern_statat sys_fstatat amd64_syscall fast_syscall_common 
root@mercat1:~ #

https://people.freebsd.org/~pho/stress/log/log0526.txt


Thanks for testing; this looks like a dumb mistake in unionfs_clear_in_progress_flag(). I'll fix that along with the next round of changes after I gather more review feedback.

sys/fs/unionfs/union_subr.c
279–284

Good catch; I've already made the same change elsewhere in this PR specifically to avoid relocking in vrele(), but it looks like I missed this spot.

392–414

I think the comment makes the fix sound worse than it would be. Probably the biggest issue is that, as the code is currently structured, the same fix would also need to be repeated below for the case in which unionfs_ins_cached_vnode() returns an existing vnode and that vnode has to be locked.

Of course that's also yet another argument for just getting rid of that cache; I'd just prefer not to do that here, since this change is already more than big enough. I guess I could just make the simple change above, and then note the remaining locking issue below along with something like "TODO: get rid of this stupid cache".

418–432

I would expect that to be OK. A doomed lowervp would imply forced unmount of the lower FS, but that can only happen if the unionfs is first recursively unmounted. Since we effectively hold the lock on the new unionfs vnode at this point, I'd therefore expect either the insmntque1() call above to fail due to the in-progress unmount, or (if the recursive unmount fully executes after insmntque1()) the new vnode to be doomed by the vflush() shortly after we unlock it.

sys/fs/unionfs/union_vnops.c
130–138

Will change. I have no idea what the final unionfs will look like, so this comment will almost certainly be proven wrong no matter what.

sys/fs/unionfs/union_subr.c
870

wakeup() should not be conditional on unp != NULL
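In sleep/wakeup terms, the concern is roughly the following (a sketch with a hypothetical flag name and wakeup channel, not the patch code):

unp = VTOUNIONFS(vp);
if (unp != NULL)
        unp->un_flag &= ~UNIONFS_IN_PROGRESS;   /* hypothetical flag name */
/*
 * A waiter may be sleeping in unionfs_set_in_progress_flag() even if
 * the node has already been reclaimed, so the wakeup must not sit
 * inside the unp != NULL check above.
 */
wakeup(vp);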

Well, I'm almost done, but not completely (I need to finish reviewing unionfs_lock() and the lines below it). Here's a new round of comments/notes/requests.

Concerning unionfs_mkshadowdir(), I've added a series of suggested minor changes. These are mostly to clarify and simplify things (and, I suspect, marginally improve concurrency). You may as well forget about them if you think this is too much work for now (because, e.g., they would cause a new long round of testing).

The main problems I've spotted so far are some missing unlocks in unionfs_lookup() and the VV_ROOT case in unionfs_noderem(), which is probably not handled correctly.

And, to correct something I've said:

The best approach, and perhaps the only sensible one, I have seen so far is that unionfs has to completely rely on lower layer filesystems for all operations, and for this must get rid of any cache it maintains from data on them.

Well, that's true, but only for information that can be modified by the underlying filesystems, which is what I had in mind when writing that. However, the identity of the vnodes themselves is not included in that, so a cache based on vnode identities (and v_hash) is in fact perfectly feasible, and will indeed make it possible to avoid the duplicate vnode problem in lookup and elsewhere. I'll also have to check the other foreseen uses developed in the full proposal.

sys/fs/unionfs/union_subr.c
392–414

I'm not sure there is any need for a fix in unionfs_ins_cached_vnode(). Because shadow directories are systematically created at lookup, the current thread cannot have the containing directory's lower vnode locked (otherwise the cross-filesystem lock rule, i.e., not establishing any order, would already have been violated higher up in the call stack).

(I also have to go back to my project to globally simplify locking of multiple vnodes in the VFS and double-check whether it can really work.)

Removing the cache is primarily for semantics and lifecycle reasons, but indeed will have a huge simplification effect on several parts of unionfs. However, I don't think it makes much sense to do so without at the same time revising the semantics. I'll do both in the first phase of the plan.

418–432

Ah yes, dooming of the lower vnode would happen only on recursive unmount. This is also a paradigm that I will change. But that's true for the time being.

I'd like to check what could happen nonetheless after the lower vnode is doomed, because as you say, even if vflush() is eventually called, we don't really know what "shortly" means, and I'd rather not rely on that.

488–491

A small comment explaining why lvp needs to be locked would be great. What about the VV_ROOT case?

827–829

Minor simplification.

870

Yes, that's a problem, and likely the cause of @pho's deadlock.

938

(Minor.)

Label rename.

944–946

(Minor.)

udvp (with alias dvp) is locked before vput(), unlocked by it, and then relocked in unionfs_mkshadowdir_exit2. I'd suggest replacing all this dance with the minimal set of locking operations.

Label rename.

966–968

(Minor.)

Same as above.

971–972

(Minor.)

Same as above.

993–999

(Minor.)

Add a call to vn_finished_write() here, as I've removed it from the exit labels below. Label rename.

1013–1014

(Minor.)

Call vn_finished_write() earlier. This also compensates for its removal under the labels at the end, a consequence of avoiding superfluous unlock/lock cycles.

1014–1033

(Minor.)

Avoid some superfluous lock dance (see also inline comments above). Label renames.

1031

(Minor.)

Label rename.

1582–1584

I wouldn't say that VOP_OPEN() is not truly needed. The current state in the VFS is sadly completely ad hoc, some filesystems expecting it and others not, with varying degrees of consequences. I'm not sure at this point whether we will require VOP_OPEN()/VOP_CLOSE() around VOP_READDIR() or not at all; I only know we'll have to make a consistent choice.

1658

unionfs_check_rmdir() should return 0 if the directory is determined to be empty, or ENOTEMPTY if it is not, but only if the determination itself didn't hit any error. Any error from these calls should lead to bailing out immediately with that error, and the callers should be adapted to cope.

This is not a new bug. The old code was quite close, however: it would in the end convert any error to ENOTEMPTY, whereas the new code steps farther away from that.
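The requested contract, roughly (a sketch with a hypothetical probe helper, not the unionfs code itself):

static int
check_rmdir_sketch(struct vnode *vp, struct ucred *cred, struct thread *td)
{
        bool empty;
        int error;

        /* directory_is_empty() is a hypothetical stand-in for the probing calls. */
        error = directory_is_empty(vp, cred, td, &empty);
        if (error != 0)
                return (error);         /* do not fold real errors into ENOTEMPTY */
        return (empty ? 0 : ENOTEMPTY);
}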

sys/fs/unionfs/union_vnops.c
223–225

lvp must be unlocked here.

263–265

Same as above for uvp.

627–629

Please keep this warning, as I don't think this case can occur (and should be fixed if it does).

837–838

(Too bad C is not Lispy enough...)

1729–1734

VOP_RENAME() seems to be a typo here. Did you mean VOP_READDIR()?

2243–2248

VV_ROOT case? (see unionfs_noderem())

Quoting the earlier correction:

However, the identity of the vnodes themselves is not included in that, so a cache based on vnode identities (and v_hash) is in fact perfectly feasible, and will indeed make it possible to avoid the duplicate vnode problem in lookup and elsewhere.

Use of the vnode pointer as the identifier was indeed what I had in mind when I originally wrote the comment.

sys/fs/unionfs/union_subr.c
392–414

Technically, we don't proactively create shadow directories during lookup if the unionfs instance is mounted read-only (however unlikely it may be for a user to do that). But I'd also like to go back to the question I asked at the top of this review: how likely is it for this specific udvp->lvp locking sequence to present a practical risk of deadlock? In other words, how likely is it that some other agent will want to lock those two specific vnodes on different filesystems in the opposite order?

418–432

Actually I don't think this is a case we should encounter today. At this point, if the insmntque1() call hasn't already failed then the unionfs unmount operation's vflush() call should block on the lock we currently hold for the newly-allocated vnode. This should then prevent dooming of the underlying vnodes.

Perhaps these VN_IS_DOOMED() checks should instead be converted to VNASSERTs?

488–491

This is really to simplify unionfs_lock() (per the comment there) by synchronizing against any pending lock operation that may still be using the lower vnode lock. That should never be the case for the root vnode, as the root vnode should always have both an upper and lower vnode for its entire lifecycle, so unionfs_lock() should never attempt to use the lower root vnode. Moreover, if we are unmounting a non-"below" unionfs mount, this vn_lock_pair() call will assert because lvp will already be locked (as it is the covered vnode).

I'll add a comment to this effect.

944–946

These are error handling cases, so I wouldn't be very concerned about the extra locking cycles. But the other issue is that dvp may be reclaimed due to relocking of udvp in various places (unionfs_relookup, VOP_MKDIR, VOP_SETATTR). At that point, in locking terms udvp will no longer be an alias for dvp, so dvp must be explicitly relocked anyway.

I do prefer your label naming and placement of vn_finished_write() though, so I'll change those.

sys/fs/unionfs/union_vnops.c
223–225

oops, good catch

263–265

good catch, should also branch to unionfs_lookup_cleanup.

627–629

This warning actually does occur, and in fact it's extremely spammy during unionfs stress2 testing. IIRC this is because vgonel() calls VOP_CLOSE() on any active vnode, regardless of whether anything had it open.

1729–1734

Indeed, that's what I meant.

sys/fs/unionfs/union_vnops.c
627–629

Looking back over unionfs_get_node_status(), perhaps a better change would be to keep the warning but allow the close() path to use a NULL status object if one doesn't already exist, rather than allocating a new one through unionfs_get_node_status(). This would certainly be more efficient.
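Something along these lines, perhaps (a sketch only; the non-allocating lookup helper is hypothetical):

/*
 * Sketch only; unionfs_find_node_status() is a hypothetical
 * non-allocating counterpart of unionfs_get_node_status().
 */
unsp = unionfs_find_node_status(unp, td);
if (unsp == NULL) {
        /* No open bookkeeping for this process; nothing to undo on close. */
        return (0);
}
/* ... existing close() bookkeeping against unsp ... */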

New round of review. I think I have looked at all the changes now, but may do an additional quick check pass tomorrow.

I'm proposing an alternate way to deal with cross-FS locking problems (in an inline response to your question; I'll elaborate later on the details, but they are pretty straightforward).

There is a remaining bug in unionfs_getextattr. The rest is answers to your latest responses, clarifications and requests for more comments explaining some code.

sys/fs/unionfs/union_subr.c
392–414

That's right, I forgot the read-only case (which I don't think is that unlikely), mostly because assuming that the containing directory has an upper vnode is so ingrained in the current code (and OK as an assumption as long as it is made as part of a write operation). This also is something that I'll likely change (time of creation for shadow directories should become configurable).

Concerning the different-FS deadlock risk, I have a hard time finding any useful scenario where two FSes would be combined in opposite ways. That said, foot-shooting is easy, and an administrator may try to mount unionfs with the layers swapped (I'm pretty sure there is even a PR about such a scenario), which could lead to deadlocks. And that is an unacceptable system behavior, however infrequent it is. However, this doesn't mean it has to be prevented at the level of vnodes. A deadlock-avoidance scheme that would directly prevent such erroneous mounts seems a much better solution to me, and would allow simplifying a lot of code.

418–432

Since dooming of the underlying vnodes cannot happen before unmount of the union above, if insmntque1() doesn't return failure after having locked the target underlying vnode, the latter simply can't be doomed before it is unlocked. That means that the two VN_IS_DOOMED() calls can't return true, and the blocks below the ifs are effectively dead code.

Given that the unionfs vnode is itself locked (through the appropriate underlying vnode), I agree that lowervp can't be doomed, regardless of whether it was locked, because the unionfs vnode can't be doomed yet: insmntque1() checks for MNTK_UNMOUNT, and dounmount() sets this flag first thing, so before dooming the unionfs vnode, which requires its lock. I think we should add at least an assertion about lowervp.

To conclude, I'd just remove the ifs, whose blocks are dead code, and would add assertions that both underlying vnodes are not doomed. And your changes here indeed do not disturb what callers see.
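That is, something like (sketch):

/* Sketch: replace the dead VN_IS_DOOMED() branches with assertions. */
VNASSERT(lowervp == NULL || !VN_IS_DOOMED(lowervp), lowervp,
    ("%s: doomed lower vnode despite insmntque1() success", __func__));
VNASSERT(uppervp == NULL || !VN_IS_DOOMED(uppervp), uppervp,
    ("%s: doomed upper vnode despite insmntque1() success", __func__));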

488–491

And, for a below mount, calling vn_lock_pair() on the root vnode would assert as well since the upper vnode's lock is locked recursively.

Yes, a comment is warranted. Thanks.

1014

Trailing whitespace.

1015

Trailing whitespace.

1663

Superfluous whitespace.

sys/fs/unionfs/union_vnops.c
627–629

Looking back over unionfs_get_node_status(), perhaps a better change would be to keep the warning but allow the close() path to use a NULL status object if one doesn't already exist, rather than allocating a new one through unionfs_get_node_status(). This would certainly be more efficient.

Yes, I spotted that more than a year ago and still remember it. Would be fine.

There is indeed the VOP_CLOSE() pairing problem in the VFS. I had some plan at some point, but mostly forgot it and will have to think about it again.

681–684

You have to reload unp here for the ENOENT case, else this may cause a crash or an access to the wrong structure from the code at unionfs_open_abort.

713

Trailing whitespace.

715–716

Trailing whitespace.

832–837

Same trailing whitespaces as in unionfs_open().

1958–1960

Trailing whitespace.

2165

Trailing whitespace.

2239–2246

This whole block seems superfluous. There's no need to tweak flags for LK_UPGRADE or LK_DOWNGRADE: these don't make sense if the calling thread hasn't already locked the vnode, and in that case the underlying lock object can only be changed by this very thread (by an operation that requires holding the lock exclusively and returns the lock as it was), which means the current thread always knows the state of the lock. Therefore, we should just obey it.

2241

Please add a comment explaining why this is needed, along the lines of: "In-between unlock of the interlock and lock of the lower vnode, another process has a window where it could also acquire the lower vnode lock and then doom the unionfs vnode (changing its lock to v_lock) or perform an operation leading to the creation of an upper vnode, which then would be the one to lock. So we have to check for this once the lower vnode is locked, and if so restart the locking operation."

2243–2248

VV_ROOT case? (see unionfs_noderem())

Answered in unionfs_noderem(), thanks.

Please add a small comment saying that 'lvp_locked' is necessarily false on VV_ROOT.

2243–2248

Mentioning unionfs_node_update() as well here and the additional guarantee we need below.

(Proposed text has been justified by hand.)

2675–2678

unp must be reloaded and could be NULL here (as you handled it in unionfs_setextattr). Alternatively, test for error being ENOENT.

sys/fs/unionfs/union_subr.c
392–414

You might be thinking of bug 172334, which landed on me fairly recently. If there is a general way to prevent such conflicting mounts at the time the mount is requested (and if that is in fact the only case that could lead to LOR in places like this), then I agree that would definitely be preferable to a vnode-centric solution.

sys/fs/unionfs/union_vnops.c
2239

'flags' should be passed here instead of ap->a_flags, otherwise we may miss LK_CANRECURSE set above. More generally, I'll clean up the flags handling in this function so that all updates are made against the 'flags' local, with ap->a_flags only modified in the one case in which we need to pass 'ap' along.

2239–2246

I think you're right that the LK_DOWNGRADE case is superfluous: that should instead be an assert, as already holding lvp's lock exclusive should guarantee that we don't first issue the downgrade request to lvp and then find that the unionfs vnode has been doomed or copied up.

The LK_UPGRADE case is needed though: LK_UPGRADE may cause the lock to be temporarily dropped (e.g. consider the case in which the lock has multiple shared holders), creating a window in which dooming or copy-up is possible.
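The underlying revalidation pattern, independent of unionfs_lock()'s exact structure, is roughly the following (a generic sketch, not the patch code):

/*
 * Generic sketch of the stacked-filesystem revalidation pattern: if
 * the lock operation may sleep or, as with LK_UPGRADE, transiently
 * drop the lock, the vnode's lock may have been switched underneath
 * us (dooming, copy-up), so recheck and retry against the new lock.
 */
for (;;) {
        struct lock *lk = vp->v_vnlock;

        error = lockmgr(lk, flags, NULL);
        if (error != 0 || lk == vp->v_vnlock)
                break;                  /* failed, or still the right lock */
        lockmgr(lk, LK_RELEASE, NULL);  /* lock was replaced underneath: retry */
}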

sys/fs/unionfs/union_subr.c
392–414

Yes, exactly. I had listed that PR in the proposal document. I don't think we need to design/implement the mount dependencies deadlock avoidance scheme right now, as it isn't a prerequisite for the changes here. I'll address that point in the first phase as well.

944–946

But the other issue is that dvp may be reclaimed due to relocking of udvp in various places (unionfs_relookup, VOP_MKDIR, VOP_SETATTR).

Right, I have been too optimistic with this. Given that, I agree it's not worth the complexity of trying to distinguish the case where the dvp and udvp locks are still aliased just to optimize locking. Optimizing locking is not just for performance; it also gives hints about locking expectations at certain points, so it can aid understanding (or, at least, help ask the right questions).

1013–1014

Avoid an unlock/relock for uvp in the common case.

sys/fs/unionfs/union_vnops.c
2239–2246

Yes, the LK_DOWNGRADE case is superfluous for the reason you stated, so an assert would be better.

I forgot about a possible lock drop in LK_UPGRADE. You're right then, the LK_UPGRADE case where LK_EXCLUSIVE is explicitly set is absolutely necessary (and the flag has to be explicit because we are changing the lock on which to operate, and we don't have this one locked).

sys/fs/unionfs/union_subr.c
1013–1014

I think for this we'll want VOP_VPUT_PAIR(udvp, &uvp, false), as we'll be unlocking udvp while a lock is still held on its child.

sys/fs/unionfs/union_subr.c
1013–1014

You're right, this is what, e.g., kern_mkdirat() does.

The need to call VOP_VPUT_PAIR() (a misnomer) complicates the VFS interface in non-obvious ways (and is not documented at all). For a while I've had the impression that the introduction of that operation may have been ill-advised, but to be fair I haven't spent enough time yet to see whether a better alternative is possible.
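For reference, the kern_mkdirat()-style sequence being discussed looks roughly like this (a sketch only; cnp, va, udvp and uvp are assumed locals, not the exact unionfs code):

/*
 * Sketch of the kern_mkdirat()-style sequence under discussion.
 */
error = VOP_MKDIR(udvp, &uvp, &cnp, &va);
if (error == 0) {
        /* ... use uvp while both udvp and uvp are locked ... */
        /*
         * Release and unlock udvp while the newly created child uvp
         * stays locked, letting the filesystem handle any ordering
         * constraints between the two vnodes.
         */
        VOP_VPUT_PAIR(udvp, &uvp, false);
}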

jah marked 48 inline comments as done.
  • unionfs: do not create a new status object during vop_close()
  • Review feedback from olce@, fix for hung process reported by pho@

I ran tests with D45398.139944.patch for a day without observing any issues.