Page MenuHomeFreeBSD

unionfs_rename: fix numerous locking issues
ClosedPublic

Authored by jah on Apr 13 2024, 10:37 PM.
Tags
None
Referenced Files
Unknown Object (File)
Sun, May 19, 1:03 PM
Unknown Object (File)
Fri, May 3, 10:21 AM
Unknown Object (File)
Mon, Apr 29, 1:27 AM
Unknown Object (File)
Fri, Apr 26, 8:40 PM
Unknown Object (File)
Fri, Apr 26, 5:10 AM
Unknown Object (File)
Sat, Apr 20, 8:22 PM
Unknown Object (File)
Apr 18 2024, 6:25 PM
Unknown Object (File)
Apr 15 2024, 8:45 PM
Subscribers

Details

Summary

There are a few places in which unionfs_rename() accesses fvp's private
data without holding the necessary lock/interlock. Moreover, the
implementation completely fails to handle the case in which fdvp is not
the same as tdvp; in this case it simply fails to lock fdvp at all.
Finally, it locks fvp while already holding tvp's lock (if tvp is not
NULL and not the same as tdvp), but makes no attempt to deal with
possible LOR there.

Fix this by optimistically using the vnode interlock to protect
the short accesses to fdvp and fvp private data, sequentially.
If a file copy or shadow directory creation is required to prepare
the upper FS for the rename operation, fdvp and fvp must be locked,
so in this case the interlock as well as the tdvp (if different
from fdvp) and tvp locks are dropped prior to locking fdvp and fvp.

This change also removes an apparently-useless call to
unionfs_relookup_for_delete():
--For the case in which fdvp and tdvp are the same, the subsequent

call to unionfs_relookup_for_rename() on tdvp will drop fdvp's
lock, effectively rendering the prior relookup_for_delete()
useless.

--If fdvp and tdvp are different, the underlying filesystem must

be prepared for the rename operation to be issued without locks
held on fdvp/fvp anyway.

Most filesystems seem to unconditionally unlock and then relock
all (potentially up to 4) vnodes in a deadlock-avoiding manner,
so the usefulness of any relookup in unionfs_rename() is
questionable to begin with.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

jah requested review of this revision.Apr 13 2024, 10:37 PM
sys/fs/unionfs/union_vnops.c
1304

TBH I'm skeptical of the usefulness of *any* of the various unionfs_relookup operations. The locking around them is often questionable or wrong, and given that relocking is quite common for modern filesystems with any nontrivial VOP anyway, I'm unsure what they actually accomplish on behalf of the underlying FS.

Could you try to (greatly) simplify unionfs rename by using ERELOOKUP? For instance, it can be split into two essentially independent cases: 1. need to copy fdvp from lower to upper (and return ERELOOKUP) 2. Just directly call VOP_RENAME() on upper if copy is not needed.

This revision is now accepted and ready to land.Apr 14 2024, 2:28 PM
In D44788#1020860, @kib wrote:

Could you try to (greatly) simplify unionfs rename by using ERELOOKUP? For instance, it can be split into two essentially independent cases: 1. need to copy fdvp from lower to upper (and return ERELOOKUP) 2. Just directly call VOP_RENAME() on upper if copy is not needed.

I think that's a good idea; ERELOOKUP is probably what we really want to use in most (all?) of the cases in which we currently use unionfs_relookup_*. There will be some penalty for making the vfs_syscall layer re-run the entire lookup instead of re-running only the last level, but those cases are never on the fast path anyway.

I ran a 5 hour test with D44788.137005.patch added. I did not observe any issues.

olce requested changes to this revision.Apr 15 2024, 5:26 PM

The main problem that I see with these changes is that they lead to dropping support for FSes with non-recursive locking (see some of the inline comments). I think there are also some minor problems in the locking/relookup logic (see the inline comments as well).

Besides, unionfs_rename() still has numerous problems beyond locking, and I'm wondering if it's worth it for everybody to pursue into this direction before I've started the unionfs project, which will include an overhaul of its fundamentals. It will probably take less time to mostly rewrite it from there then to try to fix all these deficiencies, especially given that the most fundamental ones are not readily visible even with runs of stress2.

In D44788#1020876, @jah wrote:
In D44788#1020860, @kib wrote:

Could you try to (greatly) simplify unionfs rename by using ERELOOKUP? For instance, it can be split into two essentially independent cases: 1. need to copy fdvp from lower to upper (and return ERELOOKUP) 2. Just directly call VOP_RENAME() on upper if copy is not needed.

Splitting the need to copy-up before calling VOP_RENAME() is a necessity, independently of ERELOOKUP, to be able to restart/cancel an operation that is interrupted (by, e.g., a system crash). With ERELOOKUP, part of the code should go into or be called from unionfs_lookup() instead. I doubt this will simplify things per se, i.e., more than extracting the code to a helper function would do. Later on, as placeholders are implemented, no such copy should even be necessary, which makes apparent that unionfs_lookup() is not a good place to make that decision/undertake that action.

I think that's a good idea; ERELOOKUP is probably what we really want to use in most (all?) of the cases in which we currently use unionfs_relookup_*. There will be some penalty for making the vfs_syscall layer re-run the entire lookup instead of re-running only the last level, but those cases are never on the fast path anyway.

ERELOOKUP restarts the lookup at the latest reached directory, so not sure which penalty you are talking about.

sys/fs/unionfs/union_vnops.c
1203

I'd remove back this addition, which slightly obfuscates the code. tvp can never be tdvp (even ufs_rename() tests for that case, which should be removed as well). We should check for that in the generic machinery instead (kern_frenameat() takes care of this case). In the meantime, executing KASSERT_UNIONFS_VNODE() twice in this case doesn't hurt (anyway, we'll encounter more problems down the road).

1214–1215

VTOUNIONFS() already does the same check as KASSERT_UNIONFS_VNODE(), so this check is redundant.

1236

Same here.

1248

I'd set the boolean in the if only, and rename it to relock_tvp.

1249

Same here.

1271–1281

I was wondering what this code was about exactly, because both unionfs_copyfile() and unionfs_mkshadowdir() should not modify the state of any locks, but actually they do because of vfs_relookup(), which locks and then unlocks the start directory.

I now see a fundamental problem with the new code: It doesn't support FSes with non-recursive locking (FFS explicitly supports that, but, e.g., tmpfs doesn't). That's due to locking fdvp, by the vn_lock(fdvp, LK_EXCLUSIVE | LK_RETRY) above (well, if tdvp != fdvp), which the old code doesn't do.

First option is not to lock fdvp at all here. unionfs_copyfile() accesses the containing directory of the union node representing the file to copy via un_dvp and unionfs_mkshadowdir() passes rfdvp, both with an active reference so the vnode can't vanish (but may become doomed). I'm not saying this is "correct", I've in fact planned to completely remove un_dvp. But that would be doable in the meantime.

Second option is to remove vfs_relookup() calls from unionfs and provide a simpler, correct replacement routine that can take the start directory locked. With the invariant of this routine always keeping the start directory lock, you can then just remove this if (fdvp == tdvp) block completely. vfs_relookup() is itself severely outdated and anyway needs to die (or be rewritten) completely, but that's another matter. And don't get me started on the VFS' rename protocol... (this also, I'll likely change).

I'd much prefer the second option, since I see it as future-proof. I'm also now questioning whether trying to fix more locking issues is worth it before the unionfs rewrite has started, since doing so entails more and more digging into unionfs internals, with global impacts on the whole FS. Lots of these internals need to be rewritten anyway and a non-negligible part of the current problems will simply vanish doing so (and some are unfixable without doing so).

1304

Relookup is essential given the current VOP_RENAME() contract as soon as you have to unlock tdvp, which you're essentially doing above when unlocking to copy-up the source file. I don't think that relookup for the source directory and file is necessary at this stage.

1319–1320

We have to go through target relookup as soon as we effectively had to unlock tdvp.

1320

If you chose option 2 above, then this test simply becomes unnecessary (since relookup_tdvp can't be set in this case).

1325
1331

Same here.

1331

Second round of change: Just use the boolean set when unlocked.

1332–1333

I'd take the opportunity to remove this original dead code.

This revision now requires changes to proceed.Apr 15 2024, 5:26 PM

The main problem that I see with these changes is that they lead to dropping support for FSes with non-recursive locking (see some of the inline comments). I think there are also some minor problems in the locking/relookup logic (see the inline comments as well).

Besides, unionfs_rename() still has numerous problems beyond locking, and I'm wondering if it's worth it for everybody to pursue into this direction before I've started the unionfs project, which will include an overhaul of its fundamentals. It will probably take less time to mostly rewrite it from there then to try to fix all these deficiencies, especially given that the most fundamental ones are not readily visible even with runs of stress2.

In D44788#1020876, @jah wrote:
In D44788#1020860, @kib wrote:

Could you try to (greatly) simplify unionfs rename by using ERELOOKUP? For instance, it can be split into two essentially independent cases: 1. need to copy fdvp from lower to upper (and return ERELOOKUP) 2. Just directly call VOP_RENAME() on upper if copy is not needed.

Splitting the need to copy-up before calling VOP_RENAME() is a necessity, independently of ERELOOKUP, to be able to restart/cancel an operation that is interrupted (by, e.g., a system crash). With ERELOOKUP, part of the code should go into or be called from unionfs_lookup() instead. I doubt this will simplify things per se, i.e., more than extracting the code to a helper function would do. Later on, as placeholders are implemented, no such copy should even be necessary, which makes apparent that unionfs_lookup() is not a good place to make that decision/undertake that action.

I think that's a good idea; ERELOOKUP is probably what we really want to use in most (all?) of the cases in which we currently use unionfs_relookup_*. There will be some penalty for making the vfs_syscall layer re-run the entire lookup instead of re-running only the last level, but those cases are never on the fast path anyway.

ERELOOKUP restarts the lookup at the latest reached directory, so not sure which penalty you are talking about.

Is that true? It's been a while since I've looked, but ISTR most if not all of the generic code in vfs_syscalls.c re-issuing the full namei() operation with the original path upon encountering ERELOOKUP.

sys/fs/unionfs/union_vnops.c
1271–1281

Good catch. I need to sit down and rethink this change, because it started as a fairly simple set of fixes for wild accesses to fdvp/fvp vnode private data but quickly spiraled into something much more complex.

Here, the potential recursion stems specifically from unionfs_copyfile(), which seems to assume the parent directory vnode is not locked and uses vfs_relookup() to temporarily lock it, while unionfs_mkshadowdir() assumes the parent directory vnode (though as you mention not necessarily the unionfs parent directory vnode) is locked and uses the unionfs_relookup() wrapper which does drop the lock to avoid recursion.

As much as I'd prefer your second option above, is it actually a realistic option without the broader redesign you've described of moving the copy-up completely out of endpoint VOPs like unionfs_rename()? The specific use of vfs_relookup() for the copy-up operations seems to be driven by the need to handle EEXIST, as we seem to expect the generic VFS machinery to handle that case rather than individual VOP_CREATE() implementations.

sys/fs/unionfs/union_vnops.c
1271–1281

Re-reading your comment, I think I may have misunderstood your second suggestion.

I read your suggestion as "create simplified versions of unionfs_copyfile()/unionfs_mkshadowdir() that can take a locked parent directory and do not need to relookup at all". But is your suggestion instead to write a simplified replacement for vfs_relookup() itself which can operate on an already-locked 'dp' and then use the new function for unionfs copy-up operations?

sys/fs/unionfs/union_vnops.c
1271–1281

Good catch. I need to sit down and rethink this change, because it started as a fairly simple set of fixes for wild accesses to fdvp/fvp vnode private data but quickly spiraled into something much more complex.

Yes... We might be able to carry this through without too much complication (keeping in mind that, while this would make unionfs_rename() much more correct, it is still far from being so overall), let's see.

Here, the potential recursion stems specifically from unionfs_copyfile(), which seems to assume the parent directory vnode is not locked and uses vfs_relookup() to temporarily lock it, while unionfs_mkshadowdir() assumes the parent directory vnode (though as you mention not necessarily the unionfs parent directory vnode) is locked and uses the unionfs_relookup() wrapper which does drop the lock to avoid recursion.

unionfs_copyfile() uses vfs_relookup() to perform a last component lookup in order to check whether a copy-up of the source already exists (well, this could instead be some independently created file, but I digress). I don't know how we ended up in the current situation, but I tend to think it's the existence of un_dvp (one of the root of most evils in unionfs) that made unionfs_copyfile() determine on its own the source directory vnode (and assume it is unlocked).

You're right that unionfs_mkshadowdir() expects the (upper layer) source directory vnode to be locked instead (another pre-existing bug in unionfs_rename(), since it doesn't guarantee that at the point of the call).

In order to avoid changing too many things, although I admit it's not pretty, I'd try adding a new parameter to unionfs_copyfile() which is the upper-layer source directory. If it's NULL, then proceed as before from un_dvp (and its un_uppervp), else use the passed upper-layer directory directly (and assume it is locked). unionfs_rename() would pass unionfs_copyfile() a locked rfdvp, while other callers would continue passing NULL.

As much as I'd prefer your second option above, is it actually a realistic option without the broader redesign you've described of moving the copy-up completely out of endpoint VOPs like unionfs_rename()? The specific use of vfs_relookup() for the copy-up operations seems to be driven by the need to handle EEXIST, as we seem to expect the generic VFS machinery to handle that case rather than individual VOP_CREATE() implementations.

I wasn't sure I was understanding the last sentence, but with this:

I read your suggestion as "create simplified versions of unionfs_copyfile()/unionfs_mkshadowdir() that can take a locked parent directory and do not need to relookup at all". But is your suggestion instead to write a simplified replacement for vfs_relookup() itself which can operate on an already-locked 'dp' and then use the new function for unionfs copy-up operations?

I think I'm grasping what you mean. Yes, I meant a replacement for vfs_relookup(), doing the relookup is "necessary" (well, it is not enough, but again I digress) to avoid crushing an existing file that could have been created in the meantime (or a concurrent copy-up). unionfs_relookup() might be where to do it; if so, unionfs_vn_create_on_upper() would call it instead of vfs_relookup().

sys/fs/unionfs/union_vnops.c
1271–1281

Good catch. I need to sit down and rethink this change, because it started as a fairly simple set of fixes for wild accesses to fdvp/fvp vnode private data but quickly spiraled into something much more complex.

Yes... We might be able to carry this through without too much complication (keeping in mind that, while this would make unionfs_rename() much more correct, it is still far from being so overall), let's see.

Here, the potential recursion stems specifically from unionfs_copyfile(), which seems to assume the parent directory vnode is not locked and uses vfs_relookup() to temporarily lock it, while unionfs_mkshadowdir() assumes the parent directory vnode (though as you mention not necessarily the unionfs parent directory vnode) is locked and uses the unionfs_relookup() wrapper which does drop the lock to avoid recursion.

unionfs_copyfile() uses vfs_relookup() to perform a last component lookup in order to check whether a copy-up of the source already exists (well, this could instead be some independently created file, but I digress). I don't know how we ended up in the current situation, but I tend to think it's the existence of un_dvp (one of the root of most evils in unionfs) that made unionfs_copyfile() determine on its own the source directory vnode (and assume it is unlocked).

Yes, un_dvp is really what enables the current mess. It looks as though this was meant for situations in which a copy-up is called for but a parent directory vnode isn't readily available in the VOP. Although I suspect there are some cases in which we use unionfs_copyfile() and there *is* a (locked) parent vnode, in which case we have more of the same recursion I inadvertently introduced here.

You're right that unionfs_mkshadowdir() expects the (upper layer) source directory vnode to be locked instead (another pre-existing bug in unionfs_rename(), since it doesn't guarantee that at the point of the call).

In order to avoid changing too many things, although I admit it's not pretty, I'd try adding a new parameter to unionfs_copyfile() which is the upper-layer source directory. If it's NULL, then proceed as before from un_dvp (and its un_uppervp), else use the passed upper-layer directory directly (and assume it is locked). unionfs_rename() would pass unionfs_copyfile() a locked rfdvp, while other callers would continue passing NULL.

The fix I'm testing is even a bit simpler than that; as you say, I want to avoid changing these internal helpers too much if I can help it, as you're ultimately going to rewrite or remove them anyway.

As much as I'd prefer your second option above, is it actually a realistic option without the broader redesign you've described of moving the copy-up completely out of endpoint VOPs like unionfs_rename()? The specific use of vfs_relookup() for the copy-up operations seems to be driven by the need to handle EEXIST, as we seem to expect the generic VFS machinery to handle that case rather than individual VOP_CREATE() implementations.

I wasn't sure I was understanding the last sentence, but with this:

I read your suggestion as "create simplified versions of unionfs_copyfile()/unionfs_mkshadowdir() that can take a locked parent directory and do not need to relookup at all". But is your suggestion instead to write a simplified replacement for vfs_relookup() itself which can operate on an already-locked 'dp' and then use the new function for unionfs copy-up operations?

I think I'm grasping what you mean. Yes, I meant a replacement for vfs_relookup(), doing the relookup is "necessary" (well, it is not enough, but again I digress) to avoid crushing an existing file that could have been created in the meantime (or a concurrent copy-up). unionfs_relookup() might be where to do it; if so, unionfs_vn_create_on_upper() would call it instead of vfs_relookup().

Instead of a simplified version of vfs_relookup(), would the ultimate solution instead be to follow your design and move copy-up (or placeholder creation) to a much earlier point in the VFS operation lifecycle? In other words, if the copy-up/placeholder operation could be done at a point close enough to the initial lookup that the parent directory vnode is still locked from that lookup, while the same lookup has also already determined that the child file does not exist on the upper FS, could the need to handle EEXIST (and therefore the need for relookup) be eliminated entirely?

  • Fix fdvp lock recursion during file copy-up; use ERELOOKUP to simplify
sys/fs/unionfs/union_vnops.c
1288

Given the automatic shadow directory creation we do in unionfs_lookup(), I wonder if this code is really necessary. Of course it's always possible for another thread to concurrently delete the newly-created directory, but should we instead just ERELOOKUP in that case? Is there some other corner case I'm just not thinking of?

In D44788#1022769, @jah wrote:
  • Fix fdvp lock recursion during file copy-up; use ERELOOKUP to simplify

Currently traveling without too much Internet connection, so not a thorough review, I'll try to come back to it this week-end.

I much prefer this version: Use of flags is indeed much clearer now, and ERELOOKUP simplifies things. I thought that @kib was referring to ERELOOKUP processed by vfs_lookup(), but it is also understood by kern_renameat(). I agree in general with this approach, although I don't think its current implementation in the generic code is satisfying (this should be documented in VOP_RENAME()'s contract; additionally, NFS directly calls VOP_RENAME() and doesn't support it; note to myself: we are lacking in consistency and reliability here).

Given the automatic shadow directory creation we do in unionfs_lookup(), I wonder if this code is really necessary. Of course it's always possible for another thread to concurrently delete the newly-created directory, but should we instead just ERELOOKUP in that case? Is there some other corner case I'm just not thinking of?

In the current framework, shadow directories are expected to be created at lookup, but this is something that will become configurable. IIRC, there are other places where unionfs doesn't try too hard to guarantee their existence, and simply fails with EROFS when there are not there (on the simplistic assumption that they couldn't be created at lookup time because the mount is read-only). I think however this is a liability, so would prefer that we keep that (and actually enforce this guarantee elsewhere, through proper modularization).

Instead of a simplified version of vfs_relookup(), would the ultimate solution instead be to follow your design and move copy-up (or placeholder creation) to a much earlier point in the VFS operation lifecycle? In other words, if the copy-up/placeholder operation could be done at a point close enough to the initial lookup that the parent directory vnode is still locked from that lookup, while the same lookup has also already determined that the child file does not exist on the upper FS, could the need to handle EEXIST (and therefore the need for relookup) be eliminated entirely?

Answer is similar here, it could be but ultimately, for reliability, the same copy-up code should exist in unionfs_rename() (and all these should be factorized in one place, as much as possible), at least in the current framework. Actually, for the directory rename case, it's even worse, since all files inside should be copied-up (whether plain copies or placeholders). In the end, to call VOP_RENAME() on the upper layer, you have to relock the destinations and unlock the source ones, there always will be a window of time where someone else could mess with your copy. This is a fundamental problem of the rename protocol (and probably, VFS locking in general). In the meantime, isn't it still better to keep preparing the upper layer as close as possible to the actual VOP_RENAME() call, as long as this doesn't complicate too much your work?

In D44788#1022769, @jah wrote:
  • Fix fdvp lock recursion during file copy-up; use ERELOOKUP to simplify

Currently traveling without too much Internet connection, so not a thorough review, I'll try to come back to it this week-end.

I much prefer this version: Use of flags is indeed much clearer now, and ERELOOKUP simplifies things. I thought that @kib was referring to ERELOOKUP processed by vfs_lookup(), but it is also understood by kern_renameat(). I agree in general with this approach, although I don't think its current implementation in the generic code is satisfying (this should be documented in VOP_RENAME()'s contract; additionally, NFS directly calls VOP_RENAME() and doesn't support it; note to myself: we are lacking in consistency and reliability here).

Given the automatic shadow directory creation we do in unionfs_lookup(), I wonder if this code is really necessary. Of course it's always possible for another thread to concurrently delete the newly-created directory, but should we instead just ERELOOKUP in that case? Is there some other corner case I'm just not thinking of?

In the current framework, shadow directories are expected to be created at lookup, but this is something that will become configurable. IIRC, there are other places where unionfs doesn't try too hard to guarantee their existence, and simply fails with EROFS when there are not there (on the simplistic assumption that they couldn't be created at lookup time because the mount is read-only). I think however this is a liability, so would prefer that we keep that (and actually enforce this guarantee elsewhere, through proper modularization).

Instead of a simplified version of vfs_relookup(), would the ultimate solution instead be to follow your design and move copy-up (or placeholder creation) to a much earlier point in the VFS operation lifecycle? In other words, if the copy-up/placeholder operation could be done at a point close enough to the initial lookup that the parent directory vnode is still locked from that lookup, while the same lookup has also already determined that the child file does not exist on the upper FS, could the need to handle EEXIST (and therefore the need for relookup) be eliminated entirely?

Answer is similar here, it could be but ultimately, for reliability, the same copy-up code should exist in unionfs_rename() (and all these should be factorized in one place, as much as possible), at least in the current framework. Actually, for the directory rename case, it's even worse, since all files inside should be copied-up (whether plain copies or placeholders). In the end, to call VOP_RENAME() on the upper layer, you have to relock the destinations and unlock the source ones, there always will be a window of time where someone else could mess with your copy. This is a fundamental problem of the rename protocol (and probably, VFS locking in general). In the meantime, isn't it still better to keep preparing the upper layer as close as possible to the actual VOP_RENAME() call, as long as this doesn't complicate too much your work?

Oh, definitely. I was just speculating if that could be done as a longer term goal (although you make a good point that correct handling of copy-up for non-empty directories may make this idea impractical anyway). After the change in this review, my only plans for unionfs are to try to get my WiP single-lock changes ready for review within the next couple of weeks, but of course we then still may decide those changes aren't worth taking before the big redesign. After that, my plan is to get out of your way so that you can work on the redesign.

I ran a 5 hour test with D44788.137374.patch added. I did not observe any new issues.

I don't think that, as it stands, the changes here now introduce new LORs or additional recursion, so good to go.

In D44788#1023360, @jah wrote:

Oh, definitely. I was just speculating if that could be done as a longer term goal (although you make a good point that correct handling of copy-up for non-empty directories may make this idea impractical anyway).

I'm not completely sure how to handle it. I didn't evoke the topic in the proposal document. Ideally, this would be handled atomically, so requires cooperation from the VFS layer. This seems to be tied to a change in the VFS layer's locking and access protocol, for which I don't have all the details ready. If this path is not workable, then "best effort", in the form of trying to ensure FS state is good for unionfs' rename (and other operations) to proceed, as much as possible (and this won't always be possible, e.g., if a directory in the lower layer becomes shadowed by a new upper layer non-directory file while an operation such as rename is happening on the directory initially seen through the union view), should be the minimum guarantee, and is equivalent to a different ordering of operations with each one seen as atomic for non-too-contorted cases.

After the change in this review, my only plans for unionfs are to try to get my WiP single-lock changes ready for review within the next couple of weeks, but of course we then still may decide those changes aren't worth taking before the big redesign. After that, my plan is to get out of your way so that you can work on the redesign.

No problem, a couple of weeks is fine, it's not as if we didn't have important problems in other areas, unfortunately. And I think your work is worth it, even if it were not integrated in the end, since you're bound to mostly face the same problems I'd have to anyway. This is a large size piece of work. Thanks for working on it.

This revision is now accepted and ready to land.Sun, Apr 21, 12:52 PM
This revision was automatically updated to reflect the committed changes.