Page MenuHomeFreeBSD

unionfs: cache upper/lower mount objects
ClosedPublic

Authored by jah on Feb 10 2024, 4:31 PM.
Tags
None
Referenced Files
Unknown Object (File)
Sat, Apr 27, 8:01 AM
Unknown Object (File)
Sat, Apr 27, 7:58 AM
Unknown Object (File)
Sat, Apr 27, 7:58 AM
Unknown Object (File)
Sat, Apr 27, 7:57 AM
Unknown Object (File)
Sat, Apr 27, 7:57 AM
Unknown Object (File)
Sat, Apr 27, 6:17 AM
Unknown Object (File)
Sat, Apr 27, 1:33 AM
Unknown Object (File)
Fri, Apr 26, 11:43 PM
Subscribers

Details

Summary

Store the upper/lower FS mount objects in unionfs per-mount data and
use these instead of the v_mount field of the upper/lower root
vnodes. Use of [vnode]->v_mount is unsafe in the presence of a
concurrent forced unmount. Furthermore, as described in the referenced
PR, it is unsafe even on the unionfs unmount path (which is required
to execute before any forced unmount of the lower FS), as ZFS rollback
may have obliterated the v_mount field of the upper or lower root
vnode.

PR: 275870
Reported by: Karlo Miličević <karlo98.m@gmail.com>
Tested by: Karlo Miličević <karlo98.m@gmail.com>

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

jah requested review of this revision.Feb 10 2024, 4:31 PM
kib added inline comments.
sys/fs/unionfs/union_vnops.c
767
838
This revision is now accepted and ready to land.Feb 11 2024, 6:26 AM
This revision now requires review to proceed.Feb 11 2024, 7:12 AM

I think this goes in the right direction long term also.

Longer term, do you have any thoughts on only supporting recursive unmounting, regardless of whether forced or not? This would eliminate the first reason evoked in the commit message.

This revision is now accepted and ready to land.Feb 11 2024, 5:36 PM
jah marked 2 inline comments as done.Feb 11 2024, 6:22 PM
In D43815#999912, @olce wrote:

I think this goes in the right direction long term also.

Longer term, do you have any thoughts on only supporting recursive unmounting, regardless of whether forced or not? This would eliminate the first reason evoked in the commit message.

Well, as it is today unmounting of the base FS is either recursive or it doesn't happen at all (i.e. the unmount attempt is rejected immediately because of the unionfs stacked atop the mount in question). I don't think it can work any other way, although I could see the default settings around recursive unmounts changing (maybe vfs.recursive_forced_unmount being enabled by default, or recursive unmounts even being allowed for the non-forced case as well). I don't have plans to change any of those defaults though.

For the changes here, you're right that the first reason isn't an issue as long as the unionfs vnode is locked when the [base_vp]->v_mount access happens, as the unionfs unmount can't complete while the lock is held which then prevents the base FS from being unmounted. But it's also easy to make a mistake there, e.g. in cases where the unionfs lock is temporarily dropped, so if the base mount objects need to be cached anyway because of the ZFS case then it makes sense to just use them everywhere.

In D43815#999937, @jah wrote:

Well, as it is today unmounting of the base FS is either recursive or it doesn't happen at all (i.e. the unmount attempt is rejected immediately because of the unionfs stacked atop the mount in question). I don't think it can work any other way, although I could see the default settings around recursive unmounts changing (maybe vfs.recursive_forced_unmount being enabled by default, or recursive unmounts even being allowed for the non-forced case as well). I don't have plans to change any of those defaults though.

I was asking because I was fearing that the unmount could proceed in the non-recursive case, but indeed it's impossible (handled by the !TAILQ_EMPTY(&mp->mnt_uppers) test in dounmount()). For the default value itself, for now I think it is fine as it is (prevents unwanted foot-shooting).

For the changes here, you're right that the first reason isn't an issue as long as the unionfs vnode is locked when the [base_vp]->v_mount access happens, as the unionfs unmount can't complete while the lock is held which then prevents the base FS from being unmounted. But it's also easy to make a mistake there, e.g. in cases where the unionfs lock is temporarily dropped, so if the base mount objects need to be cached anyway because of the ZFS case then it makes sense to just use them everywhere.

If one of the layer if forcibly unmounted, there isn't much point in continuing operation. But, given the first point above, that cannot even happen. So really the only case when v_mount can get NULL is the ZFS rollback's one (the layers' root vnodes can't be recycled since they are vrefed). Thinking more about it, always testing if these are alive and well is going to be inevitable going forward. But I'm fine with this change as it is for now.

In D43815#999937, @jah wrote:

Well, as it is today unmounting of the base FS is either recursive or it doesn't happen at all (i.e. the unmount attempt is rejected immediately because of the unionfs stacked atop the mount in question). I don't think it can work any other way, although I could see the default settings around recursive unmounts changing (maybe vfs.recursive_forced_unmount being enabled by default, or recursive unmounts even being allowed for the non-forced case as well). I don't have plans to change any of those defaults though.

I was asking because I was fearing that the unmount could proceed in the non-recursive case, but indeed it's impossible (handled by the !TAILQ_EMPTY(&mp->mnt_uppers) test in dounmount()). For the default value itself, for now I think it is fine as it is (prevents unwanted foot-shooting).

For the changes here, you're right that the first reason isn't an issue as long as the unionfs vnode is locked when the [base_vp]->v_mount access happens, as the unionfs unmount can't complete while the lock is held which then prevents the base FS from being unmounted. But it's also easy to make a mistake there, e.g. in cases where the unionfs lock is temporarily dropped, so if the base mount objects need to be cached anyway because of the ZFS case then it makes sense to just use them everywhere.

If one of the layer if forcibly unmounted, there isn't much point in continuing operation. But, given the first point above, that cannot even happen. So really the only case when v_mount can get NULL is the ZFS rollback's one (the layers' root vnodes can't be recycled since they are vrefed). Thinking more about it, always testing if these are alive and well is going to be inevitable going forward. But I'm fine with this change as it is for now.

This can indeed happen, despite the first point above. If a unionfs VOP ever temporarily drops its lock, another thread is free to stage a recursive forced unmount of both the unionfs and the base FS during this window. Moreover, it's easy for this to happen without unionfs even being aware of it: because unionfs shares its lock with the base FS, if a base FS VOP (forwarded by a unionfs VOP) needs to drop the lock temporarily (this is common e.g. for FFS operations that need to update metadata), the unionfs vnode may effectively be unlocked during that time. That last point is a particularly dangerous one; I have another pending set of changes to deal with the problems that can arise in that situation.

This is why I say it's easy to make a mistake in accessing [base vp]->v_mount at an unsafe time.

In D43815#1000214, @jah wrote:

If one of the layer if forcibly unmounted, there isn't much point in continuing operation. But, given the first point above, that cannot even happen. So really the only case when v_mount can get NULL is the ZFS rollback's one (the layers' root vnodes can't be recycled since they are vrefed). Thinking more about it, always testing if these are alive and well is going to be inevitable going forward. But I'm fine with this change as it is for now.

This can indeed happen, despite the first point above. If a unionfs VOP ever temporarily drops its lock, another thread is free to stage a recursive forced unmount of both the unionfs and the base FS during this window. Moreover, it's easy for this to happen without unionfs even being aware of it: because unionfs shares its lock with the base FS, if a base FS VOP (forwarded by a unionfs VOP) needs to drop the lock temporarily (this is common e.g. for FFS operations that need to update metadata), the unionfs vnode may effectively be unlocked during that time. That last point is a particularly dangerous one; I have another pending set of changes to deal with the problems that can arise in that situation.

This is why I say it's easy to make a mistake in accessing [base vp]->v_mount at an unsafe time.

I don't think it can. Given the first point above, there can't be any unmount of some layer (even forced) until the unionfs mount on top is unmounted. As the layers' root vnodes are vrefed(), they can't become doomed (since unmount of their own FS is prevented), and consequently their v_mount is never modified (barring the ZFS rollback case). This is independent of holding (or not) any vnode lock.

Which doesn't say that they aren't any problems of the sort that you're reporting in unionfs, it's just a different matter.

In D43815#1000214, @jah wrote:

If one of the layer if forcibly unmounted, there isn't much point in continuing operation. But, given the first point above, that cannot even happen. So really the only case when v_mount can get NULL is the ZFS rollback's one (the layers' root vnodes can't be recycled since they are vrefed). Thinking more about it, always testing if these are alive and well is going to be inevitable going forward. But I'm fine with this change as it is for now.

This can indeed happen, despite the first point above. If a unionfs VOP ever temporarily drops its lock, another thread is free to stage a recursive forced unmount of both the unionfs and the base FS during this window. Moreover, it's easy for this to happen without unionfs even being aware of it: because unionfs shares its lock with the base FS, if a base FS VOP (forwarded by a unionfs VOP) needs to drop the lock temporarily (this is common e.g. for FFS operations that need to update metadata), the unionfs vnode may effectively be unlocked during that time. That last point is a particularly dangerous one; I have another pending set of changes to deal with the problems that can arise in that situation.

This is why I say it's easy to make a mistake in accessing [base vp]->v_mount at an unsafe time.

I don't think it can. Given the first point above, there can't be any unmount of some layer (even forced) until the unionfs mount on top is unmounted. As the layers' root vnodes are vrefed(), they can't become doomed (since unmount of their own FS is prevented), and consequently their v_mount is never modified (barring the ZFS rollback case). This is independent of holding (or not) any vnode lock.

Which doesn't say that they aren't any problems of the sort that you're reporting in unionfs, it's just a different matter.

That's not true; vref() does nothing to prevent a forced unmount from dooming the vnode, only holding its lock does this. As such, if the lock needs to be transiently dropped for some reason and the timing is sufficiently unfortunate, the concurrent recursive forced unmount can first unmount unionfs (dooming the unionfs vnode) and then the base FS (dooming the lower/upper vnode). The held references prevent the vnodes from being recycled (but not doomed), but even this isn't foolproof: for example, in the course of being doomed, the unionfs vnode will drop its references on the lower/upper vnodes, at which point they may become unreferenced unless additional action is taken. Whatever caller invoked the unionfs VOP will of course still hold a reference on the unionfs vnode, but this does not automatically guarantee that references will be held on the underlying vnodes for the duration of the call, due to the aforementioned scenario.

I've seen this very thing happen during stress2 testing, but fortunately I don't think it's very hard to change unionfs to avoid running into trouble in this situation. That is what my upcoming change will do.

In D43815#1000340, @jah wrote:

I don't think it can. Given the first point above, there can't be any unmount of some layer (even forced) until the unionfs mount on top is unmounted. As the layers' root vnodes are vrefed(), they can't become doomed (since unmount of their own FS is prevented), and consequently their v_mount is never modified (barring the ZFS rollback case). This is independent of holding (or not) any vnode lock.

Which doesn't say that they aren't any problems of the sort that you're reporting in unionfs, it's just a different matter.

That's not true; vref() does nothing to prevent a forced unmount from dooming the vnode, only holding its lock does this. As such, if the lock needs to be transiently dropped for some reason and the timing is sufficiently unfortunate, the concurrent recursive forced unmount can first unmount unionfs (dooming the unionfs vnode) and then the base FS (dooming the lower/upper vnode). The held references prevent the vnodes from being recycled (but not doomed), but even this isn't foolproof: for example, in the course of being doomed, the unionfs vnode will drop its references on the lower/upper vnodes, at which point they may become unreferenced unless additional action is taken. Whatever caller invoked the unionfs VOP will of course still hold a reference on the unionfs vnode, but this does not automatically guarantee that references will be held on the underlying vnodes for the duration of the call, due to the aforementioned scenario.

There is a misunderstanding. I'm very well aware of what you are saying, as you should know. But this is not my point, which concerns the sentence "Use of [vnode]->v_mount is unsafe in the presence of a concurrent forced unmount." in the context of the current change. The bulk of the latter is modifications of unionfs_vfsops.c, which contains VFS operations, and not vnode ones. There are no vnodes involved there, except accessing the layers' root ones. And what I'm saying, and that I proved above is that v_mount on these, again in the context of a VFS operation, cannot be NULL because of a force unmount (if you disagree, then please show where you think there is a flaw in the reasoning).

For vnode operations (VOP), this is another story. But, again in the context of this change, I doubt there is any problem for the two changed VOPs (VOP_ACCESS() and VOP_GETATTR()). Did you stumble on one?

I've seen this very thing happen during stress2 testing, but fortunately I don't think it's very hard to change unionfs to avoid running into trouble in this situation. That is what my upcoming change will do.

I'm pretty sure I've spotted problems of this kind by code analysis. Great.

In D43815#1000340, @jah wrote:

I don't think it can. Given the first point above, there can't be any unmount of some layer (even forced) until the unionfs mount on top is unmounted. As the layers' root vnodes are vrefed(), they can't become doomed (since unmount of their own FS is prevented), and consequently their v_mount is never modified (barring the ZFS rollback case). This is independent of holding (or not) any vnode lock.

Which doesn't say that they aren't any problems of the sort that you're reporting in unionfs, it's just a different matter.

That's not true; vref() does nothing to prevent a forced unmount from dooming the vnode, only holding its lock does this. As such, if the lock needs to be transiently dropped for some reason and the timing is sufficiently unfortunate, the concurrent recursive forced unmount can first unmount unionfs (dooming the unionfs vnode) and then the base FS (dooming the lower/upper vnode). The held references prevent the vnodes from being recycled (but not doomed), but even this isn't foolproof: for example, in the course of being doomed, the unionfs vnode will drop its references on the lower/upper vnodes, at which point they may become unreferenced unless additional action is taken. Whatever caller invoked the unionfs VOP will of course still hold a reference on the unionfs vnode, but this does not automatically guarantee that references will be held on the underlying vnodes for the duration of the call, due to the aforementioned scenario.

There is a misunderstanding. I'm very well aware of what you are saying, as you should know. But this is not my point, which concerns the sentence "Use of [vnode]->v_mount is unsafe in the presence of a concurrent forced unmount." in the context of the current change. The bulk of the latter is modifications of unionfs_vfsops.c, which contains VFS operations, and not vnode ones. There are no vnodes involved there, except accessing the layers' root ones. And what I'm saying, and that I proved above is that v_mount on these, again in the context of a VFS operation, cannot be NULL because of a force unmount (if you disagree, then please show where you think there is a flaw in the reasoning).

Actually the assertion about VFS operations isn't entirely true either (mostly, but not entirely); see the vfs_unbusy() dance we do in unionfs_quotactl().
But saying this makes me realize I actually need to bring back the atomic_load there (albeit the load should be of ump->um_uppermp now).

Otherwise your assertion should be correct, and indeed I doubt the two read-only VOPs in question would have these locking issues in practice.
I think the source of the misunderstanding here is that I just didn't word the commit message very well. Really what I meant there is what I said in a previous comment here: If we need to cache the mount objects anyway, it's better to use them everywhere to avoid the pitfalls of potentially accessing ->v_mount when it's unsafe to do so.

For vnode operations (VOP), this is another story. But, again in the context of this change, I doubt there is any problem for the two changed VOPs (VOP_ACCESS() and VOP_GETATTR()). Did you stumble on one?

I've seen this very thing happen during stress2 testing, but fortunately I don't think it's very hard to change unionfs to avoid running into trouble in this situation. That is what my upcoming change will do.

I'm pretty sure I've spotted problems of this kind by code analysis. Great.

Restore volatile load from ump in quotactl()

This revision now requires review to proceed.Feb 13 2024, 12:30 PM
In D43815#1000687, @jah wrote:
In D43815#1000340, @jah wrote:

I don't think it can. Given the first point above, there can't be any unmount of some layer (even forced) until the unionfs mount on top is unmounted. As the layers' root vnodes are vrefed(), they can't become doomed (since unmount of their own FS is prevented), and consequently their v_mount is never modified (barring the ZFS rollback case). This is independent of holding (or not) any vnode lock.

Which doesn't say that they aren't any problems of the sort that you're reporting in unionfs, it's just a different matter.

That's not true; vref() does nothing to prevent a forced unmount from dooming the vnode, only holding its lock does this. As such, if the lock needs to be transiently dropped for some reason and the timing is sufficiently unfortunate, the concurrent recursive forced unmount can first unmount unionfs (dooming the unionfs vnode) and then the base FS (dooming the lower/upper vnode). The held references prevent the vnodes from being recycled (but not doomed), but even this isn't foolproof: for example, in the course of being doomed, the unionfs vnode will drop its references on the lower/upper vnodes, at which point they may become unreferenced unless additional action is taken. Whatever caller invoked the unionfs VOP will of course still hold a reference on the unionfs vnode, but this does not automatically guarantee that references will be held on the underlying vnodes for the duration of the call, due to the aforementioned scenario.

There is a misunderstanding. I'm very well aware of what you are saying, as you should know. But this is not my point, which concerns the sentence "Use of [vnode]->v_mount is unsafe in the presence of a concurrent forced unmount." in the context of the current change. The bulk of the latter is modifications of unionfs_vfsops.c, which contains VFS operations, and not vnode ones. There are no vnodes involved there, except accessing the layers' root ones. And what I'm saying, and that I proved above is that v_mount on these, again in the context of a VFS operation, cannot be NULL because of a force unmount (if you disagree, then please show where you think there is a flaw in the reasoning).

Actually the assertion about VFS operations isn't entirely true either (mostly, but not entirely); see the vfs_unbusy() dance we do in unionfs_quotactl().
But saying this makes me realize I actually need to bring back the atomic_load there (albeit the load should be of ump->um_uppermp now).

Otherwise your assertion should be correct, and indeed I doubt the two read-only VOPs in question would have these locking issues in practice.
I think the source of the misunderstanding here is that I just didn't word the commit message very well. Really what I meant there is what I said in a previous comment here: If we need to cache the mount objects anyway, it's better to use them everywhere to avoid the pitfalls of potentially accessing ->v_mount when it's unsafe to do so.

Thinking about it a little more, I should simply remove this part of the commit message. Accessing [base_vp]->v_mount does have risks, but any code that is subject to those risks is almost certainly going to face the same risks from careless access to ump->um_[upper|lower]mp (where 'ump' was obtained by a presumably-safe load of [unionfs_vp]->v_mount->mnt_data at the beginning of the call, as most of these operations do).

For vnode operations (VOP), this is another story. But, again in the context of this change, I doubt there is any problem for the two changed VOPs (VOP_ACCESS() and VOP_GETATTR()). Did you stumble on one?

I've seen this very thing happen during stress2 testing, but fortunately I don't think it's very hard to change unionfs to avoid running into trouble in this situation. That is what my upcoming change will do.

I'm pretty sure I've spotted problems of this kind by code analysis. Great.

In D43815#1000687, @jah wrote:

There is a misunderstanding. I'm very well aware of what you are saying, as you should know. But this is not my point, which concerns the sentence "Use of [vnode]->v_mount is unsafe in the presence of a concurrent forced unmount." in the context of the current change. The bulk of the latter is modifications of unionfs_vfsops.c, which contains VFS operations, and not vnode ones. There are no vnodes involved there, except accessing the layers' root ones. And what I'm saying, and that I proved above is that v_mount on these, again in the context of a VFS operation, cannot be NULL because of a force unmount (if you disagree, then please show where you think there is a flaw in the reasoning).

Actually the assertion about VFS operations isn't entirely true either (mostly, but not entirely); see the vfs_unbusy() dance we do in unionfs_quotactl().

It is, but of course as long as the guarantee provided by executing the VFS operation (in the form of vfs_busy()) isn't defeated by the necessity of doing a vfs_unbusy(). Which unionfs_quotactl() indeed does, as you rightfully point out.

In passing, that dance in sys_quotactl()is quite inelegant and shouldn't be hard to remove, but that is work for another day.

But saying this makes me realize I actually need to bring back the atomic_load there (albeit the load should be of ump->um_uppermp now).

Yes. And I also prefer that we use an atomic operation to access v_mount, although it is not strictly necessary.

In D43815#1000700, @jah wrote:

Thinking about it a little more, I should simply remove this part of the commit message. Accessing [base_vp]->v_mount does have risks, but any code that is subject to those risks is almost certainly going to face the same risks from careless access to ump->um_[upper|lower]mp (where 'ump' was obtained by a presumably-safe load of [unionfs_vp]->v_mount->mnt_data at the beginning of the call, as most of these operations do).

I think this part is indeed confusing as it is since, while it's generally true, it doesn't apply to the places changed by this commit, which prompted my initial responses to clarify what it meant.

This revision is now accepted and ready to land.Feb 13 2024, 2:09 PM
This revision was automatically updated to reflect the committed changes.