- Fix fdvp lock recursion during file copy-up; use ERELOOKUP to simplify
Today
Wed, Apr 17
Tue, Apr 16
Mon, Apr 15
In D44788#1021114, @olce wrote: The main problem that I see with these changes is that they lead to dropping support for FSes with non-recursive locking (see some of the inline comments). I think there are also some minor problems in the locking/relookup logic (see the inline comments as well).
Besides, unionfs_rename() still has numerous problems beyond locking, and I'm wondering whether it's worth it for everybody to pursue this direction before I've started the unionfs project, which will include an overhaul of its fundamentals. It will probably take less time to mostly rewrite it from there than to try to fix all these deficiencies, especially given that the most fundamental ones are not readily visible even with runs of stress2.
In D44788#1020876, @jah wrote: In D44788#1020860, @kib wrote: Could you try to (greatly) simplify unionfs rename by using ERELOOKUP? For instance, it can be split into two essentially independent cases: 1. need to copy fdvp from lower to upper (and return ERELOOKUP) 2. just directly call VOP_RENAME() on upper if the copy is not needed.
Splitting the need to copy-up before calling VOP_RENAME() is a necessity, independently of ERELOOKUP, to be able to restart/cancel an operation that is interrupted (by, e.g., a system crash). With ERELOOKUP, part of the code should go into or be called from unionfs_lookup() instead. I doubt this will simplify things per se, i.e., more than extracting the code to a helper function would do. Later on, as placeholders are implemented, no such copy should even be necessary, which makes apparent that unionfs_lookup() is not a good place to make that decision/undertake that action.
I think that's a good idea; ERELOOKUP is probably what we really want to use in most (all?) of the cases in which we currently use unionfs_relookup_*. There will be some penalty for making the vfs_syscall layer re-run the entire lookup instead of re-running only the last level, but those cases are never on the fast path anyway.
ERELOOKUP restarts the lookup at the latest directory reached, so I'm not sure which penalty you are talking about.
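For readers following along, the contract under discussion can be sketched in userspace. Everything below is invented for illustration (the ERELOOKUP value, the `copied_up` flag, and all function names); it is not the real unionfs or VFS code, only a toy model of kib's proposed split: copy up and return ERELOOKUP, or forward the rename, with the syscall layer retrying the whole lookup.

```c
#include <assert.h>
#include <stdbool.h>

#define ERELOOKUP 1024  /* stand-in value; the real errno is kernel-private */

static bool copied_up = false;  /* simulated state of the upper layer */

/* Simulated copy-up: after this, the file exists on the upper layer. */
static void do_copyup(void) { copied_up = true; }

/*
 * Toy VOP_RENAME-like operation, split as suggested:
 * 1. if the source still lives only on the lower layer, copy it up
 *    and return ERELOOKUP so the caller restarts the lookup;
 * 2. otherwise forward the rename directly to the upper layer.
 */
static int toy_union_rename(void)
{
    if (!copied_up) {
        do_copyup();
        return (ERELOOKUP);
    }
    return (0);  /* rename performed on the upper layer */
}

/* The syscall layer's retry loop: redo lookup + VOP until not ERELOOKUP. */
static int toy_kern_rename(void)
{
    int error;

    do {
        /* (a real kernel would redo namei() here before retrying) */
        error = toy_union_rename();
    } while (error == ERELOOKUP);
    return (error);
}
```

In this model the first pass copies up and asks for a relookup; the second pass succeeds, which is the "never on the fast path" penalty jah mentions.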
Sun, Apr 14
Sat, Apr 13
Tue, Apr 9
Sun, Apr 7
Wed, Apr 3
In D44601#1017057, @olce wrote: These changes plug holes indeed.
Side note: It's likely that I'll rewrite the whole lookup code at some point, so I'll have to test again for races. The problem with this kind of bug is that it is triggered only by rare races. We already have stress2, which is great, but it also relies on "chance". This makes me think that perhaps we could have some more systematic framework for triggering vnode dooming, let's say, at unlock. I'll probably explore that at some point.
Tue, Apr 2
Sun, Mar 24
Mar 16 2024
Code review feedback, also remove a nonsensical check from unionfs_link()
Mar 10 2024
Mar 4 2024
Feb 29 2024
Incorporate code review feedback from olce@
Feb 24 2024
This basically amounts to a generalized version of the mkdir()-specific fix I made last year in commit 93fe61afde72e6841251ea43551631c30556032d (of course in that commit I also inadvertently added a potential v_usecount ref leak on the new vnode). Or I guess it can be thought of as a tailored version of null_bypass().
Feb 23 2024
Only clear LK_SHARED
Only allow lkflags to be 0 when the corresponding vnode is NULL
In D44046#1004894, @kib wrote: In D44046#1004891, @jah wrote: In D44046#1004890, @kib wrote: So might we just allow zero flags if the corresponding vp is NULL?
Sure, we could do that, but I'm curious: is there some reason why we should care what the lockflags are if there is no vnode to lock? What I have here seems more straightforward than making specific allowances for NULL vnodes.
My concern is reliable checking of the API contract. Assume that some function calls vn_lock_pair() with externally-specified flags, and the corresponding vp could sometimes be NULL. I want such calls to always have correct flags, especially if vp != NULL is rare or cannot be easily exercised by normal testing.
In D44046#1004867, @kib wrote: May I ask why? This allows arbitrary flags to be passed for null vnodes.
This is needed for upcoming work to adopt VOP_UNP_* in unionfs.
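The contract that was settled on ("only allow lkflags to be 0 when the corresponding vnode is NULL") can be illustrated with a toy validator. The LK_* values and the function name below are invented for this sketch; only the rule itself comes from the discussion above.

```c
#include <assert.h>
#include <stddef.h>

#define LK_SHARED    0x01  /* invented values for this sketch */
#define LK_EXCLUSIVE 0x02

struct vnode;  /* opaque; never dereferenced here */

/*
 * Toy version of the vn_lock_pair() argument check discussed above:
 * a zero flags word is acceptable only when the corresponding vnode
 * is NULL; a non-NULL vnode must request exactly one lock type.
 */
static int toy_lkflags_ok(struct vnode *vp, int flags)
{
    if (vp == NULL)
        return (flags == 0 || flags == LK_SHARED || flags == LK_EXCLUSIVE);
    return (flags == LK_SHARED || flags == LK_EXCLUSIVE);
}
```

This is the shape of check kib asks for: callers passing externally-specified flags get caught even when the vp != NULL case is rarely exercised.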
Feb 18 2024
Feb 13 2024
In D43815#1000687, @jah wrote: In D43815#1000600, @olce wrote: In D43815#1000340, @jah wrote: In D43815#1000302, @olce wrote: I don't think it can. Given the first point above, there can't be any unmount of some layer (even forced) until the unionfs mount on top is unmounted. As the layers' root vnodes are vref()ed, they can't become doomed (since unmount of their own FS is prevented), and consequently their v_mount is never modified (barring the ZFS rollback case). This is independent of holding (or not) any vnode lock.
Which doesn't say that they aren't any problems of the sort that you're reporting in unionfs, it's just a different matter.
That's not true; vref() does nothing to prevent a forced unmount from dooming the vnode, only holding its lock does this. As such, if the lock needs to be transiently dropped for some reason and the timing is sufficiently unfortunate, the concurrent recursive forced unmount can first unmount unionfs (dooming the unionfs vnode) and then the base FS (dooming the lower/upper vnode). The held references prevent the vnodes from being recycled (but not doomed), but even this isn't foolproof: for example, in the course of being doomed, the unionfs vnode will drop its references on the lower/upper vnodes, at which point they may become unreferenced unless additional action is taken. Whatever caller invoked the unionfs VOP will of course still hold a reference on the unionfs vnode, but this does not automatically guarantee that references will be held on the underlying vnodes for the duration of the call, due to the aforementioned scenario.
There is a misunderstanding. I'm very well aware of what you are saying, as you should know. But this is not my point, which concerns the sentence "Use of [vnode]->v_mount is unsafe in the presence of a concurrent forced unmount." in the context of the current change. The bulk of the latter is modifications of unionfs_vfsops.c, which contains VFS operations, and not vnode ones. There are no vnodes involved there, except accessing the layers' root ones. And what I'm saying, and that I proved above is that v_mount on these, again in the context of a VFS operation, cannot be NULL because of a force unmount (if you disagree, then please show where you think there is a flaw in the reasoning).
Actually the assertion about VFS operations isn't entirely true either (mostly, but not entirely); see the vfs_unbusy() dance we do in unionfs_quotactl().
But saying this makes me realize I actually need to bring back the atomic_load there (albeit the load should be of ump->um_uppermp now). Otherwise your assertion should be correct, and indeed I doubt the two read-only VOPs in question would have these locking issues in practice.
I think the source of the misunderstanding here is that I just didn't word the commit message very well. Really what I meant there is what I said in a previous comment here: If we need to cache the mount objects anyway, it's better to use them everywhere to avoid the pitfalls of potentially accessing ->v_mount when it's unsafe to do so.
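The hazard jah describes (a reference prevents recycling but not dooming, so any code that transiently drops the vnode lock must recheck the vnode's state after relocking) can be modeled with a toy vnode. All names and the single-threaded "concurrent unmount" call below are invented for illustration; the real kernel check is against VN_IS_DOOMED() after relocking.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy vnode: refs keep the memory alive; "doomed" models VIRF_DOOMED. */
struct toy_vnode {
    int  refs;
    bool locked;
    bool doomed;
};

static void toy_lock(struct toy_vnode *vp)   { vp->locked = true; }
static void toy_unlock(struct toy_vnode *vp) { vp->locked = false; }

/* A forced unmount can doom the vnode whenever it is unlocked. */
static void toy_forced_unmount(struct toy_vnode *vp)
{
    if (!vp->locked)
        vp->doomed = true;
}

/*
 * The safe pattern from the discussion: an operation that must drop
 * the lock transiently cannot assume the vnode is still alive after
 * relocking; it must recheck the doomed state and bail out.
 */
static int toy_op(struct toy_vnode *vp)
{
    toy_unlock(vp);            /* e.g. to avoid a lock-order reversal */
    toy_forced_unmount(vp);    /* concurrent recursive unmount sneaks in */
    toy_lock(vp);
    if (vp->doomed)
        return (-1);           /* caller must handle the dead vnode */
    return (0);
}
```

The held reference (refs == 1) never prevented the dooming; only the lock did, which is exactly why accessing ->v_mount across a lock drop is unsafe.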
Restore volatile load from ump in quotactl()
Feb 12 2024
In D43815#1000302, @olce wrote: In D43815#1000214, @jah wrote: In D43815#1000171, @olce wrote: If one of the layers is forcibly unmounted, there isn't much point in continuing operation. But, given the first point above, that cannot even happen. So really the only case where v_mount can become NULL is the ZFS rollback one (the layers' root vnodes can't be recycled since they are vref()ed). Thinking more about it, always testing whether these are alive and well is going to be inevitable going forward. But I'm fine with this change as it is for now.
This can indeed happen, despite the first point above. If a unionfs VOP ever temporarily drops its lock, another thread is free to stage a recursive forced unmount of both the unionfs and the base FS during this window. Moreover, it's easy for this to happen without unionfs even being aware of it: because unionfs shares its lock with the base FS, if a base FS VOP (forwarded by a unionfs VOP) needs to drop the lock temporarily (this is common e.g. for FFS operations that need to update metadata), the unionfs vnode may effectively be unlocked during that time. That last point is a particularly dangerous one; I have another pending set of changes to deal with the problems that can arise in that situation.
This is why I say it's easy to make a mistake in accessing [base vp]->v_mount at an unsafe time.
I don't think it can. Given the first point above, there can't be any unmount of some layer (even forced) until the unionfs mount on top is unmounted. As the layers' root vnodes are vrefed(), they can't become doomed (since unmount of their own FS is prevented), and consequently their v_mount is never modified (barring the ZFS rollback case). This is independent of holding (or not) any vnode lock.
Which doesn't say that they aren't any problems of the sort that you're reporting in unionfs, it's just a different matter.
In D43815#1000171, @olce wrote: In D43815#999937, @jah wrote: Well, as it is today, unmounting of the base FS is either recursive or it doesn't happen at all (i.e. the unmount attempt is rejected immediately because of the unionfs stacked atop the mount in question). I don't think it can work any other way, although I could see the default settings around recursive unmounts changing (maybe vfs.recursive_forced_unmount being enabled by default, or recursive unmounts even being allowed for the non-forced case as well). I don't have plans to change any of those defaults though.
I was asking because I was fearing that the unmount could proceed in the non-recursive case, but indeed it's impossible (handled by the !TAILQ_EMPTY(&mp->mnt_uppers) test in dounmount()). For the default value itself, for now I think it is fine as it is (prevents unwanted foot-shooting).
For the changes here, you're right that the first reason isn't an issue as long as the unionfs vnode is locked when the [base_vp]->v_mount access happens, as the unionfs unmount can't complete while the lock is held which then prevents the base FS from being unmounted. But it's also easy to make a mistake there, e.g. in cases where the unionfs lock is temporarily dropped, so if the base mount objects need to be cached anyway because of the ZFS case then it makes sense to just use them everywhere.
If one of the layers is forcibly unmounted, there isn't much point in continuing operation. But, given the first point above, that cannot even happen. So really the only case where v_mount can become NULL is the ZFS rollback one (the layers' root vnodes can't be recycled since they are vref()ed). Thinking more about it, always testing whether these are alive and well is going to be inevitable going forward. But I'm fine with this change as it is for now.
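The gate olce refers to (the !TAILQ_EMPTY(&mp->mnt_uppers) test in dounmount()) amounts to: an unmount of a base FS with a unionfs stacked above it either proceeds recursively, unmounting the unionfs first, or is rejected. A toy model with invented names, using the real EBUSY errno for the rejection:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy mount: has_uppers models !TAILQ_EMPTY(&mp->mnt_uppers). */
struct toy_mount {
    bool has_uppers;  /* a unionfs is stacked on top of this FS */
};

/*
 * Sketch of the dounmount() gate discussed above: with an upper
 * mount present, a non-recursive unmount is refused outright, and a
 * recursive one tears down the stacked unionfs before proceeding.
 */
static int toy_dounmount(struct toy_mount *mp, bool recursive)
{
    if (mp->has_uppers) {
        if (!recursive)
            return (EBUSY);
        mp->has_uppers = false;  /* unmount the stacked unionfs first */
    }
    return (0);  /* now unmount mp itself */
}
```

This is why, as jah notes, unmounting of the base FS "is either recursive or it doesn't happen at all" as long as the unionfs is mounted.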
In D43818#999955, @olce wrote: OK as a workaround. Hopefully, we'll get OpenZFS fixed soon. If you don't plan to, I may try to submit a patch upstream, since it seems no one has proposed any change in https://github.com/openzfs/zfs/issues/15705.
Feb 11 2024
In D43815#999912, @olce wrote: I think this goes in the right direction long term also.
Longer term, do you have any thoughts on only supporting recursive unmounting, regardless of whether forced or not? This would eliminate the first reason evoked in the commit message.
Update comment
Sadly my attempt at something less hacky didn't really improve things.
Putting this on hold, as I'm evaluating a less-hacky approach.
Feb 10 2024
Also filed https://github.com/openzfs/zfs/issues/15705, as I think that would benefit OpenZFS as well.
Jan 2 2024
Dec 24 2023
Dec 1 2023
Apply code review feedback from markj
Nov 30 2023
Nov 24 2023
Eliminate extraneous call to vm_phys_find_range()
Nov 23 2023
Avoid allocation in the ERANGE case, assert that return status is ENOMEM if not 0/ERANGE.
Nov 21 2023
Nov 16 2023
Nov 15 2023
Nov 13 2023
Nov 12 2023
Nov 4 2023
Oct 2 2023
From the original PR it also sounds as though this sort of refcounting issue is a common problem with drivers that use the clone facility? Could clone_create() be changed to automatically add the reference to an existing device, or perhaps a wrapper around clone_create() that does this automatically? Or would that merely create different complications elsewhere?
In D42008#958212, @kib wrote: Devfs clones are a way to handle (reserve) unit numbers. It seems that phk decided that the least involved way to code it was to just keep the whole cdev with the unit number somewhere (on the clone list). These clones are not referenced; they exist by the mere fact of being on the clone list. When a device driver allocates a clone, it must make it fully correct, including the ref count.
References on a cdev protect the device memory from being freed; they do not determine the lifecycle of the device. A device is created with make_dev() and destroyed with destroy_dev(); the latter does not free the memory and does not even drop a reference. Devfs nodes are managed outside the driver context, by a combination of the dev_clone eventhandler and the devfs_populate_loop() top-level code. The eventhandler is supposed to return the device with an additional reference to protect against a parallel populate loop, and the loop is the code which usually drops the last ref on a destroyed (in the destroy_dev() sense) device.
So a typical driver does not need to manage dev_ref()/dev_rel() except for initial device creation, where clones and the dev_clone context add some complications.
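The wrapper jah asks about could be sketched roughly as below. Everything here is invented for illustration (there is no such KPI; the toy types stand in for struct cdev and the clone list): the point is simply that a wrapper taking the reference on behalf of the caller would make the eventhandler contract harder to get wrong.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct cdev parked on the clone list. */
struct toy_cdev {
    int unit;
    int refs;
};

#define TOY_NUNITS 4
static struct toy_cdev toy_clone_list[TOY_NUNITS];

/* Toy clone lookup/creation: returns the clone for a unit, UNreferenced,
 * mimicking the fact that list membership alone keeps clones alive. */
static struct toy_cdev *toy_clone_create(int unit)
{
    if (unit < 0 || unit >= TOY_NUNITS)
        return (NULL);
    toy_clone_list[unit].unit = unit;
    return (&toy_clone_list[unit]);
}

/*
 * Hypothetical wrapper: look up (or create) the clone and take the
 * reference the dev_clone eventhandler is expected to return with,
 * so individual drivers cannot forget it.
 */
static struct toy_cdev *toy_clone_acquire(int unit)
{
    struct toy_cdev *dev = toy_clone_create(unit);

    if (dev != NULL)
        dev->refs++;  /* protects against a parallel populate loop */
    return (dev);
}
```

Whether such a wrapper would merely move the complications elsewhere, as jah wonders, is exactly the open question in the thread.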
Sep 28 2023
I've never used the clone KPIs before, so please forgive my ignorance in asking a couple of basic questions:
Sep 21 2023
Jul 24 2023
In D40883#931131, @mjg wrote: huh, you just made me realize the committed change is buggy in that it fails to unlock dvp. i'll fix it up soon.
Jul 7 2023
Looks like a similar cleanup can be done in the needs_exclusive_leaf case at the end of vfs_lookup().