Feb 23 2024
Feb 18 2024
Feb 13 2024
In D43815#1000687, @jah wrote:
In D43815#1000600, @olce wrote:
In D43815#1000340, @jah wrote:
In D43815#1000302, @olce wrote:
I don't think it can. Given the first point above, there can't be any unmount of some layer (even forced) until the unionfs mount on top is unmounted. As the layers' root vnodes are vrefed(), they can't become doomed (since unmount of their own FS is prevented), and consequently their v_mount is never modified (barring the ZFS rollback case). This is independent of holding (or not) any vnode lock.
Which doesn't say that there aren't any problems of the sort that you're reporting in unionfs; it's just a different matter.
That's not true; vref() does nothing to prevent a forced unmount from dooming the vnode, only holding its lock does this. As such, if the lock needs to be transiently dropped for some reason and the timing is sufficiently unfortunate, the concurrent recursive forced unmount can first unmount unionfs (dooming the unionfs vnode) and then the base FS (dooming the lower/upper vnode). The held references prevent the vnodes from being recycled (but not doomed), but even this isn't foolproof: for example, in the course of being doomed, the unionfs vnode will drop its references on the lower/upper vnodes, at which point they may become unreferenced unless additional action is taken. Whatever caller invoked the unionfs VOP will of course still hold a reference on the unionfs vnode, but this does not automatically guarantee that references will be held on the underlying vnodes for the duration of the call, due to the aforementioned scenario.
There is a misunderstanding. I'm very well aware of what you are saying, as you should know. But this is not my point, which concerns the sentence "Use of [vnode]->v_mount is unsafe in the presence of a concurrent forced unmount." in the context of the current change. The bulk of the latter is modifications to unionfs_vfsops.c, which contains VFS operations, not vnode ones. There are no vnodes involved there, except for accessing the layers' root ones. And what I'm saying, and what I proved above, is that v_mount on these, again in the context of a VFS operation, cannot be NULL because of a forced unmount (if you disagree, then please show where you think there is a flaw in the reasoning).
Actually the assertion about VFS operations isn't entirely true either (mostly, but not entirely); see the vfs_unbusy() dance we do in unionfs_quotactl().
But saying this makes me realize I actually need to bring back the atomic_load there (albeit the load should be of ump->um_uppermp now).

Otherwise your assertion should be correct, and indeed I doubt the two read-only VOPs in question would have these locking issues in practice.
I think the source of the misunderstanding here is that I just didn't word the commit message very well. Really what I meant there is what I said in a previous comment here: If we need to cache the mount objects anyway, it's better to use them everywhere to avoid the pitfalls of potentially accessing ->v_mount when it's unsafe to do so.
Restore volatile load from ump in quotactl()
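As a rough illustration of the pattern referred to above (the "vfs_unbusy() dance" and the restored atomic load of ump->um_uppermp), here is a minimal sketch. It is not the actual unionfs_quotactl() code; example_forward_to_upper() and its error handling are invented for illustration.

```
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/mount.h>

#include <fs/unionfs/union.h>

/*
 * Illustrative sketch only: read the cached upper mount with an explicit
 * atomic load and take our own reference on it instead of reaching it
 * through a vnode's v_mount, because the unionfs mount itself is unbusied
 * below and may be torn down concurrently.
 */
static int
example_forward_to_upper(struct mount *mp)
{
	struct unionfs_mount *ump;
	struct mount *uppermp;
	int error;

	ump = MOUNTTOUNIONFSMOUNT(mp);
	/* Load um_uppermp while the unionfs mount is still busied. */
	uppermp = atomic_load_ptr(&ump->um_uppermp);
	vfs_ref(uppermp);
	vfs_unbusy(mp);

	error = vfs_busy(uppermp, 0);
	if (error == 0) {
		/* ...forward the actual operation to uppermp here... */
		vfs_unbusy(uppermp);
	}
	vfs_rel(uppermp);
	return (error);
}
```

Whether the forwarded operation needs vfs_busy() on the upper mount (as opposed to only vfs_ref()) depends on the operation itself; the point is simply that the mount pointer comes from the cached ump field rather than from [vnode]->v_mount.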
In D43815#1000600, @olce wrote:
In D43815#1000340, @jah wrote:
In D43815#1000302, @olce wrote:
I don't think it can. Given the first point above, there can't be any unmount of some layer (even forced) until the unionfs mount on top is unmounted. As the layers' root vnodes are vrefed(), they can't become doomed (since unmount of their own FS is prevented), and consequently their v_mount is never modified (barring the ZFS rollback case). This is independent of holding (or not) any vnode lock.
Which doesn't say that there aren't any problems of the sort that you're reporting in unionfs; it's just a different matter.
That's not true; vref() does nothing to prevent a forced unmount from dooming the vnode, only holding its lock does this. As such, if the lock needs to be transiently dropped for some reason and the timing is sufficiently unfortunate, the concurrent recursive forced unmount can first unmount unionfs (dooming the unionfs vnode) and then the base FS (dooming the lower/upper vnode). The held references prevent the vnodes from being recycled (but not doomed), but even this isn't foolproof: for example, in the course of being doomed, the unionfs vnode will drop its references on the lower/upper vnodes, at which point they may become unreferenced unless additional action is taken. Whatever caller invoked the unionfs VOP will of course still hold a reference on the unionfs vnode, but this does not automatically guarantee that references will be held on the underlying vnodes for the duration of the call, due to the aforementioned scenario.
There is a misunderstanding. I'm very well aware of what you are saying, as you should know. But this is not my point, which concerns the sentence "Use of [vnode]->v_mount is unsafe in the presence of a concurrent forced unmount." in the context of the current change. The bulk of the latter is modifications to unionfs_vfsops.c, which contains VFS operations, not vnode ones. There are no vnodes involved there, except for accessing the layers' root ones. And what I'm saying, and what I proved above, is that v_mount on these, again in the context of a VFS operation, cannot be NULL because of a forced unmount (if you disagree, then please show where you think there is a flaw in the reasoning).
Feb 12 2024
In D43815#1000302, @olce wrote:
In D43815#1000214, @jah wrote:
In D43815#1000171, @olce wrote:
If one of the layers is forcibly unmounted, there isn't much point in continuing operation. But, given the first point above, that cannot even happen. So really the only case in which v_mount can become NULL is the ZFS rollback one (the layers' root vnodes can't be recycled since they are vrefed). Thinking more about it, always testing whether these are alive and well is going to be inevitable going forward. But I'm fine with this change as it is for now.
This can indeed happen, despite the first point above. If a unionfs VOP ever temporarily drops its lock, another thread is free to stage a recursive forced unmount of both the unionfs and the base FS during this window. Moreover, it's easy for this to happen without unionfs even being aware of it: because unionfs shares its lock with the base FS, if a base FS VOP (forwarded by a unionfs VOP) needs to drop the lock temporarily (this is common e.g. for FFS operations that need to update metadata), the unionfs vnode may effectively be unlocked during that time. That last point is a particularly dangerous one; I have another pending set of changes to deal with the problems that can arise in that situation.
This is why I say it's easy to make a mistake in accessing [base vp]->v_mount at an unsafe time.
I don't think it can. Given the first point above, there can't be any unmount of some layer (even forced) until the unionfs mount on top is unmounted. As the layers' root vnodes are vrefed(), they can't become doomed (since unmount of their own FS is prevented), and consequently their v_mount is never modified (barring the ZFS rollback case). This is independent of holding (or not) any vnode lock.
Which doesn't say that there aren't any problems of the sort that you're reporting in unionfs; it's just a different matter.
In D43815#1000171, @olce wrote:
In D43815#999937, @jah wrote:
Well, as it is today unmounting of the base FS is either recursive or it doesn't happen at all (i.e. the unmount attempt is rejected immediately because of the unionfs stacked atop the mount in question). I don't think it can work any other way, although I could see the default settings around recursive unmounts changing (maybe vfs.recursive_forced_unmount being enabled by default, or recursive unmounts even being allowed for the non-forced case as well). I don't have plans to change any of those defaults though.
I was asking because I feared that the unmount could proceed in the non-recursive case, but indeed it's impossible (handled by the !TAILQ_EMPTY(&mp->mnt_uppers) test in dounmount()). As for the default value itself, for now I think it is fine as it is (it prevents unwanted foot-shooting).
For the changes here, you're right that the first reason isn't an issue as long as the unionfs vnode is locked when the [base_vp]->v_mount access happens, as the unionfs unmount can't complete while the lock is held which then prevents the base FS from being unmounted. But it's also easy to make a mistake there, e.g. in cases where the unionfs lock is temporarily dropped, so if the base mount objects need to be cached anyway because of the ZFS case then it makes sense to just use them everywhere.
If one of the layers is forcibly unmounted, there isn't much point in continuing operation. But, given the first point above, that cannot even happen. So really the only case in which v_mount can become NULL is the ZFS rollback one (the layers' root vnodes can't be recycled since they are vrefed). Thinking more about it, always testing whether these are alive and well is going to be inevitable going forward. But I'm fine with this change as it is for now.
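To make the lock-drop window discussed in the preceding comments concrete, here is a hypothetical caller-side sketch (not code from this patch; example_mount_after_relock() is an invented name): a held reference keeps the vnode from being recycled, but only the lock keeps a concurrent recursive forced unmount from dooming it, so v_mount has to be re-validated after any relock.

```
#include <sys/param.h>
#include <sys/lock.h>
#include <sys/lockmgr.h>
#include <sys/mount.h>
#include <sys/vnode.h>

/*
 * Hypothetical helper: return vp->v_mount only if vp survived an
 * unlock/relock cycle without being doomed.  The caller is assumed to hold
 * its own reference on vp.
 */
static struct mount *
example_mount_after_relock(struct vnode *vp)
{
	ASSERT_VOP_ELOCKED(vp, __func__);
	vref(vp);		/* prevents recycling, but not dooming */
	VOP_UNLOCK(vp);
	/*
	 * Window: a recursive forced unmount can run here and doom vp,
	 * clearing its v_mount, despite the reference taken above.
	 */
	vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
	vrele(vp);
	if (VN_IS_DOOMED(vp))
		return (NULL);	/* v_mount can no longer be trusted */
	return (vp->v_mount);
}
```

This is the same motivation for caching the base mount objects in the change under review: anything derived from v_mount across a point where the lock may be dropped has to be re-validated.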
In D43818#999955, @olce wrote:
OK as a workaround. Hopefully, we'll get OpenZFS fixed soon. If you don't plan to, I may try to submit a patch upstream, since it seems no one has proposed any change in https://github.com/openzfs/zfs/issues/15705.
Feb 11 2024
In D43815#999912, @olce wrote:
I think this goes in the right direction long term also.
Longer term, do you have any thoughts on only supporting recursive unmounting, regardless of whether it is forced or not? This would eliminate the first reason mentioned in the commit message.
Update comment
Sadly my attempt at something less hacky didn't really improve things.
Putting this on hold, as I'm evaluating a less-hacky approach.
Feb 10 2024
Also filed https://github.com/openzfs/zfs/issues/15705, as I think that would benefit OpenZFS as well.
Jan 2 2024
Dec 24 2023
Dec 1 2023
Apply code review feedback from markj
Nov 30 2023
Nov 24 2023
Eliminate extraneous call to vm_phys_find_range()
Nov 23 2023
Avoid allocation in the ERANGE case, assert that return status is ENOMEM if not 0/ERANGE.
Nov 21 2023
Nov 16 2023
Nov 15 2023
Nov 13 2023
Nov 12 2023
Nov 4 2023
Oct 2 2023
From the original PR it also sounds as though this sort of refcounting issue is a common problem with drivers that use the clone facility? Could clone_create() be changed to automatically add the reference to an existing device, or perhaps a wrapper around clone_create() that does this automatically? Or would that merely create different complications elsewhere?
In D42008#958212, @kib wrote:
Devfs clones are a way to handle (reserve) unit numbers. It seems that phk decided that the least involved way to code it is to just keep the whole cdev with the unit number somewhere (on the clone list). These clones are not referenced; they exist by the mere fact of being on the clone list. When a device driver allocates a clone, it must make it fully correct, including the ref count.
References on a cdev protect against freeing the device memory; they do not determine the lifecycle of the device. A device is created with make_dev() and destroyed with destroy_dev(); the latter does not free the memory and does not even drop a reference. Devfs nodes are managed outside the driver context, by a combination of the dev_clone eventhandler and the devfs_populate_loop() top-level code. The eventhandler is supposed to return the device with an additional reference to protect against the parallel populate loop, and the loop is the code which usually drops the last reference on a destroyed (in the destroy_dev() sense) device.
So a typical driver does not need to manage dev_ref()/dev_rel() except for initial device creation, where clones and the dev_clone context add some complications.
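As a concrete example of the rule described above, here is a rough sketch of a dev_clone eventhandler for a hypothetical "foo" driver. The names foo_clone, foo_clones, and foo_cdevsw are invented, and the sketch assumes, per the question above, that clone_create() does not itself add a reference when it hands back an existing clone.

```
#include <sys/param.h>
#include <sys/systm.h>
#include <sys/conf.h>
#include <sys/ucred.h>

static struct clonedevs *foo_clones;	/* set up with clone_setup(&foo_clones) */
static struct cdevsw foo_cdevsw;	/* hypothetical driver cdevsw */

/*
 * Hypothetical dev_clone eventhandler.  Per the explanation above, the
 * handler must hand back a *referenced* cdev so the parallel devfs populate
 * loop cannot drop the last reference underneath us.  Simplified: only
 * fully-specified "fooN" names are handled.
 */
static void
foo_clone(void *arg, struct ucred *cred, char *name, int namelen,
    struct cdev **dev)
{
	int unit;

	if (*dev != NULL)
		return;		/* another handler already resolved the name */
	if (dev_stdclone(name, NULL, "foo", &unit) != 1)
		return;		/* not one of our device names */
	/* A non-zero return means no existing clone: create a new cdev. */
	if (clone_create(&foo_clones, &foo_cdevsw, &unit, dev, 0))
		*dev = make_dev_credf(MAKEDEV_REF, &foo_cdevsw, unit, cred,
		    UID_ROOT, GID_WHEEL, 0600, "foo%d", unit);
	else
		dev_ref(*dev);	/* existing clone: add the reference ourselves */
}
```

Such a handler would typically be registered with EVENTHANDLER_REGISTER(dev_clone, ...) after initializing foo_clones with clone_setup(), and the clone list torn down with clone_cleanup() at unload.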
Sep 28 2023
I've never used the clone KPIs before, so please forgive my ignorance in asking a couple of basic questions:
Sep 21 2023
Jul 24 2023
In D40883#931131, @mjg wrote:
huh, you just made me realize the committed change is buggy in that it fails to unlock dvp. i'll fix it up soon.
Jul 7 2023
Looks like a similar cleanup can be done in the needs_exclusive_leaf case at the end of vfs_lookup().
Jul 3 2023
Jun 22 2023
- Remove extraneous vhold()
Jun 20 2023
- Return the write sequence on coveredvp to the correct place, replace
Jun 19 2023
Right now this change is just a proposal. I've successfully run the unionfs and nullfs stress2 tests against it, and have also been running it on my -current machine for the last couple of months.
Of course it's entirely possible I've missed something that would make this change unworkable. But if you guys think this approach has merit, then I'll finish this patch by doing the following:
May 7 2023
May 6 2023
May 2 2023
May 1 2023
Apr 24 2023
Apr 23 2023
Apr 18 2023
Apr 10 2023
Mar 28 2023
- vfs_lookup(): re-check v_mountedhere on lock upgrade
In D39272#894822, @jah wrote:
In D39272#894669, @pho wrote:
I ran into this problem:
Fatal trap 12: page fault while in kernel mode
cpuid = 2; apic id = 02
fault virtual address = 0x10
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff82753417
stack pointer = 0x0:0xfffffe01438a8a80
frame pointer = 0x0:0xfffffe01438a8aa0
code segment = base 0x0, limit 0xfffff, type 0x1b = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 2382 (find)
rdi: fffffe015b22d700 rsi: 82000 rdx: fffffe01438a8ad8
rcx: 1 r8: 246 r9: 40000
rax: 0 rbx: fffffe015b22d700 rbp: fffffe01438a8aa0
r10: 1 r11: 0 r12: 80000
r13: fffffe01599802c0 r14: fffffe01438a8ad8 r15: 82000
trap number = 12
panic: page fault
cpuid = 2
time = 1679981682
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe01438a8840
vpanic() at vpanic+0x152/frame 0xfffffe01438a8890
panic() at panic+0x43/frame 0xfffffe01438a88f0
trap_fatal() at trap_fatal+0x409/frame 0xfffffe01438a8950
trap_pfault() at trap_pfault+0xab/frame 0xfffffe01438a89b0
calltrap() at calltrap+0x8/frame 0xfffffe01438a89b0
--- trap 0xc, rip = 0xffffffff82753417, rsp = 0xfffffe01438a8a80, rbp = 0xfffffe01438a8aa0 ---
unionfs_root() at unionfs_root+0x17/frame 0xfffffe01438a8aa0
vfs_lookup() at vfs_lookup+0x92a/frame 0xfffffe01438a8b40
namei() at namei+0x340/frame 0xfffffe01438a8bc0
kern_statat() at kern_statat+0x12f/frame 0xfffffe01438a8d00
sys_fstatat() at sys_fstatat+0x2f/frame 0xfffffe01438a8e00
amd64_syscall() at amd64_syscall+0x15a/frame 0xfffffe01438a8f30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe01438a8f30
--- syscall (552, FreeBSD ELF64, fstatat), rip = 0x1e9fa9c98, rbp = 0x1e9fa1989db0 ---
https://people.freebsd.org/~pho/stress/log/log0429.txt
PS
The BIOS and IPMI firmware was just updated on this test box, but the page fault seems legit to me?

Hmm. I'm probably missing something, but on initial investigation the panic doesn't seem to make sense.
It really looks as though unionfs_root() is seeing a partially constructed mount object: mp has the unionfs ops vector, but mnt_data is NULL (thus the page fault) and the stat object's fsid is 0.
This is consistent with the mount object state that would exist partway through unionfs_domount(), or if unionfs_domount() failed due to failure of unionfs_nodeget() or vfs_register_upper_from_vp(). There is another thread that appears to be partway through unionfs_domount(), and mp's busy count (mnt_lockref) of 2 is consistent with these 2 threads.
What doesn't make sense is how vfs_lookup() could observe a mount object in this state in the first place; vfs_domount_first() doesn't set coveredvp->v_mountedhere until after successful completion of VFS_MOUNT(), and it does so with coveredvp locked exclusive to avoid racing vfs_lookup().

Some things to note:
- Besides a couple of added comments, this is the same patch you tested successfully at the end of January.
- Since then, I have found a couple of bugs in the cleanup logic in unionfs_domount() and unionfs_nodeget(), which I'll post in a separate review after this one, but I don't think they would explain this behavior.
- There does appear to be a third thread calling dounmount() and blocked on an FFS lock, but it's unclear if that has any impact on the crash.
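For reference, the fault site is consistent with something like the following simplified paraphrase of unionfs_root() (not the verbatim source; the um_rootvp field name is taken from the unionfs mount data and should be treated as an assumption here): with mnt_data still NULL on a partially constructed mount, the root-vnode load dereferences a small offset from NULL, matching the reported fault address of 0x10.

```
#include <sys/param.h>
#include <sys/lock.h>
#include <sys/lockmgr.h>
#include <sys/mount.h>
#include <sys/vnode.h>

#include <fs/unionfs/union.h>

/*
 * Paraphrase of the shape of unionfs_root(), not the verbatim source.
 * With mp->mnt_data still NULL on a partially constructed mount, the
 * um_rootvp load below dereferences a small offset from NULL.
 */
static int
example_unionfs_root(struct mount *mp, int flags, struct vnode **vpp)
{
	struct unionfs_mount *ump;
	struct vnode *vp;

	ump = MOUNTTOUNIONFSMOUNT(mp);	/* (struct unionfs_mount *)mp->mnt_data */
	vp = ump->um_rootvp;		/* faults here when mnt_data is NULL */
	vref(vp);
	vn_lock(vp, flags | LK_RETRY);
	*vpp = vp;
	return (0);
}
```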
In D39272#894669, @pho wrote:
I ran into this problem:
Fatal trap 12: page fault while in kernel mode
cpuid = 2; apic id = 02
fault virtual address = 0x10
fault code = supervisor read data, page not present
instruction pointer = 0x20:0xffffffff82753417
stack pointer = 0x0:0xfffffe01438a8a80
frame pointer = 0x0:0xfffffe01438a8aa0
code segment = base 0x0, limit 0xfffff, type 0x1b = DPL 0, pres 1, long 1, def32 0, gran 1
processor eflags = interrupt enabled, resume, IOPL = 0
current process = 2382 (find)
rdi: fffffe015b22d700 rsi: 82000 rdx: fffffe01438a8ad8
rcx: 1 r8: 246 r9: 40000
rax: 0 rbx: fffffe015b22d700 rbp: fffffe01438a8aa0
r10: 1 r11: 0 r12: 80000
r13: fffffe01599802c0 r14: fffffe01438a8ad8 r15: 82000
trap number = 12
panic: page fault
cpuid = 2
time = 1679981682
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe01438a8840
vpanic() at vpanic+0x152/frame 0xfffffe01438a8890
panic() at panic+0x43/frame 0xfffffe01438a88f0
trap_fatal() at trap_fatal+0x409/frame 0xfffffe01438a8950
trap_pfault() at trap_pfault+0xab/frame 0xfffffe01438a89b0
calltrap() at calltrap+0x8/frame 0xfffffe01438a89b0
--- trap 0xc, rip = 0xffffffff82753417, rsp = 0xfffffe01438a8a80, rbp = 0xfffffe01438a8aa0 ---
unionfs_root() at unionfs_root+0x17/frame 0xfffffe01438a8aa0
vfs_lookup() at vfs_lookup+0x92a/frame 0xfffffe01438a8b40
namei() at namei+0x340/frame 0xfffffe01438a8bc0
kern_statat() at kern_statat+0x12f/frame 0xfffffe01438a8d00
sys_fstatat() at sys_fstatat+0x2f/frame 0xfffffe01438a8e00
amd64_syscall() at amd64_syscall+0x15a/frame 0xfffffe01438a8f30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe01438a8f30
--- syscall (552, FreeBSD ELF64, fstatat), rip = 0x1e9fa9c98, rbp = 0x1e9fa1989db0 ---
https://people.freebsd.org/~pho/stress/log/log0429.txt
PS
The BIOS and IPMI firmware was just updated on this test box, but the page fault seems legit to me?
Mar 26 2023
In D39272#894131, @kib wrote:
Regarding the VOP_MKDIR() change: please note that almost any VOP modifying metadata could drop the vnode lock. The list includes VOP_CREAT(), VOP_LINK(), VOP_REMOVE(), VOP_WHITEOUT(), VOP_RMDIR(), VOP_MAKEINODE(), and VOP_RENAME().
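As an illustration of why that matters for a stacked filesystem, here is a hypothetical caller-side sketch (example_mkdir() is invented and is not code from this review): after any of the metadata VOPs above, the directory vnode must be re-checked, since the lock may have been transiently dropped and a concurrent forced unmount may have doomed it.

```
#include <sys/param.h>
#include <sys/errno.h>
#include <sys/lock.h>
#include <sys/lockmgr.h>
#include <sys/mount.h>
#include <sys/namei.h>
#include <sys/vnode.h>

/*
 * Hypothetical caller of VOP_MKDIR().  The VOP may transiently drop the
 * directory vnode lock, so dvp can be doomed by a concurrent forced unmount
 * before the call returns, invalidating dvp->v_mount.
 */
static int
example_mkdir(struct vnode *dvp, struct componentname *cnp, struct vattr *vap,
    struct vnode **vpp)
{
	int error;

	ASSERT_VOP_ELOCKED(dvp, __func__);
	error = VOP_MKDIR(dvp, vpp, cnp, vap);
	if (error == 0 && VN_IS_DOOMED(dvp)) {
		/*
		 * The lock was dropped inside the VOP and dvp was doomed.
		 * The recovery policy here (drop the new vnode and fail the
		 * call) is purely illustrative.
		 */
		vput(*vpp);
		*vpp = NULL;
		error = ENOENT;
	}
	return (error);
}
```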
Feb 10 2023
Jan 19 2023
Jan 16 2023
In D38091#865308, @kib wrote:
Could you please show an example of the new output?