
jah (Jason A. Harmening)
User

Projects

User Details

User Since
Mar 25 2015, 11:24 AM (497 w, 3 d)

Recent Activity

Mon, Sep 9

jah committed rG8fa5e0f21fd1: tmpfs: Account for whiteouts during rename/rmdir (authored by jah).
tmpfs: Account for whiteouts during rename/rmdir
Mon, Sep 9, 12:02 AM
jah committed rG2ed053cde558: vfs: Add IGNOREWHITEOUT flag and adopt it in UFS/unionfs (authored by jah).
vfs: Add IGNOREWHITEOUT flag and adopt it in UFS/unionfs
Mon, Sep 9, 12:02 AM
jah closed D45987: tmpfs: Account for whiteouts during rename/rmdir.
Mon, Sep 9, 12:02 AM

Aug 14 2024

jah updated the diff for D45987: tmpfs: Account for whiteouts during rename/rmdir.
  • Clarifying tweaks for the tmpfs implementation
Aug 14 2024, 1:44 AM

Aug 6 2024

jah added a comment to D45987: tmpfs: Account for whiteouts during rename/rmdir.

Since whiteout-only directory removal will no longer succeed by default outside the union view, should we also change 'rm -rf' to allow whiteouts to be forcibly removed? As @olce alluded to earlier, it looks as though this might just be a one-line change in bin/rm/rm.c to specify FTS_WHITEOUT in the 'fflag' case.
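For illustration, a rough sketch of what that change could look like in rm_tree() in bin/rm/rm.c, reusing the existing fflag, Wflag, and needstat variables (an assumption about the shape of the patch, not a committed diff):

    flags = FTS_PHYSICAL;
    if (!needstat)
            flags |= FTS_NOSTAT;
    if (Wflag)
            flags |= FTS_WHITEOUT;
    if (fflag)
            flags |= FTS_WHITEOUT;  /* proposed: let 'rm -rf' remove whiteouts too */
    if ((fts = fts_open(argv, flags, NULL)) == NULL)
            err(1, "fts_open");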

Aug 6 2024, 5:03 AM
jah added a reviewer for D45987: tmpfs: Account for whiteouts during rename/rmdir: mckusick.
Aug 6 2024, 4:42 AM
jah updated the diff for D45987: tmpfs: Account for whiteouts during rename/rmdir.

Add IGNOREWHITEOUT flag and adopt in UFS and tmpfs.

Aug 6 2024, 4:40 AM

Aug 5 2024

jah added a comment to D45987: tmpfs: Account for whiteouts during rename/rmdir.

Would you perhaps rename DROPWHITEOUT to IGNOREWHITEOUT? I may want to reuse this flag at some point for VOP_READDIR().

Aug 5 2024, 6:37 PM
jah attached a referenced file: F89850596: patch_dropwhiteout.
Aug 5 2024, 4:15 PM
jah changed the visibility for F89850596: patch_dropwhiteout.
Aug 5 2024, 4:14 PM

Aug 2 2024

jah added a comment to D45987: tmpfs: Account for whiteouts during rename/rmdir.

@olce Reading back through the discussion here, I'm not sure we're in as much disagreement as it initially seemed.

Aug 2 2024, 2:27 AM

Jul 26 2024

jah added a comment to D45987: tmpfs: Account for whiteouts during rename/rmdir.
In D45987#1052043, @jah wrote:

Yes, discarding the whiteout data does strike me as odd, I've already mentioned that.

Sure, but explaining the why is important, else there's no discussion possible. I don't think an architecture for unionfs where all behaviors are "unsurprising" to users at first sight exists. What I am seeking, though, is to have consistent, understandable, and most of the time unsurprising behaviors, at least in connection to security, data loss, etc.

But whiteouts themselves are quite odd: as you mentioned they're not directory metadata. They might be thought of as file metadata for files contained within the lower directory, but they are file metadata whose purpose is to allow the user and other parts of the system to pretend those files don't exist. Given that, a directory whose upper FS component contains only whiteouts that completely negate the contents of its lower FS component is meant to be seen as logically empty by the user.

That's a common way of thinking about these, but I'm in clear disagreement with it. It conflates what is visible through the union view with how the latter is concretely built out of the (upper and lower) layers. Once one recognizes them as conceptually separate, there is really no reason to see a directory holding whiteouts as empty, even if these whiteouts are meant to "delete" some of the lower layer's files *in the union view*. Moreover, considering such directories empty, in the context of the upper layer on its own, basically discards the existence of these whiteouts in a context where the operator should be in a position to tweak at will the pieces that ultimately make up the union view, and thus should be presented with all of them. For people not using UFS at all for unions, or using it only as a layer backing store that they never tweak directly, the chosen behavior doesn't matter at all, since they will never have to deal with whiteouts.

Jul 26 2024, 10:52 PM
jah added a comment to D45987: tmpfs: Account for whiteouts during rename/rmdir.
In D45987#1051615, @jah wrote:

I think we need to be extremely careful about reverting UFS behavior that's been in place for 27 years

Aren't we?

just because it's inconsistent with Linux and/or ZFS.

Please... This isn't what I wrote. These are secondary, but still important, reasons.

Jul 26 2024, 2:45 PM

Jul 25 2024

jah added a comment to D45987: tmpfs: Account for whiteouts during rename/rmdir.

I do not remember why removing directories with whiteouts became allowed. The changes in commit 996c772f581f56248 are quite extensive. If you can let me know which file(s) in that commit have the relevant changes, I can go back through the changes made in those files for the Lite/2 release to see if any explanation shows up.

Sure, I'm referring to:

--- a/sys/ufs/ufs/ufs_lookup.c
+++ b/sys/ufs/ufs/ufs_lookup.c
@@ -906,7 +961,7 @@ ufs_dirempty(ip, parentino, cred)
                if (dp->d_reclen == 0)
                        return (0);
                /* skip empty entries */
-               if (dp->d_ino == 0)
+               if (dp->d_ino == 0 || dp->d_ino == WINO)
                        continue;
                /* accept only "." and ".." */
 #              if (BYTE_ORDER == LITTLE_ENDIAN)

part of function ufs_dirempty(), which is (now, but if I'm not mistaken also was then) called only by ufs_rename() and ufs_rmdir().

Barring other reasons I can't see now, I think we'd better revert that behavior, both for logical reasons (the directory references no file but still has whiteout entries that we would silently lose) and practical ones (e.g., on Linux, whiteouts are implemented as device files, and thus rmdir() on directories holding some will fail with ENOTEMPTY; reverting would also ease consistency with the ZFS code).
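For illustration, a minimal sketch of the direction ultimately taken (the IGNOREWHITEOUT flag adopted in the commits above): the ufs_dirempty() loop would skip whiteout entries only when explicitly asked to. The ignorewhiteout variable below is a hypothetical stand-in for however the flag is threaded down, not the committed code:

    if (dp->d_reclen == 0)
            return (0);
    /* skip empty entries; skip whiteouts only when asked to */
    if (dp->d_ino == 0 ||
        (ignorewhiteout && dp->d_ino == WINO))
            continue;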

Jul 25 2024, 6:41 PM

Jul 17 2024

jah added a comment to D45987: tmpfs: Account for whiteouts during rename/rmdir.

I've left a pretty comprehensive comment about why we would currently prefer these semantics in the generic vn_dir_check_empty().

I'm not sure however if we really want this behavior in the future, regardless of the filesystem. Here, the emptiness test is a guard for removing a directory, which is a different situation from considering whether a directory is empty to mount over it (which is the only situation in which vn_dir_check_empty() is currently called). In the latter, we just ensure we are not shadowing data (and this cannot affect some unionfs using the filesystem of the directory mounted upon). In the present situation, the test is used to decide whether to remove a directory and, although it doesn't contain files, it does contain data in the form of whiteout entries. In this light, removing the whiteouts silently seems odd.

I suspect that removing a directory with whiteouts was allowed for UFS only because not doing so would confuse most users not aware of whiteouts, as these are not printed by a regular ls. I doubt there was any other reason than that. It appears the test on WINO wasn't in 4.4BSD and was introduced in the Lite/2 commit (996c772f581f56248, r22521). Adding @mckusick as he might have some recollection of what led to this decision.

Jul 17 2024, 6:41 PM
jah updated the diff for D45987: tmpfs: Account for whiteouts during rename/rmdir.
  • Move clearing of rename target dir to more logical location
Jul 17 2024, 4:09 AM
jah updated the diff for D45987: tmpfs: Account for whiteouts during rename/rmdir.
  • Move tmpfs_dir_clear() calls after error checking and detachment from parent
Jul 17 2024, 3:15 AM
jah requested review of D45987: tmpfs: Account for whiteouts during rename/rmdir.
Jul 17 2024, 2:50 AM

Jul 13 2024

jah committed rG9b505845a3ae: unionfs: fix LINT build (authored by jah).
unionfs: fix LINT build
Jul 13 2024, 3:42 AM
jah committed rG53a777bb52e3: unionfs: do not create a new status object during vop_close() (authored by jah).
unionfs: do not create a new status object during vop_close()
Jul 13 2024, 1:40 AM
jah committed rGeb60ff1ee16a: unionfs: rework locking scheme to only lock a single vnode (authored by jah).
unionfs: rework locking scheme to only lock a single vnode
Jul 13 2024, 1:40 AM
jah closed D45398: unionfs: rework locking scheme to only lock a single vnode.
Jul 13 2024, 1:40 AM

Jul 6 2024

jah added inline comments to D45398: unionfs: rework locking scheme to only lock a single vnode.
Jul 6 2024, 4:40 AM

Jul 5 2024

jah updated the diff for D45398: unionfs: rework locking scheme to only lock a single vnode.
  • Use the common cleanup path upon completion of dot-dot lookup. This simplifies the code and avoids attempted creation of a negative dot-dot VFS cache entry for a doomed vnode.
Jul 5 2024, 11:44 PM
jah added a comment to D45398: unionfs: rework locking scheme to only lock a single vnode.
In D45398#1046132, @pho wrote:

I got this panic with union19.sh:

20240705 20:25:38 all (1/1): unionfs19.sh
VNASSERT failed: !__builtin_expect(((_Generic(*(&(dvp)->v_irflag), short: (*(volatile u_short *)(&(dvp)->v_irflag)), u_short: VNASSERT failed: !__builtin_expect(((_Generic(*(&(dvp)->v_irflag), short: (*(volatile u_short *)(&(dvp)->v_irflag)), u_short: (*(volatile u_short *)(&(dvp)->v_irflag))) & 0x0001) != 0), 0) not true at ../../../kern/vfs_cache.c:2481 (cache_enter_time)
VNASSERT failed: !__builtin_expect(((_Generic(*(&(dvp)->v_irflag), short: (*(volatile u_short *)(&(dvp)->v_irflag)), u_short: VNASSERT failed: !__builtin_expect(((_Generic(*(&(dvp)->v_irflag), short: (*(volatile u_short *)(&(dvp)->v_irflag)), u_short: (*(volatile u_short *)(&(dvp)->v_irflag))) & 0x0001) != 0), 0) not true at ../../../kern/vfs_cache.c:2481 (cache_enter_time)
VNASSERT failed: !__builtin_expect(((_Generic(*(&(dvp)->v_irflag), short: (*(volatile u_short *)(&(dvp)->v_irflag)), u_short: (*(volatile u_short *)(&(dvp)->v_irflag))) & 0x0001) != 0), 0) not true at ../../../kern/vfs_cache.c:2481 (cache_enter_time)
0xfffffe016f682bb8: (*(volatile u_short *)(&(dvp)->v_irflag))) & 0x0001) != 0), 0) not true at ../../../kern/vfs_cache.c:2481 (cache_enter_time)
(*(volatile u_short *)(&(dvp)->v_irflag))) & 0x0001) != 0), 0) not true at ../../../kern/vfs_cache.c:2481 (cache_enter_time)
0xfffffe016f6d04b0: 0xfffffe016f1aa068: type VBAD state VSTATE_DEAD op 0xffffffff818ac760
    usecount 2, writecount 0, refcount 1 seqc users 10xfffffe016ffba068: type VBAD state VSTATE_DEAD op 0xffffffff818ac760

    hold count flags ()
0xfffffe016fed0000:     flags (VIRF_DOOMED)type VBAD state VSTATE_DEAD op 0xffffffff818ac760
    usecount 2, writecount 0, refcount 1 seqc users 1type VBAD state VSTATE_DEAD op 0xffffffff818ac760
type VBAD state VSTATE_DEAD op 0xffffffff818ac760
    usecount 2, writecount 0, refcount 1 seqc users 1
    hold count flags ()

    lock type unionfs: EXCL by thread 0xfffff802f76c7740 (pid 92551, mmap, tid 100429)

    usecount 2, writecount 0, refcount 1 seqc users 1
    usecount 2, writecount 0, refcount 1 seqc users 1    hold count flags ()
    flags (VIRF_DOOMED)    hold count flags ()

    flags (VIRF_DOOMED)    flags (VIRF_DOOMED)#0 0xffffffff80b14d98 at lockmgr_lock_flags+0x1b8
    
    #1 0xffffffff8113416a at VOP_LOCK1_APV+0x3a
lock type unionfs: EXCL by thread 0xfffff802f72a0740 (pid 92547, mmap, tid 100442)
#2 0xffffffff80c5ae03 at _vn_lock+0x53

    hold count flags ()
    flags (VIRF_DOOMED)
    #3 0xffffffff80c60017 at vn_lock_pair+0x3d7

lock type unionfs: EXCL by thread 0xfffff806a0787740 (pid 92565, mmap, tid 100465)
#0 0xffffffff80b14d98 at lockmgr_lock_flags+0x1b8
#0 0xffffffff80b14d98 at lockmgr_lock_flags+0x1b8
#1 0xffffffff8113416a at VOP_LOCK1_APV+0x3a
lock type unionfs: EXCL by thread 0xfffff802132f4740 (pid 92567, mmap, tid 100436)
#1 0xffffffff8113416a at VOP_LOCK1_APV+0x3a
    #0 0xffffffff80b14d98 at lockmgr_lock_flags+0x1b8
#2 0xffffffff80c5ae03 at _vn_lock+0x53
#4 0xffffffff8284e086 at unionfs_forward_vop_finish_pair+0x176
#1 0xffffffff8113416a at VOP_LOCK1_APV+0x3a
#3 0xffffffff80c60017 at vn_lock_pair+0x3d7
#2 0xffffffff80c5ae03 at _vn_lock+0x53
#5 0xffffffff8285014d at unionfs_lookup+0x22d
#2 0xffffffff80c5ae03 at _vn_lock+0x53
#3 0xffffffff80c60017 at vn_lock_pair+0x3d7
#6 0xffffffff81130c0f at VOP_CACHEDLOOKUP_APV+0x5f
lock type unionfs: EXCL by thread 0xfffff8021314b740 (pid 92569, mmap, tid 100415)
#3 0xffffffff80c60017 at vn_lock_pair+0x3d7
#7 0xffffffff80c20c36 at vfs_cache_lookup+0xa6
#0 0xffffffff80b14d98 at lockmgr_lock_flags+0x1b8
#4 0xffffffff8284e086 at unionfs_forward_vop_finish_pair+0x176
#8 0xffffffff81130a4f at VOP_LOOKUP_APV+0x5f
#1 0xffffffff8113416a at VOP_LOCK1_APV+0x3a
#9 0xffffffff80c321c7 at vfs_lookup+0x487
#2 0xffffffff80c5ae03 at _vn_lock+0x53
#10 0xffffffff80c312e1 at namei+0x2d1
#4 0xffffffff8284e086 at unionfs_forward_vop_finish_pair+0x176
#11 0xffffffff80c50223 at kern_chdir+0xc3
#4 0xffffffff8284e086 at unionfs_forward_vop_finish_pair+0x176
#12 0xffffffff8106a8a8 at amd64_syscall+0x158
#5 0xffffffff8285014d at unionfs_lookup+0x22d
#13 0xffffffff8103bd0b at fast_syscall_common+0xf8
panic: condition !VN_IS_DOOMED(dvp) not met at ../../../kern/vfs_cache.c:2481 (cache_enter_time)
cpuid = 2
time = 1720203945
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe010874e780
vpanic() at vpanic+0x13f/frame 0xfffffe010874e8b0
panic() at panic+0x43/frame 0xfffffe010874e910
cache_enter_time() at cache_enter_time+0x145b/frame 0xfffffe010874e9f0
unionfs_lookup() at unionfs_lookup+0x7fa/frame 0xfffffe010874eb50
VOP_CACHEDLOOKUP_APV() at VOP_CACHEDLOOKUP_APV+0x5f/frame 0xfffffe010874eb80
vfs_cache_lookup() at vfs_cache_lookup+0xa6/frame 0xfffffe010874ebd0
VOP_LOOKUP_APV() at VOP_LOOKUP_APV+0x5f/frame 0xfffffe010874ec00
vfs_lookup() at vfs_lookup+0x487/frame 0xfffffe010874ec80
namei() at namei+0x2d1/frame 0xfffffe010874ece0
kern_chdir() at kern_chdir+0xc3/frame 0xfffffe010874ee00
amd64_syscall() at amd64_syscall+0x158/frame 0xfffffe010874ef30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe010874ef30
--- syscall (12, FreeBSDdf55a, rsp = 0x3a114032c808, rbp = 0x3a114032c810 ---

https://people.freebsd.org/~pho/stress/log/log0535.txt

Jul 5 2024, 7:50 PM
jah added a comment to D45398: unionfs: rework locking scheme to only lock a single vnode.

@pho You may want to run unionfs19.sh against this patchset. I believe the "forward VOP" guards added in unionfs_lookup() will address the panic seen there.

Jul 5 2024, 4:18 PM
jah updated the diff for D45398: unionfs: rework locking scheme to only lock a single vnode.
  • Avoid leaking a locked+referenced vnode in the dot-dot lookup case
  • Avoid returning a doomed vnode in the dot-dot lookup case
  • Simplify a few locking-related checks in unionfs_lookup()
Jul 5 2024, 4:12 PM

Jul 2 2024

jah updated the diff for D45398: unionfs: rework locking scheme to only lock a single vnode.
  • Add and clarify some comments
Jul 2 2024, 10:10 PM

Jun 18 2024

jah updated the diff for D45398: unionfs: rework locking scheme to only lock a single vnode.
  • unionfs: do not create a new status object during vop_close()
  • Review feedback from olce@, fix for hung process reported by pho@
Jun 18 2024, 6:05 AM

Jun 14 2024

jah added inline comments to D45398: unionfs: rework locking scheme to only lock a single vnode.
Jun 14 2024, 3:49 PM
jah added inline comments to D45398: unionfs: rework locking scheme to only lock a single vnode.
Jun 14 2024, 2:02 AM

Jun 13 2024

jah added inline comments to D45398: unionfs: rework locking scheme to only lock a single vnode.
Jun 13 2024, 1:43 AM
jah added inline comments to D45398: unionfs: rework locking scheme to only lock a single vnode.
Jun 13 2024, 1:33 AM
jah added a comment to D45398: unionfs: rework locking scheme to only lock a single vnode.

Well, I'm almost done, but not completely (I need to finish reviewing unionfs_lock() and the lines below it). Here's a new round of comments/notes/requests.

Concerning unionfs_mkshadowdir(), I've added a series of suggested minor changes. These are mostly to clarify and simplify things (and, I suspect, marginally improve concurrency). You may as well forget about them if you think this is too much work for now (because, e.g., they would cause a new long round of testing).

The main problems I've spotted so far are the lack of some unlocks in unionfs_lookup() and the VV_ROOT case in unionfs_noderem() probably not being handled correctly.

And, to correct something I've said:

The best approach, and perhaps the only sensible one, I have seen so far is that unionfs has to completely rely on lower layer filesystems for all operations, and for this must get rid of any cache it maintains from data on them.

Well, that's true, but only for information that can be modified by the underlying filesystems, which is what I had in mind when writing that. However, the identity of the vnodes themselves is not included in these, so in fact a cache based on vnode identities (and v_hash) is perfectly feasible, and will indeed allow us to avoid the duplicate vnode problem in the lookup, and elsewhere. I'll also have to check for the other foreseen uses developed in the full proposal.

Jun 13 2024, 12:43 AM

Jun 12 2024

jah added inline comments to D45398: unionfs: rework locking scheme to only lock a single vnode.
Jun 12 2024, 1:07 PM
jah added a comment to D45398: unionfs: rework locking scheme to only lock a single vnode.
In D45398#1039564, @pho wrote:

I ran a longer test with this patch and found this:

20240612 02:19:29 all (18/18): unionfs9.sh
ps -lp55088
UID   PID  PPID C PRI NI   VSZ  RSS MWCHAN  STAT TT     TIME COMMAND
  0 55088 37241 6  68  0 13088 2324 unioncp I+    0  0:00.01 find /mnt12 -type f -maxdepth 2 -ls
You have new mail.
root@mercat1:~ # procstat -k 55088
  PID    TID COMM                TDNAME              KSTACK                       
55088 100476 find                -                   mi_switch sleepq_switch sleepq_catch_signals sleepq_wait_sig _sleep unionfs_set_in_progress_flag unionfs_lookup VOP_CACHEDLOOKUP_APV vfs_cache_lookup VOP_LOOKUP_APV vfs_lookup namei kern_statat sys_fstatat amd64_syscall fast_syscall_common 
root@mercat1:~ #

https://people.freebsd.org/~pho/stress/log/log0526.txt

Jun 12 2024, 12:37 PM

May 29 2024

jah added a comment to D45398: unionfs: rework locking scheme to only lock a single vnode.

These changes have proven stable in the testing I've done so far, but I still see this as a somewhat rough patch.
I expect there will be plenty of review feedback, but my hope is that what I have here can at least be used as a starting point to fix unionfs locking.

May 29 2024, 5:03 AM
jah requested review of D45398: unionfs: rework locking scheme to only lock a single vnode.
May 29 2024, 4:51 AM

Apr 29 2024

jah committed rG05e8ab627bc6: unionfs_rename: fix numerous locking issues (authored by jah).
unionfs_rename: fix numerous locking issues
Apr 29 2024, 1:28 AM
jah closed D44788: unionfs_rename: fix numerous locking issues.
Apr 29 2024, 1:27 AM

Apr 27 2024

jah committed rGc8d6c9351a92: unionfs_lookup(): fix wild accesses to vnode private data (authored by jah).
unionfs_lookup(): fix wild accesses to vnode private data
Apr 27 2024, 5:47 PM

Apr 20 2024

jah added a comment to D44788: unionfs_rename: fix numerous locking issues.
In D44788#1022769, @jah wrote:
  • Fix fdvp lock recursion during file copy-up; use ERELOOKUP to simplify

Currently traveling without much Internet connectivity, so this is not a thorough review; I'll try to come back to it this weekend.

I much prefer this version: Use of flags is indeed much clearer now, and ERELOOKUP simplifies things. I thought that @kib was referring to ERELOOKUP processed by vfs_lookup(), but it is also understood by kern_renameat(). I agree in general with this approach, although I don't think its current implementation in the generic code is satisfying (this should be documented in VOP_RENAME()'s contract; additionally, NFS directly calls VOP_RENAME() and doesn't support it; note to myself: we are lacking in consistency and reliability here).

Given the automatic shadow directory creation we do in unionfs_lookup(), I wonder if this code is really necessary. Of course it's always possible for another thread to concurrently delete the newly-created directory, but should we instead just ERELOOKUP in that case? Is there some other corner case I'm just not thinking of?

In the current framework, shadow directories are expected to be created at lookup, but this is something that will become configurable. IIRC, there are other places where unionfs doesn't try too hard to guarantee their existence, and simply fails with EROFS when they are not there (on the simplistic assumption that they couldn't be created at lookup time because the mount is read-only). I think however this is a liability, so I would prefer that we keep that code (and actually enforce this guarantee elsewhere, through proper modularization).

Instead of a simplified version of vfs_relookup(), would the ultimate solution instead be to follow your design and move copy-up (or placeholder creation) to a much earlier point in the VFS operation lifecycle? In other words, if the copy-up/placeholder operation could be done at a point close enough to the initial lookup that the parent directory vnode is still locked from that lookup, while the same lookup has also already determined that the child file does not exist on the upper FS, could the need to handle EEXIST (and therefore the need for relookup) be eliminated entirely?

The answer is similar here: it could be, but ultimately, for reliability, the same copy-up code should exist in unionfs_rename() (and all of this should be factored into one place, as much as possible), at least in the current framework. Actually, for the directory rename case, it's even worse, since all files inside should be copied up (whether as plain copies or placeholders). In the end, to call VOP_RENAME() on the upper layer, you have to relock the destinations and unlock the source ones; there will always be a window of time where someone else could mess with your copy. This is a fundamental problem of the rename protocol (and probably of VFS locking in general). In the meantime, isn't it still better to keep preparing the upper layer as close as possible to the actual VOP_RENAME() call, as long as this doesn't complicate your work too much?

Apr 20 2024, 11:30 PM

Apr 19 2024

jah added inline comments to D44788: unionfs_rename: fix numerous locking issues.
Apr 19 2024, 2:50 AM
jah updated the diff for D44788: unionfs_rename: fix numerous locking issues.
  • Fix fdvp lock recursion during file copy-up; use ERELOOKUP to simplify
Apr 19 2024, 2:44 AM

Apr 17 2024

jah added inline comments to D44788: unionfs_rename: fix numerous locking issues.
Apr 17 2024, 9:14 PM

Apr 16 2024

jah added inline comments to D44788: unionfs_rename: fix numerous locking issues.
Apr 16 2024, 2:33 AM

Apr 15 2024

jah added a comment to D44788: unionfs_rename: fix numerous locking issues.

The main problem that I see with these changes is that they lead to dropping support for FSes with non-recursive locking (see some of the inline comments). I think there are also some minor problems in the locking/relookup logic (see the inline comments as well).

Besides, unionfs_rename() still has numerous problems beyond locking, and I'm wondering if it's worth it for everybody to pursue this direction before I've started the unionfs project, which will include an overhaul of its fundamentals. It will probably take less time to mostly rewrite it from there than to try to fix all these deficiencies, especially given that the most fundamental ones are not readily visible even with runs of stress2.

In D44788#1020876, @jah wrote:
In D44788#1020860, @kib wrote:

Could you try to (greatly) simplify unionfs rename by using ERELOOKUP? For instance, it can be split into two essentially independent cases: 1. need to copy fdvp from lower to upper (and return ERELOOKUP) 2. Just directly call VOP_RENAME() on upper if copy is not needed.

Splitting out the copy-up before calling VOP_RENAME() is a necessity, independently of ERELOOKUP, to be able to restart/cancel an operation that is interrupted (by, e.g., a system crash). With ERELOOKUP, part of the code should go into or be called from unionfs_lookup() instead. I doubt this will simplify things per se, i.e., more than extracting the code to a helper function would do. Later on, as placeholders are implemented, no such copy should even be necessary, which makes it apparent that unionfs_lookup() is not a good place to make that decision/undertake that action.

I think that's a good idea; ERELOOKUP is probably what we really want to use in most (all?) of the cases in which we currently use unionfs_relookup_*. There will be some penalty for making the vfs_syscall layer re-run the entire lookup instead of re-running only the last level, but those cases are never on the fast path anyway.

ERELOOKUP restarts the lookup at the latest reached directory, so I'm not sure which penalty you are talking about.
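For illustration, a rough sketch of the two-case split @kib describes (not the actual unionfs_rename() code; unionfs_copy_to_upper() and upper_of() are hypothetical stand-ins for the existing helpers, and locking/reference handling is elided):

    struct unionfs_node *unp = VTOUNIONFS(ap->a_fvp);
    int error;

    if (unp->un_uppervp == NULL) {
            /*
             * Case 1: the source exists only on the lower layer.
             * Copy it up, then have the caller redo the lookup so the
             * retried rename sees the upper vnode.
             */
            error = unionfs_copy_to_upper(ap->a_fvp);
            return (error != 0 ? error : ERELOOKUP);
    }
    /* Case 2: everything is already on the upper layer; forward the rename. */
    return (VOP_RENAME(upper_of(ap->a_fdvp), unp->un_uppervp, ap->a_fcnp,
        upper_of(ap->a_tdvp), ap->a_tvp != NULL ? upper_of(ap->a_tvp) : NULL,
        ap->a_tcnp));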

Apr 15 2024, 8:36 PM

Apr 14 2024

jah added a comment to D44788: unionfs_rename: fix numerous locking issues.
In D44788#1020860, @kib wrote:

Could you try to (greatly) simplify unionfs rename by using ERELOOKUP? For instance, it can be split into two essentially independent cases: 1. need to copy fdvp from lower to upper (and return ERELOOKUP) 2. Just directly call VOP_RENAME() on upper if copy is not needed.

Apr 14 2024, 3:00 PM

Apr 13 2024

jah added inline comments to D44788: unionfs_rename: fix numerous locking issues.
Apr 13 2024, 10:44 PM
jah requested review of D44788: unionfs_rename: fix numerous locking issues.
Apr 13 2024, 10:37 PM

Apr 9 2024

jah committed rGb18029bc59d2: unionfs_lookup(): fix wild accesses to vnode private data (authored by jah).
unionfs_lookup(): fix wild accesses to vnode private data
Apr 9 2024, 10:38 PM
jah closed D44601: unionfs_lookup(): fix wild accesses to vnode private data.
Apr 9 2024, 10:38 PM

Apr 7 2024

jah committed rG7b86d14bfccb: unionfs: implement VOP_UNP_* and remove special VSOCK vnode handling (authored by jah).
unionfs: implement VOP_UNP_* and remove special VSOCK vnode handling
Apr 7 2024, 12:32 AM

Apr 3 2024

jah added a comment to D44601: unionfs_lookup(): fix wild accesses to vnode private data.

These changes plug holes indeed.

Side note: It's likely that I'll rewrite the whole lookup code at some point, so I'll have to test again for races. The problem with this kind of bug is that it is triggered only by rare races. We already have stress2, which is great, but it also relies on "chance". This makes me think that perhaps we could have some more systematic framework triggering vnode dooming, let's say, at unlock. I'll probably explore that at some point.

Apr 3 2024, 1:48 PM

Apr 2 2024

jah requested review of D44601: unionfs_lookup(): fix wild accesses to vnode private data.
Apr 2 2024, 10:44 PM

Mar 24 2024

jah committed rG61d9b0cb38bb: uipc_bindat(): Explicitly specify exclusive locking for the new vnode (authored by jah).
uipc_bindat(): Explicitly specify exclusive locking for the new vnode
Mar 24 2024, 3:06 AM
jah committed rG6d118b958612: unionfs: accommodate underlying FS calls that may re-lock (authored by jah).
unionfs: accommodate underlying FS calls that may re-lock
Mar 24 2024, 3:05 AM
jah committed rGb09b120818a8: vn_lock_pair(): allow lkflags1/lkflags2 to be 0 if vp1/vp2 is NULL (authored by jah).
vn_lock_pair(): allow lkflags1/lkflags2 to be 0 if vp1/vp2 is NULL
Mar 24 2024, 3:05 AM
jah committed rGeee6217b40df: unionfs: implement VOP_UNP_* and remove special VSOCK vnode handling (authored by jah).
unionfs: implement VOP_UNP_* and remove special VSOCK vnode handling
Mar 24 2024, 2:13 AM
jah closed D44288: unionfs: implement VOP_UNP_* and remove special VSOCK vnode handling.
Mar 24 2024, 2:12 AM

Mar 16 2024

jah updated the diff for D44288: unionfs: implement VOP_UNP_* and remove special VSOCK vnode handling.

Code review feedback, also remove a nonsensical check from unionfs_link()

Mar 16 2024, 3:49 PM

Mar 10 2024

jah requested review of D44288: unionfs: implement VOP_UNP_* and remove special VSOCK vnode handling.
Mar 10 2024, 2:10 AM
jah committed rG6c8ded001540: unionfs: accommodate underlying FS calls that may re-lock (authored by jah).
unionfs: accommodate underlying FS calls that may re-lock
Mar 10 2024, 1:59 AM
jah closed D44076: unionfs: accommodate underlying FS calls that may re-lock.
Mar 10 2024, 1:59 AM
jah closed D44047: uipc_bindat(): Explicitly specify exclusive locking for the new vnode.
Mar 10 2024, 1:53 AM
jah committed rGd56c175ac935: uipc_bindat(): Explicitly specify exclusive locking for the new vnode (authored by jah).
uipc_bindat(): Explicitly specify exclusive locking for the new vnode
Mar 10 2024, 1:53 AM
jah committed rGfa26f46dc29f: vn_lock_pair(): allow lkflags1/lkflags2 to be 0 if vp1/vp2 is NULL (authored by jah).
vn_lock_pair(): allow lkflags1/lkflags2 to be 0 if vp1/vp2 is NULL
Mar 10 2024, 1:47 AM
jah closed D44046: vn_lock_pair(): only assert on lkflags1/lkflags2 vp1/vp2 is non-NULL.
Mar 10 2024, 1:46 AM

Mar 4 2024

jah committed rG5e806288f0c7: unionfs: cache upper/lower mount objects (authored by jah).
unionfs: cache upper/lower mount objects
Mar 4 2024, 6:52 PM
jah committed rG9c530578757b: unionfs: upgrade the vnode lock during fsync() if necessary (authored by jah).
unionfs: upgrade the vnode lock during fsync() if necessary
Mar 4 2024, 6:52 PM
jah committed rGc18e6a5a5c63: unionfs: work around underlying FS failing to respect cn_namelen (authored by jah).
unionfs: work around underlying FS failing to respect cn_namelen
Mar 4 2024, 6:52 PM
jah committed rGd0bb255d1fcb: VFS: update VOP_FSYNC() debug check to reflect actual locking policy (authored by jah).
VFS: update VOP_FSYNC() debug check to reflect actual locking policy
Mar 4 2024, 6:52 PM

Feb 29 2024

jah updated the diff for D44076: unionfs: accommodate underlying FS calls that may re-lock.

Incorporate code review feedback from olce@

Feb 29 2024, 5:23 AM

Feb 24 2024

jah added a comment to D44076: unionfs: accommodate underlying FS calls that may re-lock.

This basically amounts to a generalized version of the mkdir()-specific fix I made last year in commit 93fe61afde72e6841251ea43551631c30556032d (of course in that commit I also inadvertently added a potential v_usecount ref leak on the new vnode). Or I guess it can be thought of as a tailored version of null_bypass().

Feb 24 2024, 11:59 PM
jah requested review of D44076: unionfs: accommodate underlying FS calls that may re-lock.
Feb 24 2024, 11:57 PM

Feb 23 2024

jah updated the diff for D44047: uipc_bindat(): Explicitly specify exclusive locking for the new vnode.

Only clear LK_SHARED

Feb 23 2024, 11:39 PM
jah updated the diff for D44046: vn_lock_pair(): only assert on lkflags1/lkflags2 vp1/vp2 is non-NULL.

Only allow lkflags to be 0 when the corresponding vnode is NULL

Feb 23 2024, 11:39 PM
jah added a comment to D44046: vn_lock_pair(): only assert on lkflags1/lkflags2 vp1/vp2 is non-NULL.
In D44046#1004894, @kib wrote:
In D44046#1004891, @jah wrote:
In D44046#1004890, @kib wrote:

So might be just allow zero flags if corresponding vp is NULL?

Sure, we could do that, but I'm curious: is there some reason why we should care what the lockflags are if there is no vnode to lock? What I have here seems more straightforward than making specific allowances for NULL vnodes.

I am concerned about reliable checking of the API contracts. Assume that some function calls vn_lock_pair() with externally-specified flags, and the corresponding vp could sometimes be NULL. I want such calls to always have correct flags, especially if vp != NULL is rare or cannot be easily exercised by normal testing.
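The shape that eventually addresses this (sketched from the review summary above, not copied from the committed diff) ties each lkflags value to its vnode, so a zero value is tolerated only when the matching vnode is NULL:

    MPASS(lkflags1 == LK_SHARED || lkflags1 == LK_EXCLUSIVE ||
        (vp1 == NULL && lkflags1 == 0));
    MPASS(lkflags2 == LK_SHARED || lkflags2 == LK_EXCLUSIVE ||
        (vp2 == NULL && lkflags2 == 0));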

Feb 23 2024, 8:58 PM
jah added a comment to D44046: vn_lock_pair(): only assert on lkflags1/lkflags2 vp1/vp2 is non-NULL.
In D44046#1004890, @kib wrote:

So might be just allow zero flags if corresponding vp is NULL?

Feb 23 2024, 6:09 PM
jah added a comment to D44046: vn_lock_pair(): only assert on lkflags1/lkflags2 vp1/vp2 is non-NULL.
In D44046#1004867, @kib wrote:

May I ask why? This allows passing any flags for null vnodes.

Feb 23 2024, 6:02 PM
jah added a comment to D44047: uipc_bindat(): Explicitly specify exclusive locking for the new vnode.

This is needed for upcoming work to adopt VOP_UNP_* in unionfs.

Feb 23 2024, 5:56 PM
jah requested review of D44047: uipc_bindat(): Explicitly specify exclusive locking for the new vnode.
Feb 23 2024, 5:55 PM
jah requested review of D44046: vn_lock_pair(): only assert on lkflags1/lkflags2 vp1/vp2 is non-NULL.
Feb 23 2024, 5:50 PM

Feb 18 2024

jah closed D43818: unionfs: workaround underlying FS failing to respect cn_namelen.
Feb 18 2024, 3:20 PM
jah committed rGa2ddbe019d51: unionfs: work around underlying FS failing to respect cn_namelen (authored by jah).
unionfs: work around underlying FS failing to respect cn_namelen
Feb 18 2024, 3:20 PM
jah closed D43817: unionfs: upgrade the vnode lock during fsync() if necessary.
Feb 18 2024, 3:18 PM
jah committed rG2656fc29be8b: unionfs: upgrade the vnode lock during fsync() if necessary (authored by jah).
unionfs: upgrade the vnode lock during fsync() if necessary
Feb 18 2024, 3:18 PM
jah committed rG9530182e371d: VFS: update VOP_FSYNC() debug check to reflect actual locking policy (authored by jah).
VFS: update VOP_FSYNC() debug check to reflect actual locking policy
Feb 18 2024, 3:17 PM
jah closed D43816: VFS: update VOP_FSYNC() debug check to reflect actual locking policy.
Feb 18 2024, 3:16 PM
jah committed rGcc3ec9f75978: unionfs: cache upper/lower mount objects (authored by jah).
unionfs: cache upper/lower mount objects
Feb 18 2024, 3:15 PM
jah closed D43815: unionfs: cache upper/lower mount objects.
Feb 18 2024, 3:15 PM

Feb 13 2024

jah added a comment to D43815: unionfs: cache upper/lower mount objects.
In D43815#1000687, @jah wrote:
In D43815#1000340, @jah wrote:

I don't think it can. Given the first point above, there can't be any unmount of some layer (even forced) until the unionfs mount on top is unmounted. As the layers' root vnodes are vrefed(), they can't become doomed (since unmount of their own FS is prevented), and consequently their v_mount is never modified (barring the ZFS rollback case). This is independent of holding (or not) any vnode lock.

Which doesn't say that there aren't any problems of the sort that you're reporting in unionfs; it's just a different matter.

That's not true; vref() does nothing to prevent a forced unmount from dooming the vnode, only holding its lock does this. As such, if the lock needs to be transiently dropped for some reason and the timing is sufficiently unfortunate, the concurrent recursive forced unmount can first unmount unionfs (dooming the unionfs vnode) and then the base FS (dooming the lower/upper vnode). The held references prevent the vnodes from being recycled (but not doomed), but even this isn't foolproof: for example, in the course of being doomed, the unionfs vnode will drop its references on the lower/upper vnodes, at which point they may become unreferenced unless additional action is taken. Whatever caller invoked the unionfs VOP will of course still hold a reference on the unionfs vnode, but this does not automatically guarantee that references will be held on the underlying vnodes for the duration of the call, due to the aforementioned scenario.

There is a misunderstanding. I'm very well aware of what you are saying, as you should know. But this is not my point, which concerns the sentence "Use of [vnode]->v_mount is unsafe in the presence of a concurrent forced unmount." in the context of the current change. The bulk of the latter is modifications to unionfs_vfsops.c, which contains VFS operations, not vnode ones. There are no vnodes involved there, except for accessing the layers' root ones. And what I'm saying, and what I proved above, is that v_mount on these, again in the context of a VFS operation, cannot be NULL because of a forced unmount (if you disagree, then please show where you think there is a flaw in the reasoning).

Actually the assertion about VFS operations isn't entirely true either (mostly, but not entirely); see the vfs_unbusy() dance we do in unionfs_quotactl().
But saying this makes me realize I actually need to bring back the atomic_load there (albeit the load should be of ump->um_uppermp now).

Otherwise your assertion should be correct, and indeed I doubt the two read-only VOPs in question would have these locking issues in practice.
I think the source of the misunderstanding here is that I just didn't word the commit message very well. Really what I meant there is what I said in a previous comment here: If we need to cache the mount objects anyway, it's better to use them everywhere to avoid the pitfalls of potentially accessing ->v_mount when it's unsafe to do so.
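For illustration, the unsafe window described above looks roughly like this (illustrative only; the error handling and the reason for dropping the lock depend on the specific VOP):

    VOP_UNLOCK(vp);
    /* ... work that requires the vnode lock to be dropped ... */
    vn_lock(vp, LK_EXCLUSIVE | LK_RETRY);
    if (VN_IS_DOOMED(vp)) {
            /*
             * A concurrent (possibly recursive, forced) unmount may have
             * doomed vp while it was unlocked; vp->v_mount can no longer
             * be trusted here.
             */
            return (ENOENT);
    }
    mp = vp->v_mount;   /* safe only while vp is locked and not doomed */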

Feb 13 2024, 1:45 PM
jah updated the diff for D43815: unionfs: cache upper/lower mount objects.

Restore volatile load from ump in quotactl()

Feb 13 2024, 12:30 PM
jah added a comment to D43815: unionfs: cache upper/lower mount objects.
In D43815#1000340, @jah wrote:

I don't think it can. Given the first point above, there can't be any unmount of some layer (even forced) until the unionfs mount on top is unmounted. As the layers' root vnodes are vrefed(), they can't become doomed (since unmount of their own FS is prevented), and consequently their v_mount is never modified (barring the ZFS rollback case). This is independent of holding (or not) any vnode lock.

Which doesn't say that there aren't any problems of the sort that you're reporting in unionfs; it's just a different matter.

That's not true; vref() does nothing to prevent a forced unmount from dooming the vnode, only holding its lock does this. As such, if the lock needs to be transiently dropped for some reason and the timing is sufficiently unfortunate, the concurrent recursive forced unmount can first unmount unionfs (dooming the unionfs vnode) and then the base FS (dooming the lower/upper vnode). The held references prevent the vnodes from being recycled (but not doomed), but even this isn't foolproof: for example, in the course of being doomed, the unionfs vnode will drop its references on the lower/upper vnodes, at which point they may become unreferenced unless additional action is taken. Whatever caller invoked the unionfs VOP will of course still hold a reference on the unionfs vnode, but this does not automatically guarantee that references will be held on the underlying vnodes for the duration of the call, due to the aforementioned scenario.

There is a misunderstanding. I'm very well aware of what you are saying, as you should know. But this is not my point, which concerns the sentence "Use of [vnode]->v_mount is unsafe in the presence of a concurrent forced unmount." in the context of the current change. The bulk of the latter is modifications to unionfs_vfsops.c, which contains VFS operations, not vnode ones. There are no vnodes involved there, except for accessing the layers' root ones. And what I'm saying, and what I proved above, is that v_mount on these, again in the context of a VFS operation, cannot be NULL because of a forced unmount (if you disagree, then please show where you think there is a flaw in the reasoning).

Feb 13 2024, 12:26 PM

Feb 12 2024

jah added a comment to D43815: unionfs: cache upper/lower mount objects.
In D43815#1000214, @jah wrote:

If one of the layers is forcibly unmounted, there isn't much point in continuing operation. But, given the first point above, that cannot even happen. So really the only case in which v_mount can become NULL is the ZFS rollback one (the layers' root vnodes can't be recycled since they are vrefed). Thinking more about it, always testing whether these are alive and well is going to be inevitable going forward. But I'm fine with this change as it is for now.

This can indeed happen, despite the first point above. If a unionfs VOP ever temporarily drops its lock, another thread is free to stage a recursive forced unmount of both the unionfs and the base FS during this window. Moreover, it's easy for this to happen without unionfs even being aware of it: because unionfs shares its lock with the base FS, if a base FS VOP (forwarded by a unionfs VOP) needs to drop the lock temporarily (this is common e.g. for FFS operations that need to update metadata), the unionfs vnode may effectively be unlocked during that time. That last point is a particularly dangerous one; I have another pending set of changes to deal with the problems that can arise in that situation.

This is why I say it's easy to make a mistake in accessing [base vp]->v_mount at an unsafe time.

I don't think it can. Given the first point above, there can't be any unmount of some layer (even forced) until the unionfs mount on top is unmounted. As the layers' root vnodes are vrefed(), they can't become doomed (since unmount of their own FS is prevented), and consequently their v_mount is never modified (barring the ZFS rollback case). This is independent of holding (or not) any vnode lock.

Which doesn't say that there aren't any problems of the sort that you're reporting in unionfs; it's just a different matter.

Feb 12 2024, 6:34 PM
jah added a comment to D43815: unionfs: cache upper/lower mount objects.
In D43815#999937, @jah wrote:

Well, as it is today unmounting of the base FS is either recursive or it doesn't happen at all (i.e. the unmount attempt is rejected immediately because of the unionfs stacked atop the mount in question). I don't think it can work any other way, although I could see the default settings around recursive unmounts changing (maybe vfs.recursive_forced_unmount being enabled by default, or recursive unmounts even being allowed for the non-forced case as well). I don't have plans to change any of those defaults though.

I was asking because I was fearing that the unmount could proceed in the non-recursive case, but indeed it's impossible (handled by the !TAILQ_EMPTY(&mp->mnt_uppers) test in dounmount()). For the default value itself, for now I think it is fine as it is (prevents unwanted foot-shooting).

For the changes here, you're right that the first reason isn't an issue as long as the unionfs vnode is locked when the [base_vp]->v_mount access happens, as the unionfs unmount can't complete while the lock is held, which in turn prevents the base FS from being unmounted. But it's also easy to make a mistake there, e.g. in cases where the unionfs lock is temporarily dropped, so if the base mount objects need to be cached anyway because of the ZFS case, then it makes sense to just use them everywhere.

If one of the layers is forcibly unmounted, there isn't much point in continuing operation. But, given the first point above, that cannot even happen. So really the only case in which v_mount can become NULL is the ZFS rollback one (the layers' root vnodes can't be recycled since they are vrefed). Thinking more about it, always testing whether these are alive and well is going to be inevitable going forward. But I'm fine with this change as it is for now.
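For reference, the shape of the caching being reviewed in D43815, simplified (the um_uppermp name appears earlier in the thread; the rest is an approximation, not the committed diff):

    struct unionfs_mount {
            struct vnode    *um_lowervp;    /* vrefed root vnode of the lower layer */
            struct vnode    *um_uppervp;    /* vrefed root vnode of the upper layer */
            struct mount    *um_lowermp;    /* cached lower mount, set at mount time */
            struct mount    *um_uppermp;    /* cached upper mount, set at mount time */
            /* ... */
    };

    /* At mount time, while the layer root vnodes are still referenced and locked: */
    ump->um_lowermp = ump->um_lowervp->v_mount;
    ump->um_uppermp = ump->um_uppervp->v_mount;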

Feb 12 2024, 4:43 PM
jah added a comment to D43818: unionfs: workaround underlying FS failing to respect cn_namelen.
In D43818#999955, @olce wrote:

OK as a workaround. Hopefully, we'll get OpenZFS fixed soon. If you don't plan to, I may try to submit a patch upstream, since it seems no one has proposed any change in https://github.com/openzfs/zfs/issues/15705.

Feb 12 2024, 12:34 AM
jah added a comment to D40850: VFS lookup: New vn_cross_single_mount() and vn_cross_mounts().

@olce @mjg This change seems to have stalled; what do you want to do about it?

Feb 12 2024, 12:32 AM

Feb 11 2024

jah added a comment to D43815: unionfs: cache upper/lower mount objects.
In D43815#999912, @olce wrote:

I think this goes in the right direction long term also.

Longer term, do you have any thoughts on only supporting recursive unmounting, regardless of whether it is forced or not? This would eliminate the first reason mentioned in the commit message.

Feb 11 2024, 6:22 PM
jah updated the diff for D43815: unionfs: cache upper/lower mount objects.

Style

Feb 11 2024, 7:12 AM