Page MenuHomeFreeBSD

nullfs: busy the lower mount before calling VFS_* operations on it

Authored by jah on May 14 2021, 5:17 PM.



In the case of forced unmount of the lower filesystem, the lower
mount may be destroyed while nullfs is still mounted above it.
Further, since struct mount is type-stable, the lower mount may
also be recycled for a different FS while the nullfs instance
still holds a pointer to it.

Instead, always always access the lower mount through the v_mount
field of the lower root vnode. Since nullfs holds a reference to that
vnode, it will not be recycled to a different mount and its v_mount
field will always be either the valid lower mount or NULL.
After obtaining the lower mount, attempt to busy it and handle failure

Diff Detail

rS FreeBSD src repository - subversion
Lint OK
No Unit Test Coverage
Build Status
Buildable 39194
Build 36083: arc lint + arc unit

Event Timeline

jah requested review of this revision.May 14 2021, 5:17 PM

I do not think this is useful, because the lowerrootvp vnode is locked, so unmount cannot proceed past vflush().

In fact, if not MBF_NOWAIT used in nullfs_mount_trybusy(), then this would be a LoR. But I am not sure that reporting transient failures from MBF_NOWAIT is the right choice. Idea with the non-forced unmount is that it should be invisible for the userspace. E.g. automount does periodic probes of its filesystems with non-forced unmount, and userspace should not see transient errors due to probing.


Again, is it needed?

Caller of VFS_VGET() must ensure that mp is live for the duration of the call. Then, since nullfs is registered in mnt_uppers of lowermp, it should prevent lowermp' unmount from even starting.


Ok. For the other VFS operations besides mount, would it be preferable to drop MBF_NOWAIT and allow vfs_busy() to block?
I avoided that out of concern for possible deadlock, and it didn't seem necessary given that failure would only happen with a pending concurrent unmount. But I hadn't considered something like the automount case.


Registration in mnt_uppers won't happen in the nocache case.


Specifically the concern I had regarding deadlock would be if there could ever be a case in which the lower mount would be busied on the call here, then an umount would happen on another thread, followed by our attempt to busy the mount. So basically the same thing that led to unbusying requirements around quota file I/O in VFS_QUOTACTL().

The same concern would apply both here and in the corresponding unionfs change. But for both nullfs and unionfs, would it instead be better to simply prevent even forced unmount of any lower FS while the upper is mounted? mnt_uppers wouldn't work as-is for unionfs, and it seems to imply some extra functionality anyway, but perhaps something like a new mnt_kern_flag ?


(Or rather, a mnt_upper_count or mnt_hold_count field in struct mount, since a mount can have multiple upper mounts)


So why not register nullfs upper mounts unconditionally? Of course, VFS_NOTIFY_UPPER_XXX callbacks would do nothing.

That said, I do not think that any busying is needed for nullfs_mount, because liveness of the lowervp provides the strongest guarantee.


I agree, busying shouldn't be needed for nullfs_mount(). I'm more concerned with the other vfs_* entrypoints, and also the corresponding entrypoints in unionfs.

I'm hesitant to use mnt_uppers to keep the lower mount from going away for two reasons:

  1. The mnt_uppers list seems to be intended as a notification mechanism for managing the nullfs vnode cache, rather than a way to keep the lower mount pinned. I'd rather avoid confusing those 2 goals if possible.
  1. The same approach won't work for unionfs, as it assumes no more than 1 lower mount with the list linkage stored in upper->mnt_upper_link.

I added the check for TAILQ_EMPTY(&mp->mnt_uppers) from the moment when the list was introduced. AFAIR, the intent was to prevent panics that plagued the portbuild cluster when people unmounted lower forgetting about nullfs mounts.

I believe it is simpler and more fit the nullfs code. For unionfs, if this approach is not adequate, lets develop something else.

Busy is tricky because

  1. taking two busy refs from the same thread is deadlock-prone, as you noted already. In particular, namei() must not be used while owning the ref.
  2. busy ref is before all vnode locks in the corresponding mp

Why not use my earlier suggestion to add a counter to struct mount?

The changes here and in D30152 would be replaced with a much simpler set of changes that would:

  1. Add a mnt_pinned_count field to struct mount
  2. Stacked filesystems would increment this field for their lower filesystem(s) in their .vfs_mount callback and decrement it in .vfs_unmount, while holding the lower ilock. unionfs would do this for both "lower" filesystems.
  3. The TAILQ_EMPTY(&mp->mnt_uppers) check in dounmount would be replaced with a check of "mnt_pinned_count != 0"

This would work for both nullfs and unionfs, and wouldn't cause any unnecessary VFS_NOTIFY_UPPER_* callbacks in the nocache case for nullfs.

What it wouldn't do is allow unionfs to participate in the VFS_NOTIFY_UPPER_ * scheme, but I think unionfs has bigger issues to be fixed before worrying about that.


But would the increment of mnt_pinned_count blocked if unmount is already started? If yes, it is yet another busy count, if not, what prevents pinned to go from 0 to 1 after unmount drained it?

Intuitively, we have a lot of draining counters already, busy is drained before unmount starts, and mnt_ref after unmount but before struct mount can be reused, and write_count is drained by some VFS_UNMOUNT implementations after busy but before they start to tramp down the unwritten changes. So, again intuitively, it seems excessive to add yet another counter, since it would duplicate some existing one. [Let's put unionfs aside for a moment]

BTW, for purpose of returning EBUSY on unmount, we do not need to take the mount interlock. We also check mnt_uppers unlocked first.


I would not want the increment to block. I would want it to fail if unmount is already started, that is if lowermp->mnt_kern_flag & MNTK_UNMOUNT != 0, also checked within the same ilock section as the increment of mnt_pinned_count. Since the increment only happens on mount of the upper filesystem, the mount attempt shouldn't be allowed if there is already a pending unmount request, even if that request doesn't ultimately succeed.

In dounmount(), mnt_uppers is checked while holding the ilock, just before setting MNTK_UNMOUNT. dounmount_cleanup() drops the ilock if we decide to return EBUSY.


Ok, lets go with this direction.