
Allow stacked filesystems to be recursively unmounted
ClosedPublic

Authored by jah on Jul 4 2021, 2:11 AM.

Details

Summary

In certain emergency cases such as media failure or removal, UFS will
initiate a forced unmount in order to prevent dirty buffers from
accumulating against the no-longer-usable filesystem. The presence
of a stacked filesystem such as nullfs or unionfs above the UFS mount
will prevent this forced unmount from succeeding.

This change addresses the situation by allowing stacked filesystems to
be recursively unmounted on a taskqueue thread when the MNT_RECURSE
flag is specified to dounmount(). This call will block until all upper
mounts have been removed unless the caller specifies the MNT_TASKQUEUE
flag to indicate the base filesystem should also be unmounted from the
taskqueue.

To achieve this, the recently-added vfs_pin_from_vp()/vfs_unpin() KPIs
have been combined with the existing 'mnt_uppers' list used by nullfs
and renamed to vfs_register_upper_from_vp()/vfs_unregister_upper().
The format of the mnt_uppers list has also been changed to accommodate
filesystems such as unionfs in which a given mount may be stacked atop
more than one lower mount. Additionally, management of lower FS
reclaim/unlink notifications has been split into a separate list
managed by a separate set of KPIs, as registration of an upper FS no
longer implies interest in these notifications.

Diff Detail

Repository
R10 FreeBSD src repository
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

jah requested review of this revision. Jul 4 2021, 2:11 AM

This is a first draft of the change; I've tested it enough to verify that it works, but I'd like feedback on the basic approach as well as some specific questions I have below.

sys/kern/vfs_subr.c
3926

I didn't see an obvious reason for the use of these marker nodes. Were they intended to keep the lower FS from being unmounted during an in-progress notification by virtue of mnt_uppers being non-empty, or is there some other reason they're needed? If we no longer need them, I should be able to clean this code up quite a bit.

sys/ufs/ffs/ffs_vfsops.c
300

It seems like we should be able to clean the FFS error handling code up quite a bit by just directly issuing dounmount with MNT_FORCE | MNT_RECURSE | MNT_TASKQUEUE from the failing context. That would allow us to get rid of the fsfail_task construct and would also avoid blocking the global taskqueue_thread with a potentially-expensive recursive unmount.

sys/kern/vfs_mount.c
2026

Suppose that this delayed unmount becomes popular enough, and one of our upper mounts also has upper mounts. Wouldn't the system deadlock?

sys/kern/vfs_subr.c
3926

They are used in the current code because iteration is protected by mount interlock, but we obviously have to drop the interlock for the notification call. This allows modification of the tailq, and we need to regain the iteration position after the call.

sys/kern/vfs_mount.c
2026

This is the case where the "base" filesystem is being unmounted, and the calling thread has requested for it to be done synchronously, i.e. MNT_TASKQUEUE is not present in flags. The uppers, and the uppers of uppers etc., will all be unmounted from the taskqueue context with MNT_TASKQUEUE and won't block here. I've tested recursively unmounting ufs->unionfs->nullfs and it does work.

Given a stacked filesystem hierarchy A->B->C, there could also be a case in which both 'A' and 'B' could be requested to synchronously (and recursively) unmount, but those requests would need to happen in different threads, neither of which could be the taskqueue.

Is there a specific scenario you're concerned about?

sys/kern/vfs_subr.c
3926

I think that's the part I don't understand; uppers can't be removed from the queue during this window, they can only be added at the tail. That means they'll be added somewhere after the marker and picked up by the iteration anyway. Is there some other problem I've missed?

sys/kern/vfs_mount.c
2026

I mean, you delegated one unmount to the taskqueue, and now the taskqueue executes it. If this delegated unmount needs to delegate another unmount, it schedules a task and sleeps waiting for the task to finish. But it sleeps in the context of the taskqueue which should execute that other unmount, so the new task is basically never picked up for execution.

sys/kern/vfs_subr.c
3926

What prevents a removal from the list?

sys/kern/vfs_mount.c
2026

The delegated unmount doesn't sleep though. Since it was issued by the taskqueue, it will have MNT_TASKQUEUE in 'flags' and instead will just requeue itself at the end of recursive unmount queue and return EINPROGRESS.

sys/kern/vfs_subr.c
3926

vfs_drain_upper_locked(), which is called by both vfs_unregister_upper() and vfs_unregister_for_notification().
Even before this change, nullfs waited for MNTK_VGONE_UPPER to drain, although it doesn't seem like that would've worked with more than one in-progress notification.

sys/ufs/ffs/ffs_vfsops.c
300

I concur with your suggestion.

--Make forced taskqueue unmount operations resilient to spurious failure

A concurrent unmount attempt from another thread (which may fail) or a
concurrent update mount may induce a spurious failure.  Allow the
taskqueue to requeue the unmount request in these cases.

--Fix refcounting for the case when another thread requests deferred unmount of a base filesystem.

--Add locking annotations for new fields

Simplify by requiring taskqueue unmounts to be recursive

@jah - are you ready to have Peter Holm test these changes? If so, I will enlist his help.

kib added inline comments.
sys/kern/vfs_mount.c
1957

There should be a blank line before a multi-line comment, and after the end of the code it describes.

This revision is now accepted and ready to land. Jul 15 2021, 3:58 PM

@jah - are you ready to have Peter Holm test these changes? If so, I will enlist his help.

Yes, in fact I've been using the unionfs/nullfs stress2 tests as the basis for crude manual testing of this code, but it would be nice to have something automated.
It'd be nice to have an automated test that exercises recursive unmounts concurrently with both other recursive and non-recursive unmounts within the same stacked filesystem hierarchy.

I've added Peter here. I'd gladly help write any needed tests, but I could use some guidance on what we might already have that could be a starting point.

sys/kern/vfs_subr.c
3926

@kib can the code for marker nodes safely be removed?

sys/ufs/ffs/ffs_vfsops.c
300

Is there a good way to test the fsfail case? I've shied away from doing the cleanup mentioned here without a good way to test it.

In D31016#702230, @jah wrote:

@jah - are you ready to have Peter Holm test these changes? If so, I will enlist his help.

Yes, in fact I've been using the unionfs/nullfs stress2 tests as the basis for crude manual testing of this code, but it would be nice to have something automated.
It'd be nice to have an automated test that exercises recursive unmounts concurrently with both other recursive and non-recursive unmounts within the same stacked filesystem hierarchy.

I've added Peter here. I'd gladly help write any needed tests, but I could use some guidance on what we might already have that could be a starting point.

I'm running tests with D31016.92185.diff right now.

Suggestion on how to run forcible unmount test on UFS.

sys/ufs/ffs/ffs_vfsops.c
300

Is there a good way to test the fsfail case? I've shied away from doing the cleanup mentioned here without a good way to test it.

Peter Holm has some good tests that start up activity on a UFS filesystem and then kill the underlying disk device.
See tools/test/stress2/misc/gnop8.sh as a prototype for how to run the test.

sys/kern/vfs_subr.c
3926

Yes it can be removed now, I agree.

Fix whitespace, reap MNTK_MARKER, remove fsfail_task and add associated test

This revision now requires review to proceed. Jul 19 2021, 2:19 AM

Suggestion on how to run forcible unmount test on UFS.

Thanks! I've added the cleanup of fsfail_task to this review, along with the 'gnop11.sh' test I cobbled together to exercise it.

I am fine with the patch in review, but think that the actual commits should be split. In particular:

  • new test
  • removal of the task from UFS code
  • removal of the marker

need to be separate commits, at least.

This revision is now accepted and ready to land. Jul 19 2021, 11:21 AM
sys/sys/mount.h
261

All the other struct mount fields have a "mnt_" prefix, these new ones ought to have that prefix as well.

440

You should update this comment with the names of any new flags that are now valid to use with unmount(). It looks like MNT_BYFSID can already be used with unmount, so that flag should be listed here too.

456

The MNT_* flags are part of the syscall interface, and the name of an interface flag ought to describe what it does rather than how it does it. This is really about making the unmount asynchronous, and the fact that a taskqueue is the kernel abstraction that is used to make the operation asynchronous is just an internal detail. I would suggest "MNT_ASYNC" as the name, but I see there's already a flag with that name. However, MNT_ASYNC is currently only used for mount and not for unmount, so there would be no ambiguity to use the existing MNT_ASYNC flag in this new context.

Or are these new flags not intended to be part of the syscall interface? I don't see any change here to add support in the umount command for using these new flags.

This revision now requires review to proceed. Jul 19 2021, 5:52 PM

Split into multiple commits

Prefix new fields with mnt_, MNT_TASKQUEUE -> MNT_DEFERRED

sys/sys/mount.h
261

Done.

456

Good point, I've renamed MNT_TASKQUEUE to MNT_DEFERRED. That said, these new flags aren't intended to be exposed through the syscall interface. I wouldn't object to exposing MNT_RECURSE through something like a '-r' option to umount(8), but I see no reason to ever expose MNT_DEFERRED to userspace.

This reminds me of something that occurred to me while I was writing the code though: these flags are effectively prevented from being exposed to userspace because they're outside the 32-bit range of unmount_args->flags.
It seems a bit fragile to rely on that though...should we have something like an MNT_KERNEL mask for flags that are only allowed to be passed from kernel-initiated operations, and return EINVAL for any userspace request that includes any of them?

This reminds me of something that occurred to me while I was writing the code though: these flags are effectively prevented from being exposed to userspace because they're outside the 32-bit range of unmount_args->flags.
It seems a bit fragile to rely on that though...should we have something like an MNT_KERNEL mask for flags that are only allowed to be passed from kernel-initiated operations, and return EINVAL for any userspace request that includes any of them?

In theory MNT_VISFLAGMASK are externally visible and MNT_CMDFLAGS are internal only. The MNT_VISFLAGMASK flags are the only ones returned in a statfs(2) call but are not checked on system calls as far as I can tell.
Adding some clarity / enforcement would likely be a good idea, but not as part of these changes.

I am fine with the structure of this patch. I do ask that you await passing Peter Holm's tests before committing.

This revision is now accepted and ready to land. Jul 19 2021, 11:42 PM

I am fine with the structure of this patch. I do ask that you await passing Peter Holm's tests before committing.

I have started a full test with D31016.92436.diff.

In D31016#703154, @pho wrote:

I am fine with the structure of this patch. I do ask that you await passing Peter Holm's tests before committing.

I have started a full test with D31016.92436.diff.

All done. No problems seen.

20210722 19:00:47 all.sh done, elapsed 2 day(s), 04:40.10
In D31016#704447, @pho wrote:

All done. No problems seen.

20210722 19:00:47 all.sh done, elapsed 2 day(s), 04:40.10

And no nullfs issues observed? I am confused.

In D31016#704507, @kib wrote:
In D31016#704447, @pho wrote:

All done. No problems seen.

20210722 19:00:47 all.sh done, elapsed 2 day(s), 04:40.10

And no nullfs issues observed? I am confused.

Yes, sorry. What I should have stated was that all of the stress2 tests were run, excluding known problem tests.
Specifically the force4.sh test, a nullfs / mdconfig -o force test, was excluded. This test failed before this patch and also fails with this patch.
I have lately tested a separate patch (by you) combined with D31016.92436.diff that fixes the issue seen with the force4.sh test. This combo is still being tested.

In D31016#704531, @pho wrote:
In D31016#704507, @kib wrote:
In D31016#704447, @pho wrote:

All done. No problems seen.

20210722 19:00:47 all.sh done, elapsed 2 day(s), 04:40.10

And no nullfs issues observed? I am confused.

Yes, sorry. What I should have stated was that all of the stress2 tests were run, excluding known problem tests.
Specifically the force4.sh test, a nullfs / mdconfig -o force test, was excluded. This test failed before this patch and also fails with this patch.
I have lately tested a separate patch (by you) combined with D31016.92436.diff that fixes the issue seen with the force4.sh test. This combo is still being tested.

Peter, thank you for testing this out!

Do you think there's any value in committing gnop11.sh as-is? I know you said the I/O completed too quickly on your faster machine; would you prefer a shorter sleep, say 0.1s? Or would you rather have a different test altogether?

Peter, thank you for testing this out!

Do you think there's any value in committing gnop11.sh as-is? I know you said the I/O completed too quickly on your faster machine; would you prefer a shorter sleep, say 0.1s? Or would you rather have a different test altogether?

You should commit gnop11.sh as is. The stacked use of nullfs is definitely in the spirit of stress2.