Details

Reviewers

kib
markj
mjg
mckusick
chs
pho

Commits

rGc44685732899: Add stress2 test to exercise FFS forcible unmount with stacked nullfs
rG2bc16e8aaf75: VFS: remove MNTK_MARKER
rG211ec9b7d6ec: FFS: remove ffs_fsfail_task
rGc746ed724d01: Allow stacked filesystems to be recursively unmounted

Summary

In certain emergency cases such as media failure or removal, UFS will
initiate a forced unmount in order to prevent dirty buffers from
accumulating against the no-longer-usable filesystem. The presence
of a stacked filesystem such as nullfs or unionfs above the UFS mount
will prevent this forced unmount from succeeding.

This change addreses the situation by allowing stacked filesystems to
be recursively unmounted on a taskqueue thread when the MNT_RECURSE
flag is specified to dounmount(). This call will block until all upper
mounts have been removed unless the caller specifies the MNT_TASKQUEUE
flag to indicate the base filesystem should also be unmounted from the
taskqueue.

To achieve this, the recently-added vfs_pin_from_vp()/vfs_unpin() KPIs
have been combined with the existing 'mnt_uppers' list used by nullfs
and renamed to vfs_register_upper_from_vp()/vfs_unregister_upper().
The format of the mnt_uppers list has also been changed to accommodate
filesystems such as unionfs in which a given mount may be stacked atop
more than one lower mount. Additionally, management of lower FS
reclaim/unlink notifications has been split into a separate list
managed by a separate set of KPIs, as registration of an upper FS no
longer implies interest in these notifications.

Diff Detail

Repository

rS FreeBSD src repository - subversion

Lint

Lint Passed

Unit

No Test Coverage

Build Status

Buildable 40464
Build 37353: arc lint + arc unit

Event Timeline

jah created this revision.Jul 4 2021, 2:11 AM

Herald added a subscriber: imp. · View Herald TranscriptJul 4 2021, 2:11 AM

jah requested review of this revision.Jul 4 2021, 2:11 AM

Harbormaster completed remote builds in B40258: Diff 91716.Jul 4 2021, 2:11 AM

This is a first draft of the change; I've tested it enough to verify that it works, but I'd like feedback on the basic approach as well as some specific questions I have below.

sys/kern/vfs_subr.c
3927	I didn't see an obvious reason for the use of these marker nodes. Were they intended to keep the lower FS from being unmounted during an in-progress notification by virtue of mnt_uppers being non-empty, or is there some other reason they're needed? If we no longer need them, I should be able to clean this code up quite a bit.
sys/ufs/ffs/ffs_vfsops.c
300	It seems like we should be able to clean the FFS error handling code up quite a bit by just directly issuing dounmount with MNT_FORCE \| MNT_RECURSE \| MNT_TASKQUEUE from the failing context. That would allow us to get rid of the fsfail_task construct and would also avoid blocking the global taskqueue_thread with a potentially-expensive recursive unmount.

jah added reviewers: kib, markj, mjg, mckusick, chs.Jul 4 2021, 2:21 AM

kib added inline comments.Jul 10 2021, 3:19 PM

sys/kern/vfs_mount.c
2028	Suppose that this delayed unmount becomes popular enough, and one of our upper mounts also has upper mounts. Wouldn't the system deadlock?
sys/kern/vfs_subr.c
3927	They are used in the current code because iteration is protected by mount interlock, but we obviously have to drop the interlock for the notification call. This allows modification of the tailq, and we need to regain the iteration position after the call.

jah added inline comments.Jul 10 2021, 4:37 PM

sys/kern/vfs_mount.c
2028	This is the case where the "base" filesystem is being unmounted, and the calling thread has requested for it to be done synchronously, i.e. MNT_TASKQUEUE is not present in flags. The uppers, and the uppers of uppers etc., will all be unmounted from the taskqueue context with MNT_TASKQUEUE and won't block here. I've tested recursively unmounting ufs->unionfs->nullfs and it does work. Given a stacked filesystem hierarchy A->B->C, there could also be a case in which both 'A' and 'B' could be requested to synchronously (and recursively) unmount, but those requests would need to happen in different threads, neither of which could be the taskqueue. Is there a specific scenario you're concerned about?
sys/kern/vfs_subr.c
3927	I think that's the part I don't understand; uppers can't be removed from the queue during this window, they can only be added at the tail. That means they'll be added somewhere after the marker and picked up by the iteration anyway. Is there some other problem I've missed?

kib added inline comments.Jul 10 2021, 5:07 PM

sys/kern/vfs_mount.c
2028	I mean, you delegated one unmount to the taskqueue, and now taskqueue executes it. If this delegated unmount needs to delegate another unmount, it schedules a task and sleeps waiting for the task to finish. But it sleeps in the context of the taskqueue which should execute that another unmount, so the new task basically never picked up for execution.
sys/kern/vfs_subr.c
3927	What prevents a removal from the list?

jah added inline comments.Jul 10 2021, 5:19 PM

sys/kern/vfs_mount.c
2028	The delegated unmount doesn't sleep though. Since it was issued by the taskqueue, it will have MNT_TASKQUEUE in 'flags' and instead will just requeue itself at the end of recursive unmount queue and return EINPROGRESS.
sys/kern/vfs_subr.c
3927	vfs_drain_upper_locked(), which is called by both vfs_unregister_upper() and vfs_unregister_for_notification(). Even before this change, nullfs waited for MNTK_VGONE_UPPER to drain, although it doesn't seem like that would've worked with more than one in-progress notification.

mckusick added inline comments.Jul 13 2021, 12:22 AM

sys/ufs/ffs/ffs_vfsops.c
300	I concur with your suggestion.

--Make forced taskqueue unmount operations resilient to spurious failure

A concurrent unmount attempt from another thread (which may fail) or a
concurrent update mount may induce a spurious failure.  Allow the
tasqueue to requeue the unmount request in these cases.

--Fix refcounting for the case when amother thread requests deferred

unmount of a base filesystem.

--Add locking annotations for new fields

Harbormaster completed remote builds in B40464: Diff 92170.Jul 14 2021, 5:45 AM

Simplify by requiring taskqueue unmounts to be recursive

Harbormaster completed remote builds in B40470: Diff 92185.Jul 14 2021, 1:08 PM

@jah - are you ready to have Peter Holm test these changes? If so, I will enlist his help.

kib accepted this revision.Jul 15 2021, 3:58 PM

kib added inline comments.

sys/kern/vfs_mount.c
1959	There should be a blank line before multi-line comment, and after the end of the code which is described by it.

This revision is now accepted and ready to land.Jul 15 2021, 3:58 PM

jah added a reviewer: pho.Jul 15 2021, 11:38 PM

In D31016#701737, @mckusick wrote:

@jah - are you ready to have Peter Holm test these changes? If so, I will enlist his help.

Yes, in fact I've been using the unionfs/nullfs stress2 tests as the basis for crude manual testing of this code, but it would be nice to have something automated.
It'd be nice to have an automated that exercises recursive unmounts concurrently with both other recursive and non-recursive unmounts within the same stacked filesystem hierarchy.

I've added Peter here. I gladly help write any needed tests, but I could use some guidance on what we might already have that could be a starting point.

jah added inline comments.Jul 15 2021, 11:49 PM

sys/kern/vfs_subr.c
3927	@kib can the code for marker nodes safely be removed?
sys/ufs/ffs/ffs_vfsops.c
300	Is there a good way to test the fsfail case? I've shied away from doing the cleanup mentioned here without a good way to test it.

In D31016#702230, @jah wrote:

In D31016#701737, @mckusick wrote:

@jah - are you ready to have Peter Holm test these changes? If so, I will enlist his help.

Yes, in fact I've been using the unionfs/nullfs stress2 tests as the basis for crude manual testing of this code, but it would be nice to have something automated.
It'd be nice to have an automated that exercises recursive unmounts concurrently with both other recursive and non-recursive unmounts within the same stacked filesystem hierarchy.

I've added Peter here. I gladly help write any needed tests, but I could use some guidance on what we might already have that could be a starting point.

I'm running tests with D31016.92185.diff right now.

Suggestion on how to run forcible unmount test on UFS.

sys/ufs/ffs/ffs_vfsops.c
300	Is there a good way to test the fsfail case? I've shied away from doing the cleanup mentioned here without a good way to test it. Peter Holm has some good tests that start up activity on a UFS filesystem and then kill the underlying disk device. See tools/test/stress2/misc/gnop8.sh as a prototype for how to run the test.

kib added inline comments.Jul 17 2021, 4:52 PM

sys/kern/vfs_subr.c
3927	Yes it can be removed now, I agree.

Fix whitespace, reap MNTK_MARKER, remove fsfail_task and add associated test

This revision now requires review to proceed.Jul 19 2021, 2:19 AM

Harbormaster completed remote builds in B40520: Diff 92403.Jul 19 2021, 2:19 AM

In D31016#702390, @mckusick wrote:

Suggestion on how to run forcible unmount test on UFS.

Thanks! I've added the cleanup of fsfail_task to this review, along with the 'gnop11.sh' test I cobbled together to exercise it.

I still see this: https://people.freebsd.org/~pho/stress/log/log0139.txt

I am fine with the patch in review, but think that the actual commits should be split. In particular:

new test
removal of the task from UFS code
removal of the marker

needs to be a separate commits, at least.

This revision is now accepted and ready to land.Jul 19 2021, 11:21 AM

chs added inline comments.Jul 19 2021, 5:20 PM

sys/sys/mount.h
262	All the other struct mount fields have a "mnt_" prefix, these new ones ought to have that prefix as well.
440	You should update this comment with the names of any new flags that are now valid to use with unmount(). It looks like MNT_BYFSID can already be used with unmount, so that flag should be listed here too.
457	The MNT_* flags are part of the syscall interface, and the name of an interface flag ought to describe what it does rather than how it does it. This is really about making the unmount asynchronous, and the fact that a taskqueue is the kernel abstraction that is used to make the operation asynchronous is just an internal detail. I would suggest "MNT_ASYNC" as the name, but I see there's already a flag with that name. However, MNT_ASYNC is currently only used for mount and not for unmount, so there would be no ambiguity to use the existing MNT_ASYNC flag in this new context. Or are these new flags not intended to be part of the syscall interface? I don't see any change here to add support in the umount command for using these new flags.

Rebase

This revision now requires review to proceed.Jul 19 2021, 5:52 PM

Harbormaster completed remote builds in B40532: Diff 92433.Jul 19 2021, 5:52 PM

Split into multiple commits

Harbormaster completed remote builds in B40533: Diff 92434.Jul 19 2021, 5:59 PM

Prefix new fields with mnt_, MNT_TASKQUEUE -> MNT_DEFERRED

Harbormaster completed remote builds in B40534: Diff 92436.Jul 19 2021, 7:11 PM

jah added inline comments.Jul 19 2021, 7:24 PM

sys/sys/mount.h
262	Done.
457	Good point, I've renamed MNT_TASKQUEUE to MNT_DEFERRED. That said, these new flags aren't intended to be exposed through the sysctl interface. I wouldn't object to exposing MNT_RECURSE through something like a '-r' option to umount(8), but I see no reason to ever expose MNT_DEFERRED to userspace. This reminds me of something that occurred to me while I was writing the code though: these flags are effectively prevented from being exposed to userspace because they're outside the 32-bit range of unmount_args->flags. It seems a bit fragile to rely on that though...should we have something like an MNT_KERNEL mask for flags that are only allowed to be passed from kernel-initiated operations, and return EINVAL for any userspace request that includes any of them?

This reminds me of something that occurred to me while I was writing the code though: these flags are effectively prevented from being exposed to userspace because they're outside the 32-bit range of unmount_args->flags.
It seems a bit fragile to rely on that though...should we have something like an MNT_KERNEL mask for flags that are only allowed to be passed from kernel-initiated operations, and return EINVAL for any userspace request that includes any of them?

In theory MNT_VISFLAGMASK are externally visible and MNT_CMDFLAGS are internal only. The MNT_VISFLAGMASK flags are the only ones returned in a statfs(2) call but are not checked on system calls as far as I can tell.
Adding some clarity / enforcement would likely be a good idea, but not as part of these changes.

I am fine with the structure of this patch. I do ask that you await passing Peter Holm's tests before committing.

This revision is now accepted and ready to land.Jul 19 2021, 11:42 PM

In D31016#703115, @mckusick wrote:

I am fine with the structure of this patch. I do ask that you await passing Peter Holm's tests before committing.

I have started a full test with D31016.92436.diff.

In D31016#703154, @pho wrote:

In D31016#703115, @mckusick wrote:

I am fine with the structure of this patch. I do ask that you await passing Peter Holm's tests before committing.

I have started a full test with D31016.92436.diff.

All done. No problems seen.

20210722 19:00:47 all.sh done, elapsed 2 day(s), 04:40.10

In D31016#704447, @pho wrote:
All done. No problems seen.
20210722 19:00:47 all.sh done, elapsed 2 day(s), 04:40.10

And no nullfs issues observed? I am confused.

In D31016#704507, @kib wrote:
In D31016#704447, @pho wrote:
All done. No problems seen.
20210722 19:00:47 all.sh done, elapsed 2 day(s), 04:40.10
And no nullfs issues observed? I am confused.

Yes, sorry. What I should have stated was that all of the stress2 tests were run, excluding known problem tests.
Specifically the force4.sh test, a nullfs / mdconfig -o force test, was excluded. This test failed before this patch and also fails with this patch.
I have lately tested a separate patch (by you) combined with D31016.92436.diff that fixes the issue seen with the force4.sh test. This combo is still being tested.

In D31016#704531, @pho wrote:
In D31016#704507, @kib wrote:
In D31016#704447, @pho wrote:
All done. No problems seen.
20210722 19:00:47 all.sh done, elapsed 2 day(s), 04:40.10
And no nullfs issues observed? I am confused.
Yes, sorry. What I should have stated was that all of the stress2 tests were run, excluding known problem tests.
Specifically the force4.sh test, a nullfs / mdconfig -o force test, was excluded. This test failed before this patch and also fails with this patch.
I have lately tested a separate patch (by you) combined with D31016.92436.diff that fixes the issue seen with the force4.sh test. This combo is still being tested.

Peter, thank you for testing this out!

Do you think there's any value in committing gnop11.sh as-is? I know you said the I/O completed too quickly on your faster machine; would you prefer a shorter sleep, say 0.1s? Or would you rather have a different test altogether?

Peter, thank you for testing this out!

Do you think there's any value in committing gnop11.sh as-is? I know you said the I/O completed too quickly on your faster machine; would you prefer a shorter sleep, say 0.1s? Or would you rather have a different test altogether?

You should commit gnop11.sh as is. The stacked use of nullfs is definetly in the spirit of stress2.

Closed by commit rGc746ed724d01: Allow stacked filesystems to be recursively unmounted (authored by jah). · Explain WhyJul 24 2021, 8:39 PM

This revision was automatically updated to reflect the committed changes.

jah added a commit: rGc746ed724d01: Allow stacked filesystems to be recursively unmounted.

jah added a commit: rG2bc16e8aaf75: VFS: remove MNTK_MARKER.

jah added a commit: rG211ec9b7d6ec: FFS: remove ffs_fsfail_task.

jah added a commit: rGc44685732899: Add stress2 test to exercise FFS forcible unmount with stacked nullfs.

Allow stacked filesystems to be recursively unmounted
ClosedPublic
Actions

Details

Diff Detail

Event Timeline

Revision Contents
Changeset List

Diff 92170

sys/fs/nullfs/null.h

sys/fs/nullfs/null_vfsops.c

sys/fs/unionfs/union.h

sys/fs/unionfs/union_vfsops.c

sys/kern/vfs_mount.c

sys/kern/vfs_subr.c

sys/sys/mount.h

sys/ufs/ffs/ffs_vfsops.c

Allow stacked filesystems to be recursively unmountedClosedPublicActions

Details

Diff Detail

Event Timeline

Revision ContentsChangeset List

Diff 92170

sys/fs/nullfs/null.h

sys/fs/nullfs/null_vfsops.c

sys/fs/unionfs/union.h

sys/fs/unionfs/union_vfsops.c

sys/kern/vfs_mount.c

sys/kern/vfs_subr.c

sys/sys/mount.h

sys/ufs/ffs/ffs_vfsops.c

Allow stacked filesystems to be recursively unmounted
ClosedPublic
Actions

Revision Contents
Changeset List