
handle disk I/O errors in FFS with softdep enabled.
ClosedPublic

Authored by chs on Mar 16 2020, 6:19 PM.

Details

Summary

This patch enables a filesystem to do a forcible unmount when the
underlying media fails or generally goes away, for example when a
USB flash memory card hosting a UFS filesystem is unplugged.

The basic strategy for handling disk I/O errors when softdep is enabled
is to stop writing to the disk of the affected file system, mostly pretend
that all future writes by the file system to that disk actually succeed,
and then trigger an asynchronous forced unmount of the affected file system.

There are two cases for disk I/O errors:

  • ENXIO, which means that this disk is gone and the lower layers of the storage stack already guarantee that no future I/O to this disk will succeed.
  • EIO (or most other errors), which means that this particular I/O request has failed but subsequent I/O requests to this disk might still succeed.

For ENXIO, we can just clear the error and continue, because we know that
the file system cannot affect the on-disk state after we see this error.
For EIO or other errors, we arrange for the geom_vfs layer to reject
all future I/O requests with ENXIO, just as is done when the geom_vfs is
orphaned, and then the file system code can again just clear the error and
continue.
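The two-case policy above can be sketched as a small decision function. This is a userland illustration only: the enum and the function name `softdep_classify_error()` are invented for exposition, not names from the actual patch.

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical sketch of the two-case error policy. The real logic
 * lives in the softdep and geom_vfs code; these names are invented.
 */
enum io_error_action {
	IOERR_CLEAR_AND_CONTINUE,	/* disk already gone; just clear the error */
	IOERR_REJECT_FUTURE_IO		/* have geom_vfs return ENXIO from now on */
};

static enum io_error_action
softdep_classify_error(int error)
{
	if (error == ENXIO) {
		/*
		 * The lower layers of the storage stack already
		 * guarantee that no future I/O to this disk can
		 * succeed, so the file system cannot affect the
		 * on-disk state after this point.
		 */
		return (IOERR_CLEAR_AND_CONTINUE);
	}
	/*
	 * EIO (or any other error) only says this one request failed;
	 * later requests might still reach the disk, so we must start
	 * rejecting them ourselves, as if the geom_vfs were orphaned.
	 */
	return (IOERR_REJECT_FUTURE_IO);
}
```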

This new treatment of I/O errors is needed for writes of any buffer that is
involved in a dependency (most of which are explicit via a structure attached
to bp->b_dep, but some of which are implicit), and also for any read that is
necessary to write other buffers which are involved in a dependency.
In this patch we apply the new treatment to all writes and to the specific
reads that are done as part of flushing buffers involved in dependencies.
When such a read actually does fail, we fabricate a buffer full of zeroes
and pretend that the read succeeded. The only case where a zero-filled
buffer causes the code to do the wrong thing is reading an inode buffer
containing an inode that still has an inodedep in memory, since the
inodedep will reinitialize ip->i_effnlink based on the ip->i_nlink that we
read; to handle this, we now store in the inodedep the ip->i_nlink value
that we wrote.
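The failed-read recovery can be illustrated with a minimal userland sketch. `struct fakebuf` and `recover_failed_read()` are illustrative stand-ins, not the kernel's `struct buf` or its I/O routines.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Illustrative stand-in for the relevant fields of struct buf. */
struct fakebuf {
	unsigned char	*b_data;
	size_t		 b_bcount;
	int		 b_error;	/* nonzero after a failed read */
};

/*
 * Pretend a failed read succeeded by handing back a buffer full of
 * zeroes, as the patch describes doing for reads needed to flush
 * dependency-involved buffers.
 */
static int
recover_failed_read(struct fakebuf *bp)
{
	if (bp->b_error == 0)
		return (0);		/* real data arrived; nothing to do */
	memset(bp->b_data, 0, bp->b_bcount);
	bp->b_error = 0;		/* clear the error and carry on */
	return (0);
}
```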

fsync() and msync() do need to return errors if data fails to be written
to stable storage, and so these operations return ENXIO for every call made
on files in a file system where we have otherwise been ignoring I/O errors.
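The fsync()/msync() exception can be sketched as follows; `struct fake_mount`, its `mnt_ioerror` flag, and `fake_fsync()` are invented names for illustration, assuming a per-mount flag that records entry into the ignore-errors recovery mode.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Invented stand-in for the per-mount recovery-mode state. */
struct fake_mount {
	bool	mnt_ioerror;	/* set once writes to the disk have failed */
};

/*
 * Ordinary writes pretend to succeed once recovery mode is entered,
 * but fsync() must not lie about data having reached stable storage,
 * so it reports ENXIO instead.
 */
static int
fake_fsync(struct fake_mount *mp)
{
	if (mp->mnt_ioerror)
		return (ENXIO);
	return (0);
}
```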

Test Plan

This patch has been extensively tested by myself, Kirk, and Peter Holm.
An earlier version has been in production at Netflix for several months.

Diff Detail

Repository
rS FreeBSD src repository

Event Timeline

chs created this revision. Mar 16 2020, 6:19 PM
cem added a subscriber: cem. Mar 16 2020, 6:49 PM
chs edited the summary of this revision. (Show Details) Mar 18 2020, 5:26 PM
kib added inline comments. Mar 19 2020, 9:16 PM
sys/kern/vfs_bio.c
2778

Is it reasonable to clear BX_CVTENXIO if the buffer still has valid content? If the buffer is revived by gbincore(), should we still operate as if the flag is set?

sys/ufs/ffs/ffs_alloc.c
1154

Why do you unconditionally use FFSV_FORCEINSMQ? This defeats forced unmounts.

(And I have a memory that this was already discussed).

chs added inline comments. Mar 30 2020, 9:01 PM
sys/kern/vfs_bio.c
2778

The purpose of BX_CVTENXIO is to prevent reading potentially stale data back from disk after we have gotten EIO trying to write it (since we don't know what is actually on disk after a write reports an error). If the latest version of the data for a buffer is still in memory then it's fine to keep using it since we know it is consistent with anything else still in memory.
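The rule described here can be sketched in userland; `struct sketch_buf` and `sketch_read()` are simplified stand-ins for the kernel's `struct buf` and its read path, invented purely to illustrate the reply.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Simplified stand-in for the relevant buffer state. */
struct sketch_buf {
	bool	b_cvtenxio;	/* a write of this buffer already failed */
	bool	b_cached;	/* latest contents still valid in memory */
};

/*
 * Decide how a read of this range should be served: an in-memory copy
 * is always safe to use, but after a failed write the on-disk copy may
 * be stale, so re-reading it from disk must be refused.
 */
static int
sketch_read(struct sketch_buf *bp)
{
	if (bp->b_cached)
		return (0);		/* consistent in-memory copy; use it */
	if (bp->b_cvtenxio)
		return (ENXIO);		/* on-disk copy may be stale; refuse */
	return (0);			/* normal read from disk */
}
```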

sys/ufs/ffs/ffs_alloc.c
1154

Yea, we did discuss this before, but I never finished following up on it, sorry about that.

Could you elaborate on exactly how this defeats forced unmounts? We haven't seen any problems in testing.

insmntque1() only starts rejecting new requests after MNTK_UNMOUNTF is set in dounmount(), and in ffs_unmount() one of the first things we do is call vfs_write_suspend_umnt(), which will block any further operations which might use FFSV_FORCEINSMQ and wait for any such operations in progress to complete. I don't see anything in between those two points which requires that insmntque1() will already be rejecting new vnodes, is there something of that nature that I'm overlooking?

pho added a subscriber: pho. Apr 10 2020, 6:39 AM
gbe added a subscriber: gbe. May 1 2020, 1:22 PM
chs added a comment. May 1 2020, 7:24 PM

Are there any more comments on this diff?

pho added a comment (edited). May 2 2020, 2:11 PM
In D24088#542747, @chs wrote:

Are there any more comments on this diff?

Doesn't seem to patch cleanly against r360560.
Fixed that and ran a full stress2 test. LGTM.

chs updated this revision to Diff 71571. May 8 2020, 8:26 PM

update diff to 360823

kib added inline comments. May 8 2020, 11:18 PM
sys/ufs/ffs/ffs_subr.c
229

Why do it in a loop? Why do you expect further attempts to succeed if one of them failed?

I am afraid that one failure implies further failures, and then the taskqueue thread eats CPU and deadlocks the system because other tasks cannot run.

266

Don't we need to drain the task somewhere? The obvious required place is on ufs.ko unload.

294

Shouldn't this printf be either removed or changed to a KASSERT? I believe there is more than one such place.

296

return (ENXIO);

298

Why not use vfs_bio_bzero_buf()? Or, if this function is not supposed to get an unmapped buffer, assert it.

sys/ufs/ffs/ffs_vfsops.c
155

Use CTLFLAG_RWTUN instead of a separate TUNABLE_INT?

sys/ufs/ufs/ufs_vnops.c
2735

Is this needed? stdputpages() does VOP_WRITE(); wouldn't you catch the error and convert it to ENXIO there already?

sys/ufs/ufs/ufsmount.h
48

You could use sys/_task.h there, it would cause much less namespace pollution.

chs added inline comments. May 12 2020, 11:27 PM
sys/ufs/ffs/ffs_subr.c
229

My recollection is that I wanted to handle a race with another unmount. If this automatic unmount failed because a manual one is in progress and then the manual one fails (perhaps because it is not a forced unmount), then we would want to retry the automatic one. But looking again now, I see that racing unmounts are serialized by the vnode lock of mp->mnt_vnodecovered, so the scenario I just described cannot happen.

We previously discussed one situation that currently causes forced unmounts to fail, which is if the mount is being used by a null mount (i.e. mnt_uppers is not NULL). Kirk wants to have dounmount() internally unmount any mnt_uppers recursively, and once I implement that, a forced unmount should never fail.

I'll change this to not retry the unmount.

266

That's true, I'll add that.

294

I'll change these to KASSERTs.

296

I'll change it.

298

I didn't know about vfs_bio_bzero_buf() (and apparently neither did Kirk). I'll switch to using that.

sys/ufs/ffs/ffs_vfsops.c
155

I didn't know about CTLFLAG_RWTUN, I'll use that.

sys/ufs/ufs/ufs_vnops.c
2735

Once we enter this error recovery mode, FFS pretends that all disk writes succeed, so we need to explicitly check if we're in the recovery mode in order to return an error in the paths that should. ffs_write() doesn't have such a check so we wanted one here so that msync() can return an error in this case.

It occurs to me that ffs_write() ought to have an explicit check for recovery mode for the benefit of write() with O_SYNC, and then that check there would also do the right thing for msync() without a special vop_putpages. I'll change it that way.

sys/ufs/ufs/ufsmount.h
48

Ok, I'll change that.

chs updated this revision to Diff 71742. May 13 2020, 11:34 PM

update to address review comments.

chs updated this revision to Diff 71743. May 13 2020, 11:42 PM

really remove putpages wrapper this time.

kib accepted this revision. May 15 2020, 2:39 PM
kib added inline comments.
sys/ufs/ffs/ffs_alloc.c
1154

This might indeed be less of an issue now that we suspend the filesystem on unmount. Otherwise, always using FORCEINSMQ there means that umount cannot vflush(9) vnodes that are being created after the unmount started, so it can never finish.

As you see, old code (on the left) only used FORCEINSMQ to recheck for dup alloc, but otherwise did vput() immediately, which together with ip->i_mode == 0 caused immediate reclamation, under the vnode lock.

This is actually a nice property of the code, and I feel uneasy breaking it, even if suspend at unmount allows us to claim that the change is formally correct.

Anyway, I will not block the change on this issue, lets work it out later.

This revision is now accepted and ready to land. May 15 2020, 2:39 PM

Peter,
Can you run one more test run over this code just to be sure that it patches cleanly and that none of our final changes broke anything?

pho added a comment. May 17 2020, 2:07 PM

Peter,
Can you run one more test run over this code just to be sure that it patches cleanly and that none of our final changes broke anything?

I ran the full stress2 test on r361090 with D24088.71743.diff. No problems seen.

mckusick accepted this revision. May 19 2020, 5:41 AM

Finally, this looks good to go.

kib closed this revision. May 26 2020, 10:21 PM

Committed as r361491.