Page MenuHomeFreeBSD

Handle ENXIO bufs.
ClosedPublic

Authored by trasz on Feb 19 2017, 5:09 PM.
Tags
None
Referenced Files
Unknown Object (File)
Mon, Apr 22, 7:53 PM
Unknown Object (File)
Mon, Apr 22, 9:51 AM
Unknown Object (File)
Sun, Apr 21, 10:33 AM
Unknown Object (File)
Sun, Apr 21, 10:33 AM
Unknown Object (File)
Sun, Apr 21, 10:33 AM
Unknown Object (File)
Sun, Apr 21, 10:33 AM
Unknown Object (File)
Sun, Apr 21, 10:33 AM
Unknown Object (File)
Sun, Apr 21, 10:33 AM
Subscribers

Details

Summary

Don't try to write out bufs that have already failed with ENXIO. This
fixes some panics after disconnecting mounted disks.

Submitted by: imp (slightly different version, which I've then lost)

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

trasz retitled this revision from to Handle ENXIO bufs..
trasz updated this object.
trasz edited the test plan for this revision. (Show Details)

Do you also have patches in flight for fixes for the msdos thing we talked about on irc?

sys/kern/vfs_bio.c
2296 ↗(On Diff #25380)

I added a comment here that says that one might be able to retry these operations because all errors except ENXIO are expected to be potentially transient (underlying media might work if tried again after EIO, memory might be available after a ENOMEM, etc).

2301 ↗(On Diff #25380)

I may have added a comment here that said that ENXIO buffers were marked invalid because once we see that for a buffer, we should never retry it.

sys/kern/vfs_bio.c
2308 ↗(On Diff #25380)

If such ENXIO buffer has a dependency, i.e. b_dep is not empty, this call is equal to panic(). Re-dirtying the buffer at least allowed the system to survive.

sys/kern/vfs_bio.c
2308 ↗(On Diff #25380)

This is the behavior we had before I made my change r290224 which took out the error == EIO from line 2293.

Pancing here is also bad. Do you have a suggestion for how we might handle that situation better? Or does that involve fixing the soft-update dependency code in UFS which requires writing code to downgrade this mountpoint gracefully even when dirty, which is non-trivial...

sys/kern/vfs_bio.c
2308 ↗(On Diff #25380)

Although looking at softdep_deallocate_dependencies, the panic doesn't happen when error == ENXIO....

That's true, softdeps code has an explicit workaround for ENXIO. That said, I don't think we need to worry about softdep at all at this time - if it panics, it panics; it's not a regression, because it always panicked. It would be nice if UFS without softdep, or msdosfs, didn't panic, and for that, we need to handle those bufs somewhere.

sys/kern/vfs_bio.c
2308 ↗(On Diff #25380)

Softdep work items form a complicated net. If a buffer which still has unsatisfied dependencies is brelse-d, the kernel pointers integrity is compromised and kernel starts accessing random data. I always thought that explicit panic or even deadlock is much better outcome of hardware malfunction than silent user data corruption.

I do not see why should we inflict that on yourself consciously. I mean, first a mechanism to un-knot the SU deps should be developed, which would allow to safely degrade mount into ro mode on error or metadata format violations. After that, a change like this one, can be implemented.

sys/kern/vfs_bio.c
2308 ↗(On Diff #25380)

So we should break systems that don't use soft-updates because soft-updates is broken? I'm not sure I follow the logic in that. We should commit this change because we found the problem now and we can fix other downstream issues later. Requiring all the downstream issues to be fixed seems like an undue burden. This issue will be lost before that happens. And it will deprive people that avoid soft-updates fixes for this issue.

I absolutely agree soft-updates should be fixed to allow degrading to read-only or read-never modes. A lot of our panics at Netflix are due to ufs' and the buffer cache's terrible handling of I/O errors when a drive is failing or dodgy. If this were the only panic we had, I might concede the point we should wait. But it isn't. Far from it.

Perhaps we could just add an additional check to not do this in case of softdep, ie if the buf has dependencies? Or rather, just add a conditional a bit lower to not call buf_deallocate() of them?

Handle 32 bits as well. Note that the 32 bit part is untested.

Bring back the correct diff.

Here is a change that will at least clean up the soft updates associated with the buffer, though as noted if the filesystem keeps running it will put itself into an inconsistent and possibly unrecoverable state. This is no worse than what would happen with a filesystem running without soft updates if you continue operating after failing to do the I/O.

Index: sys/ufs/ffs/ffs_softdep.c
===================================================================
--- sys/ufs/ffs/ffs_softdep.c   (revision 313991)
+++ sys/ufs/ffs/ffs_softdep.c   (working copy)
@@ -14241,6 +14241,10 @@
        }
 }

+static int DoNotCorruptFilesystemOnError = 1;
+SYSCTL_INT(_vfs_ffs, OID_AUTO, DoNotCorruptFilesystemOnError, CTLFLAG_RW,
+          &DoNotCorruptFilesystemOnError, 0,
+          "Panic when unfixable I/O error occurs rather continuing to run");
 /*
  * Called whenever a buffer that is being invalidated or reallocated
  * contains dependencies. This should only happen if an I/O error has
@@ -14258,8 +14262,18 @@
        else
                printf("softdep_deallocate_dependencies: "
                    "got error %d while accessing filesystem\n", bp->b_error);
-       if (bp->b_error != ENXIO)
+       /*
+        * Call softdep_disk_write_complete below to clean up the dependencies
+        * on this buffer. This will leave the filesystem in an inconsistent
+        * state, since soft updates thinks the I/O has completed successfully
+        * when in fact it has not. So we should mark the filesystem as
+        * read-only to prevent further damage, but at this time do not have
+        * a way to do so.
+        */
+       if (bp->b_error != ENXIO || DoNotCorruptFilesystemOnError)
                panic("softdep_deallocate_dependencies: unrecovered I/O error");
+       else
+               softdep_disk_write_complete(bp);
 }

 /*

Here is a change that will at least clean up the soft updates associated with the buffer, though as noted if the filesystem keeps running it will put itself into an inconsistent and possibly unrecoverable state. This is no worse than what would happen with a filesystem running without soft updates if you continue operating after failing to do the I/O.

Index: sys/ufs/ffs/ffs_softdep.c
===================================================================
--- sys/ufs/ffs/ffs_softdep.c   (revision 313991)
+++ sys/ufs/ffs/ffs_softdep.c   (working copy)
@@ -14241,6 +14241,10 @@
        }
 }

+static int DoNotCorruptFilesystemOnError = 1;
+SYSCTL_INT(_vfs_ffs, OID_AUTO, DoNotCorruptFilesystemOnError, CTLFLAG_RW,
+          &DoNotCorruptFilesystemOnError, 0,
+          "Panic when unfixable I/O error occurs rather continuing to run");
 /*
  * Called whenever a buffer that is being invalidated or reallocated
  * contains dependencies. This should only happen if an I/O error has
@@ -14258,8 +14262,18 @@
        else
                printf("softdep_deallocate_dependencies: "
                    "got error %d while accessing filesystem\n", bp->b_error);
-       if (bp->b_error != ENXIO)
+       /*
+        * Call softdep_disk_write_complete below to clean up the dependencies
+        * on this buffer. This will leave the filesystem in an inconsistent
+        * state, since soft updates thinks the I/O has completed successfully
+        * when in fact it has not. So we should mark the filesystem as
+        * read-only to prevent further damage, but at this time do not have
+        * a way to do so.
+        */
+       if (bp->b_error != ENXIO || DoNotCorruptFilesystemOnError)
                panic("softdep_deallocate_dependencies: unrecovered I/O error");
+       else
+               softdep_disk_write_complete(bp);
 }

 /*

This might work, but I liked the trasz' proposal to not terminate the buffer on ENXIO for which dependencies exist. May be, both the optional call to softdep_disk_write_complete() as above, and trasz' idea, should be committed, managed by the common user-settable knob. Also, some explicit explanation to users what are pro and contra of the available setting, need to be provided.

Here is a change that will at least clean up the soft updates associated with the buffer, though as noted if the filesystem keeps running it will put itself into an inconsistent and possibly unrecoverable state. This is no worse than what would happen with a filesystem running without soft updates if you continue operating after failing to do the I/O.

I think there's a difference in our notions of what happens next. When the driver starts returning ENXIO, no further I/O is possible and all future I/Os will return ENXIO. You can never read or write further from the device in that situation. The filesystem cannot become any more or less corrupted than it was when the device was given up for lost. Completing the buffers allow them to be recycled. Ideally, we'd be invalidating all buffers associated with the mount point and transitioning the filesystem to a 'deadfs' mode where it fails all future user requests. The state of softupdates doesn't matter, so long as it doesn't cause panics while the cleanup is going on. If we aren't doing this now, I'd love to work on code to make that happen. So while we may want to do the harder 'read-only' thing for other errors, when you get ENXIO, the underlying storage is gone and we should free up the resources ASAP.

This is quite different than what would happen with EIO or other errors that might be transient (which currently don't take this path, and won't after). This change would only convert ENXIO errors into this failure path. It would leave all other errors as they currently are: retry because it might work if you try again.

So we're only calling the invalidation path on ENXIO when no further I/O is possible to the device. As such, I'm not sure I think that the proposed change that Kirk has is needed. We're already in a situation where whatever we have in memory is all we will ever have and we need to evict it as soon as possible because only some I/Os will complete for the user, and the ones that will are unpredictable.

Perhaps we should chat about this next week at the storage summit.

sys/kern/vfs_bio.c
2304 ↗(On Diff #25443)

I added "Or we failed to write to a device that's no longer present." to the comment here, but that could use some wordsmithing.

I concur with imp's analysis. Once ENXIO is returned no more I/O will be possible, so the disk will remain in whatever state it is in. So it is always safe to just call softdep_disk_write_complete(bp); to clean up the dependencies and free their storage. There is still the issue that clearing some dependencies will allow others to proceed and those other may try to read data from the disk (such as the update of an inode). Almost always the update will be for something that is already cached, but occationally there will be a cache miss and so to ensure success we will need to augment soft updates to handle those cases. This is probably best done by setting a state flag (MNT_DISKGONE?) that indicates that the disk is gone and we are shutting down. In the (rather few) places in the soft depedencies code that we do reads, we check for that flag and provide a usable default which in most cases will just be a zeroed out buffer.

My user setable knob would not be needed here (i.e., we just add the else softdep_disk_write_complete(bp); code) . But as kib points out, it might be useful to incorporate into trasz's proposed change.

I concur with imp's analysis. Once ENXIO is returned no more I/O will be possible, so the disk will remain in whatever state it is in. So it is always safe to just call softdep_disk_write_complete(bp); to clean up the dependencies and free their storage. There is still the issue that clearing some dependencies will allow others to proceed and those other may try to read data from the disk (such as the update of an inode). Almost always the update will be for something that is already cached, but occationally there will be a cache miss and so to ensure success we will need to augment soft updates to handle those cases. This is probably best done by setting a state flag (MNT_DISKGONE?) that indicates that the disk is gone and we are shutting down. In the (rather few) places in the soft depedencies code that we do reads, we check for that flag and provide a usable default which in most cases will just be a zeroed out buffer.

My user setable knob would not be needed here (i.e., we just add the else softdep_disk_write_complete(bp); code) . But as kib points out, it might be useful to incorporate into trasz's proposed change.

SU items reference other items, and processing of items requires that the blocks are available. All proposals that argue that it is fine to drop buffers with pending dependencies assume that it is fine to add complexity to the SU code to handle more cases. This is what I strongly disagree with. SU is already extremely complex piece of code, and making all code pathes handle additional cases does not look too promising. And these cases are hard to test since we have basically only one 'shot' after the device is removed.

In my opinion, pretending that device is still there to allow the normal SU code to get out of the dependencies by completing writes is the only workable scenario to handle device loss under SU mount. If a buffer has dependencies, then they did not completed and this is because there are other deps which reference this buffer.

In my opinion, pretending that device is still there to allow the normal SU code to get out of the dependencies by completing writes is the only workable scenario to handle device loss under SU mount. If a buffer has dependencies, then they did not completed and this is because there are other deps which reference this buffer.

Allow me to restate to make sure I understand.

You'd like to complete buffers with dependencies on them as if they had actually completed w/o error, but keep them valid until the references from the other buffers go away? And you'd like to do this with as minimal impact to the SU code as possible (ideally with no impact). The SU code needs to be able to peek into the data in these buffers to make decisions about what to do with them. Were we to slam the door shut prematurely, they would try to access the buffers which may lead to a crash if they have been reclaimed. Any change in this area should not only make things better, but should ideally help prevent crashes that are easily foreseeable even if such crashes were possible before my (Warner's) change that took out bp->b_errror == EIO and that might have been 'fixed' by that but would now be in play again.

Is that what you'd like to see?

In D9674#201790, @imp wrote:

You'd like to complete buffers with dependencies on them as if they had actually completed w/o error, but keep them valid until the references from the other buffers go away?

Yes, and the indication of such buffer is the bp->b_dep being non-empty.

And you'd like to do this with as minimal impact to the SU code as possible (ideally with no impact). The SU code needs to be able to peek into the data in these buffers to make decisions about what to do with them.

SU needs to undo some changes to data in buffers with dependencies, when doing writes and dependencies are not complete yet (i.e. because other buffers are not yet cleaned). Normally, any buffer with incomplete dependencies is dirty, and cannot be cleared until deps are complete, SU code re-dirties such buffer after write and putting back the reverted data.

Were we to slam the door shut prematurely, they would try to access the buffers which may lead to a crash if they have been reclaimed.

As explained above, SU code assumes that buffers with dependencies cannot go away because the buffers are dirty until dependencies are complete.

Any change in this area should not only make things better, but should ideally help prevent crashes that are easily foreseeable even if such crashes were possible before my (Warner's) change that took out bp->b_errror == EIO and that might have been 'fixed' by that but would now be in play again.

I am not sure what do you mean there. From the high-level view, my opinion on fixing buffer cache interaction with the device errors should consider the main buffer cache consumer UFS and its non-trivial integration with the buffer cache (see above about b_dep and dirty buffers as example). I.e. in my opinion, we should not e.g. fix msdosfs until we know that the fix does not make UFS situation worse or completely paint us into corner for fixing UFS. Because fixing msdosfs is 'just finding corner cases and coding' so to say, while fixing UFS is an architectural issue.

Is that what you'd like to see?

What you explained before the last sentence, is probably a workable strategy for dealing with SU and might be even with SU+J (but +J would certainly require special code to handle journal write errors). I do not claim that this is the only possible or even the best strategy for UFS, but so far nobody thought publicitely about UFS and no other approach was provided. I.e. I do not insist on it, but I consider it as viable and making buffer cache incompatible with this approach without providing either arguments why the approach is not viable or alternative appoach is not good.

Update to incorporate the idea of adding an exception for softupdates.
I'd prefer the softupdate-specific cleanup to be addressed in a separate
patch.

Note that this is only compile-tested for now.

I'm OK with this version.

This revision is now accepted and ready to land.Apr 11 2017, 8:07 PM
This revision was automatically updated to reflect the committed changes.