Page MenuHomeFreeBSD

brelse() should not make buffers dirty again when writes fail
Needs RevisionPublic

Authored by chs on Jul 18 2018, 5:54 PM.

Details

Summary

There is a block of code in brelse() that for some write errors will clear the error and call bdirty() as a way to retry failed writes. This code was introduced 20 years ago:


r43043 | dg | 1999-01-22 00:59:05 -0800 (Fri, 22 Jan 1999) | 7 lines

Don't throw away the buffer contents on a fatal write error; just mark
the buffer as still being dirty. This isn't a perfect solution, but
throwing away the buffer contents will often result in filesystem
corruption and this solution will at least correctly deal with transient
errors.
Submitted by: Kirk McKusick <mckusick@mckusick.com>


At the time, this was a convenient way to handle transient failures when the underlying disk drivers were not very good about retrying operations. Since then, the CAM layer was written, and it has much better retry logic, so this code in brelse() is no longer needed for its original purpose. Further, this redirtying in brelse() does not currently handle B_PAGING buffers correctly and results in panics like this:

panic: cannot reassign paging buffer
cpuid = 22
time = 1521247837
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe257f51e160
vpanic() at vpanic+0x19c/frame 0xfffffe257f51e1e0
panic() at panic+0x43/frame 0xfffffe257f51e240
reassignbuf() at reassignbuf+0x273/frame 0xfffffe257f51e280
bdirty() at bdirty+0x2e/frame 0xfffffe257f51e2a0
brelse() at brelse+0x110/frame 0xfffffe257f51e300
bufdone() at bufdone+0x87/frame 0xfffffe257f51e320
g_io_deliver() at g_io_deliver+0x202/frame 0xfffffe257f51e380
g_io_deliver() at g_io_deliver+0x202/frame 0xfffffe257f51e3e0
g_disk_done() at g_disk_done+0x122/frame 0xfffffe257f51e430
bioq_flush() at bioq_flush+0x87/frame 0xfffffe257f51e460
cam_iosched_fini() at cam_iosched_fini+0x4b/frame 0xfffffe257f51e480
adacleanup() at adacleanup+0x56/frame 0xfffffe257f51e4b0
cam_periph_release_locked_buses() at cam_periph_release_locked_buses+0xa7/frame 0xfffffe257f51e9c0
cam_periph_release_locked() at cam_periph_release_locked+0x1b/frame 0xfffffe257f51e9e0
xpt_done_process() at xpt_done_process+0x68d/frame 0xfffffe257f51ea20
xpt_done_td() at xpt_done_td+0x186/frame 0xfffffe257f51ea70
fork_exit() at fork_exit+0x82/frame 0xfffffe257f51eab0
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe257f51eab0

So the best way to fix the B_PAGING problem with this code is to simply remove it. This patch does exactly that.

Test Plan

Without this patch, making a disk disappear out from under an active file system will frequently result in the above panic. With this patch, the same test no longer produces this panic.

Diff Detail

Repository
rS FreeBSD src repository
Lint
Lint Skipped
Unit
Unit Tests Skipped

Event Timeline

This revision is now accepted and ready to land.Jul 18 2018, 6:23 PM

I am fine with this change.

On further thought, I realize that while we (correctly) deal with the B_PAGING panic, we simply replace it with the "dangling dependency" panic in UFS along with about 100 UFS metadata corruption panics that occur if the I/O fails when writing a UFS metadata block. So, at a minimum we need to make a callback to the filesystem when we are going to fail a write so that it can take some sort of defensive action (such as forcible unmount). This can be done here where we know that we are about to start the cascade or perhaps in buf_free where we should call buf_deallocate if the reason for the buffer being freed is that it had a failed write.

This revision now requires changes to proceed.Jul 20 2018, 2:58 PM

On further thought, I realize that while we (correctly) deal with the B_PAGING panic, we simply replace it with the "dangling dependency" panic in UFS along with about 100 UFS metadata corruption panics that occur if the I/O fails when writing a UFS metadata block. So, at a minimum we need to make a callback to the filesystem when we are going to fail a write so that it can take some sort of defensive action (such as forcible unmount). This can be done here where we know that we are about to start the cascade or perhaps in buf_free where we should call buf_deallocate if the reason for the buffer being freed is that it had a failed write.

For data blocks, this is absolutely the right fix.

For metadata blocks, we have to write them with a completion routine to indicate we care about the return value. Won't those cases not go through this path anyway since they are required to do their own completions then? Isn't that the normal way to indicate you care about the completion code? It's what we do elsewhere for things like swap errors. bufdone won't go down this path if there's a b_iodone() rotuine set for the buffer. Or am I missing something?

In D16328#347470, @imp wrote:

On further thought, I realize that while we (correctly) deal with the B_PAGING panic, we simply replace it with the "dangling dependency" panic in UFS along with about 100 UFS metadata corruption panics that occur if the I/O fails when writing a UFS metadata block. So, at a minimum we need to make a callback to the filesystem when we are going to fail a write so that it can take some sort of defensive action (such as forcible unmount). This can be done here where we know that we are about to start the cascade or perhaps in buf_free where we should call buf_deallocate if the reason for the buffer being freed is that it had a failed write.

For data blocks, this is absolutely the right fix.

This is not a right fix for the data blocks, at least for some people. https://wiki.postgresql.org/wiki/Fsync_Errors

For metadata blocks, we have to write them with a completion routine to indicate we care about the return value. Won't those cases not go through this path anyway since they are required to do their own completions then? Isn't that the normal way to indicate you care about the completion code? It's what we do elsewhere for things like swap errors. bufdone won't go down this path if there's a b_iodone() rotuine set for the buffer. Or am I missing something?

What if completion routine cannot make a further progress when the buffer write failed ?

You clearly have the specific use case where the system survival is more important that the integrity of the (meta)data. Other people may have other priorities. Basically, the way to un-knot the problem IMO, is to make the behavior selected by a mount flag. It would be a first step in the implementation of 'remount ro on error' flag.

In D16328#347509, @kib wrote:

For data blocks, this is absolutely the right fix.

This is not a right fix for the data blocks, at least for some people. https://wiki.postgresql.org/wiki/Fsync_Errors

So just what is the contract for bwrite? The I/O subsystem has signaled it can't write the block, and that it's tried as hard as it can to do so. fsync isn't required to behave how fsync_Errors says, though I will admit it's an interesting read. It does seem that our current lazy retry stuff has some useful side effects...

In some sense, UFS has signaled that it doesn't care since it used the bwrite for these blocks. Or is that not the contract of bwrite w/o a b_iodone routine? Of course, just because the filesystem doesn't care doesn't mean the user doesn't care enough to try to fsync and interpret the error.

For metadata blocks, we have to write them with a completion routine to indicate we care about the return value. Won't those cases not go through this path anyway since they are required to do their own completions then? Isn't that the normal way to indicate you care about the completion code? It's what we do elsewhere for things like swap errors. bufdone won't go down this path if there's a b_iodone() rotuine set for the buffer. Or am I missing something?

What if completion routine cannot make a further progress when the buffer write failed ?

That's a good question...

But looking at the code we don't do completion routines, as a rule (we do for 'shadow writes' and neuter the brelse behavior on errors), but it's kinda hard to grep for because many of the bwrite is most often used for sync writes (which don't go through this path as far as I can tell, though the code is complicated enough I could easily be wrong which is why I was asking about contracts).

You clearly have the specific use case where the system survival is more important that the integrity of the (meta)data. Other people may have other priorities. Basically, the way to un-knot the problem IMO, is to make the behavior selected by a mount flag. It would be a first step in the implementation of 'remount ro on error' flag.

I'm happy to make at least some of this behavior to be conditional on mount flags. I'd like to understand how we're using / abusing the buffer cache and try to document things as best we can since we have some number of differences from Bach's book which documents the classic Unix behavior...

In D16328#347678, @imp wrote:
In D16328#347509, @kib wrote:

For data blocks, this is absolutely the right fix.

This is not a right fix for the data blocks, at least for some people. https://wiki.postgresql.org/wiki/Fsync_Errors

So just what is the contract for bwrite? The I/O subsystem has signaled it can't write the block, and that it's tried as hard as it can to do so. fsync isn't required to behave how fsync_Errors says, though I will admit it's an interesting read. It does seem that our current lazy retry stuff has some useful side effects...

I am not sure that you are asking question you intend to ask. Do you mean bwrite(9) or bdwrite(9), there and below ? UFS tries to handle bwrite() errors, in some/most cases. For bdwrite(9) it is fire and expect that buffer cache would ensure the write. If the write cannot be performed, UFS does not deal with the situation, although some recent changes by Kirk started some important code rearrangements.

In some sense, UFS has signaled that it doesn't care since it used the bwrite for these blocks. Or is that not the contract of bwrite w/o a b_iodone routine? Of course, just because the filesystem doesn't care doesn't mean the user doesn't care enough to try to fsync and interpret the error.

No, I completely disagree. UFS did not signalled anything, UFS code just lacks error recovery from the storage layer failures.

For metadata blocks, we have to write them with a completion routine to indicate we care about the return value. Won't those cases not go through this path anyway since they are required to do their own completions then? Isn't that the normal way to indicate you care about the completion code? It's what we do elsewhere for things like swap errors. bufdone won't go down this path if there's a b_iodone() rotuine set for the buffer. Or am I missing something?

What if completion routine cannot make a further progress when the buffer write failed ?

That's a good question...

But looking at the code we don't do completion routines, as a rule (we do for 'shadow writes' and neuter the brelse behavior on errors), but it's kinda hard to grep for because many of the bwrite is most often used for sync writes (which don't go through this path as far as I can tell, though the code is complicated enough I could easily be wrong which is why I was asking about contracts).

We do use completion routines, thing is we have more of them than only biodone. Somewhat obscure but equally important are the bio_ops.io_start and bio_ops.io_complete, which are SU hooks.

You clearly have the specific use case where the system survival is more important that the integrity of the (meta)data. Other people may have other priorities. Basically, the way to un-knot the problem IMO, is to make the behavior selected by a mount flag. It would be a first step in the implementation of 'remount ro on error' flag.

I'm happy to make at least some of this behavior to be conditional on mount flags. I'd like to understand how we're using / abusing the buffer cache and try to document things as best we can since we have some number of differences from Bach's book which documents the classic Unix behavior...

Bach book is completely outdated. I am not even sure that reading it gives any useful knowledge WRT FreeBSD buffer cache, instead of increasing the confusion.

One important rule that I learned hard for vfs_bio.c is that b_flags do not have a semantic. You cannot rely on the description to know what would happen with a buffer with specific flags combination. Instead, you need to trace the code to see exactly what happens.

As kib has pointed out, much of UFS is not prepared to deal with failure.

If the filesystem asks to write something, especially metadata, it expects to eventually succeed. The current approach is to just keep redirtying the buffer until it is successfully written. Historically this worked because if a sector was bad, it would eventually get remapped through some hardware/driver mechanism. Other changes that were dependent on the write, are held pending until the write succeeds, so the filesystem stays consistent if a crash occurs.

If for some reason the write cannot be made to succeed then the buffer cache fills with dirty and/or delayed writes until the kernel eventually runs out of memory that it will dedicate to buffers and the filesystem deadlocks. This most commonly occurs when a disk goes offline (dies or is removed in the case of a flash drive).

Another issue that can lead to panic is a request for an fsync that panics because it is unable to eliminate all the dirty buffers associated with the vnode being fsync'ed. The fsync system call returns an error, but some of the internal uses of fsync panic if the fsync fails.

The I/O subsystem is now robust enough that it will exhaust the possible recovery methods before returning an error. So redirtying the buffer and trying again is no longer a viable strategy. The goal of my "filesystem hardening" work is to more gracefully handle permanent disk errors.

For errors such as ENODEV where the disk has gone away, the only viable strategy is to forcibly unmount the filesystem. If it is the root filesystem this approach is no better than a panic. But if it is some less critical filesystem, the system could continue to run.

For cases where only a subset of the disk has failed, it might be possible to forcibly downgrade the filesystem to read-only which would let the system continue to operate even if it was the root filesystem. This mode is more difficult to assess when it is a viable strategy and if it is possible to actually do it. So, the initial work will be to implement just forcible unmount and then decide how feasible it would be to implement forcible downgrade to read-only.

Which now circles us back to the discussion of what we do in terms of this proposed patch. Based on the discussion to this point, I believe that we should keep this code path, but instead of the current redirtying, we should create a new buf_writefail() entry that allows each filesystem to decide how to deal with the failure.

The default behavior which would be applied to all filesystems that use the buffer cache would be to keep the old behavior (redirty the buffer and hope for the best).

For UFS, we could use a mount option to decide keep the old behavior or to initiate forcible unmount or forcible downgrade to read-only. Those working on other filesystems (msdosfs, ext2, etc) could choose to implement similar alternatives to the default.

I'll try to address all the points that everyone has made so far here.

Kirk wants a callback when a write fails, to allow the fs to do something to prevent corruption. Does this really need to be a new callback mechanism, or can the existing b_iodone callback be used for this? If we use b_iodone for this then essentially every buffer would need to set b_iodone. Note that currently if b_iodone is non-NULL then the softdep io_completion callback is not called. It looks like the expedient choice would be to add a new callback mechanism for this to biodone(). I suspect this new callback would need to be called before any of the existing callback mechanisms.

(For those who don't know me, my background includes many years working on the Veritas File System (vxfs), so I'll try to apply some ideas from vxfs to FreeBSD file systems.)

vxfs has the concept of a file system being "disabled", which means (among other things) that all disk I/O requests that would normally be submitted to the underlying disk driver are failed without actually being attempted. In FreeBSD, something like this would provide the primary benefit of the forced unmount idea that has been discussed, namely preventing subsequent writes from creating additional inconsistencies in the file system image on disk. A FreeBSD version of this could be implemented by having the new I/O error callback call into the g_vfs layer to remove the file system's read/write permission on the g_vfs.

After the underlying disk is no longer writable by the file system, a forced unmount could be performed asynchronously, either in the kernel or from userland via devd. Similarly, a remount-to-ro policy could remove just write access from the g_vfs, and the fs could be remounted asynchronously, again either in the kernel or from userland via devd. These unmount/remount operations would have to be done after unlocking the buffer that could not be written, since the unmount/remount path would want to lock the same buffer. But that should be harmless if that only happens after writing to the fs is blocked.

Konstantin wants there to be a mount option to allow for selecting the I/O error policy to be used used for a particular file system. That's certainly fine.

Konstantin also wants to do something about the fsync() error-reporting conundrum. In vxfs, the approach was to record data write failures with a flag on the inode. Once set, this flag would remain set until the inode went inactive. If fsync() saw that the flag was set, it would report an error even if flushing all currently dirty buffers succeeded. We could do the same thing in FreeBSD, though generically for all file systems with a vnode flag, rather than in the fs-specific code like vxfs does.

Warner doesn't seem to have any specific requirements yet, just a desire for sanity. :-)

Are these approaches to addressing the various issues satisfactory? Are there any other issues that anyone thinks should be included in this set of changes?

I think the current redirty-on-error check in brelse() really still needs to go away. Keeping the current redirty code in brelse() would require fixing that code to handle B_PAGING buffers, which seems non-trivial. If we want to support a retry-forever policy for disk I/O then I think that really belongs in CAM (and any non-CAM disk drivers) rather than in the buffer cache code. The interface to select such a disk driver retry policy could still be a file system mount option if that is what people like.

In D16328#349738, @chuq_chuq.com wrote:

I'll try to address all the points that everyone has made so far here.

This summary is good as the list of points made by participants, but it misses the most important fact which would explain both my and Kirk attitude (sorry to Kirk if I am putting my words), as well as the existing code, in particular, the reason for the re-dirty.

When filesystem bwrite() or brelse() the buffer, it expects that the next bread() returns the same content as of time of brelse(), or error. It is not allowed to return earlier content. For instance, we do expect that the content of the inode block is consistent with the inode pointed to by the vnode. It is impossible to silently redirect new content of the buffers to NULL and expect the filesystems to continue work.

From this PoV, the re-dirty hack provides the expected behavior. If we remount the filesystem to ro after an error, the same consistency guarantees are required, since e.g. the emulation of the successful write of the directory entry for an inode causes update of the inode link count. If some of the inter-depended data is old and some new, even ro access would not be possible due to the inconsistent metadata.

All this does not even try to discuss what would userspace see in the data.

The idea of channelling everything through b_iodone is aesthetically pleasing, but impractical. Adding an extra indirect call to every buffer operation introduces a lot of overhead for a very rare event. It also requires changes to every filesystem that uses the buffer cache (which is most of them). So, the idea of adding a function that gets called only when a disk goes offline that can have a default function that keeps the current (albeit broken) behavior of redirtying the buffer is both efficient and requires far less code churn.

As Konstantin has pointed out, downgrading to read-only is far more challenging than simply forcibly unmounting. UFS ensures that the on-disk state is always recoverably consistent, though behind the in-memory state. So when a disk goes offline, a forcible unmount need only clear out and free all the in-memory state (which is what a crash does in one fell-swoop :-)

So, once an unrecoverable write error occurs, we need to switch into a mode that walks through the in-memory structures and frees them up. We have much of this infrastructure in place for doing forcible unmounts. We just need to fill out the places where it does not expect an error. Here, the filesystem may think it is still able to do I/O, but as Chuck has pointed out, the I/O requests are just reflected back immediately with an I/O error.