gmirror: treat ENXIO as disk disconnect, not media error
ClosedPublic
Actions

Authored by avg on Feb 6 2017, 6:51 PM.

Details

Reviewers

markj
mav
pjd
imp

Commits

rS323612: gmirror: treat ENXIO as disk disconnect, not media error

Summary

In theory, all data access errors mean that a member is out of sync
at most. But they were treated as more serious errors to avoid the
situation where a flaky disk gets repeatedly disconnected, re-synchronized,
reconnected and then disconnected again.

ENXIO is a special error that means that the member disk disappeared,
so it should get the same handling as the GEOM orphaning event.
There is a better chance that when the disk is reconnected, it will be
a good member again.

Diff Detail

Repository

rS FreeBSD src repository - subversion

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

avg updated this revision to Diff 24809.Feb 6 2017, 6:51 PM

avg retitled this revision from to gmirror: treat ENXIO as disk disconnect, not media error.

avg updated this object.

avg edited the test plan for this revision. (Show Details)

avg added reviewers: mav, markj, pjd.

Herald added a subscriber: imp. · View Herald TranscriptFeb 6 2017, 6:51 PM

markj added inline comments.Feb 6 2017, 7:17 PM

sys/geom/mirror/g_mirror.c
985 ↗	(On Diff #24809)	There is similar code for handling errors from metadata updates and synchronization I/O. Do we want similar handling there?

avg added inline comments.Feb 7 2017, 9:29 AM

sys/geom/mirror/g_mirror.c
985 ↗	(On Diff #24809)	I think that the metadata update error is certainly a more grave error as the affected member would have a mismatching configuration on disk. I haven't made up my mind about the synchronization error, so I'd prefer to keep the status quo for it.

markj accepted this revision.Feb 7 2017, 5:57 PM

markj edited edge metadata.

This revision is now accepted and ready to land.Feb 7 2017, 5:57 PM

While I see rationale in this change, I think it is not completely correct. Setting G_MIRROR_BUMP_SYNCID really bumps syncid only on next write, that may not happen (or happen much later), while disks are already out of sync at the point of this error received.

In D9463#196505, @mav wrote:

While I see rationale in this change, I think it is not completely correct. Setting G_MIRROR_BUMP_SYNCID really bumps syncid only on next write, that may not happen (or happen much later), while disks are already out of sync at the point of this error received.

Oh, that's true. Thank you for pointing out the flaw. Do you have any ideas / suggestions on how that case should be handled?

In D9463#196688, @avg wrote:

Do you have any ideas/ suggestions on how that case should be handled?

My thoughts are now far from this, but just as a thought: I think it would be fine to bump the sync ID immediately on write error, while not necessary update the metadata immediately. Metadata on disks should already have DIRTY flag set while write operation is running, that should protect us in case system crash before we write updated ones.

In D9463#196690, @mav wrote:

In D9463#196688, @avg wrote:

Do you have any ideas/ suggestions on how that case should be handled?

My thoughts are now far from this, but just as a thought: I think it would be fine to bump the sync ID immediately on write error, while not necessary update the metadata immediately. Metadata on disks should already have DIRTY flag set while write operation is running, that should protect us in case system crash before we write updated ones.

If we crash before writing the updated metadata, I believe there's nothing preventing us from using the stale mirror as the master during synchronization, in which case the partially-failed write would be lost.

In D9463#196694, @markj wrote:

In D9463#196690, @mav wrote:

My thoughts are now far from this, but just as a thought: I think it would be fine to bump the sync ID immediately on write error, while not necessary update the metadata immediately. Metadata on disks should already have DIRTY flag set while write operation is running, that should protect us in case system crash before we write updated ones.

If we crash before writing the updated metadata, I believe there's nothing preventing us from using the stale mirror as the master during synchronization, in which case the partially-failed write would be lost.

I've explicitly mentioned DIRTY flag there. It should be set in metadata before first write sent to any disk, and it is what should force re-synchronization after unexpected reboot. If array get idle at some point after that, updated syncid in metadata will be flushed. If not -- possibly may force another flush after some time. I just don't want metadata to be flushed on every error, especially to departing disk, which is probably unavailable any more.

In D9463#196703, @mav wrote:

In D9463#196694, @markj wrote:

In D9463#196690, @mav wrote:

My thoughts are now far from this, but just as a thought: I think it would be fine to bump the sync ID immediately on write error, while not necessary update the metadata immediately. Metadata on disks should already have DIRTY flag set while write operation is running, that should protect us in case system crash before we write updated ones.

If we crash before writing the updated metadata, I believe there's nothing preventing us from using the stale mirror as the master during synchronization, in which case the partially-failed write would be lost.

I've explicitly mentioned DIRTY flag there. It should be set in metadata before first write sent to any disk, and it is what should force re-synchronization after unexpected reboot. If array get idle at some point after that, updated syncid in metadata will be flushed. If not -- possibly may force another flush after some time. I just don't want metadata to be flushed on every error, especially to departing disk, which is probably unavailable any more.

Clearing DIRTY after a 5s idle is a default policy that can be changed. For instance, with cheap flash drives it's sometimes desirable to disable idle metadata updates entirely, as frequent overwrites will shorten the drive lifespan.

In D9463#196505, @mav wrote:

While I see rationale in this change, I think it is not completely correct. Setting G_MIRROR_BUMP_SYNCID really bumps syncid only on next write, that may not happen (or happen much later), while disks are already out of sync at the point of this error received.

Revisiting this review request.
What do you think about adding G_MIRROR_BUMP_SYNCID_IMMEDIATE?
The idea is to bump Sync ID in the same way as Gen ID, that is, as soon as possible. Then, we could set G_MIRROR_BUMP_SYNCID upon a failed read request and G_MIRROR_BUMP_SYNCID_IMMEDIATE for other request types.

immediately bump sync-id on a write failure

This revision now requires review to proceed.Sep 12 2017, 12:15 PM

Harbormaster completed remote builds in B11515: Diff 32960.Sep 12 2017, 12:15 PM

This looks good to my eye

This revision is now accepted and ready to land.Sep 12 2017, 2:05 PM

markj accepted this revision.Sep 13 2017, 2:53 PM

markj added inline comments.

sys/geom/mirror/g_mirror.c
971 ↗	(On Diff #32960)	Suppose we had multiple I/Os queued to the disk, all of which are now returned with bio_error = ENXIO. Am I correct in thinking that this disk != NULL test ensures that we only bump the syncid for the first error?
989 ↗	(On Diff #32960)	Bikeshed: _IM seems like a slightly strange suffix to me. Maybe _NOW instead? Feel free to leave it as-is if you disagree.

use _NOW suffix instead of more cryptic _IM

This revision now requires review to proceed.Sep 15 2017, 1:16 PM

Harbormaster completed remote builds in B11582: Diff 33102.Sep 15 2017, 1:16 PM

avg added inline comments.Sep 15 2017, 1:47 PM

sys/geom/mirror/g_mirror.c
971 ↗	(On Diff #32960)	To be honest, I am not entirely sure... I patched the code without doing a full analysis of it. It seems that `g_mirror_destroy_disk` → `g_mirror_disconnect_consumer` → `g_mirror_kill_consumer` would indeed reset `cp->private` to NULL. And `g_mirror_update_disk(G_MIRROR_DISK_STATE_DISCONNECTED)` calls `g_mirror_destroy_disk`. But that happens asynchronously via an event queue (`g_mirror_event_send`), so there can be multiple events accumulated before the first one is processed. Thus, I think that this check is more generic. Basically, nothing to do if the disk is already gone. But there should be only a single syncid bump indeed. That's because `g_mirror_worker` first calls `g_mirror_update_disk` and then `g_mirror_update_device` for any disk event. So, when the first of the queued `G_MIRROR_DISK_STATE_DISCONNECTED` events is processed two things happen: disk is reset to NULL ensuring no future additional events for the disk `G_MIRROR_BUMP_SYNCID_NOW` is cleared in the sc As NULL disk ensures that `G_MIRROR_BUMP_SYNCID_NOW` won't be set again in `sc`, there will a single syncid bump for all the queued events.