Page MenuHomeFreeBSD

gmirror: treat ENXIO as disk disconnect, not media error
ClosedPublic

Authored by avg on Feb 6 2017, 6:51 PM.
Tags
None
Referenced Files
Unknown Object (File)
Thu, Nov 28, 3:16 AM
Unknown Object (File)
Wed, Nov 27, 1:30 AM
Unknown Object (File)
Tue, Nov 26, 8:30 AM
Unknown Object (File)
Tue, Nov 26, 6:31 AM
Unknown Object (File)
Thu, Nov 21, 10:22 PM
Unknown Object (File)
Thu, Nov 14, 11:50 PM
Unknown Object (File)
Thu, Nov 7, 4:13 AM
Unknown Object (File)
Oct 8 2024, 7:49 PM
Subscribers

Details

Summary

In theory, all data access errors mean that a member is out of sync
at most. But they were treated as more serious errors to avoid the
situation where a flaky disk gets repeatedly disconnected, re-synchronized,
reconnected and then disconnected again.

ENXIO is a special error that means that the member disk disappeared,
so it should get the same handling as the GEOM orphaning event.
There is a better chance that when the disk is reconnected, it will be
a good member again.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 11582
Build 11935: arc lint + arc unit

Event Timeline

avg retitled this revision from to gmirror: treat ENXIO as disk disconnect, not media error.
avg updated this object.
avg edited the test plan for this revision. (Show Details)
avg added reviewers: mav, markj, pjd.
sys/geom/mirror/g_mirror.c
985–991

There is similar code for handling errors from metadata updates and synchronization I/O. Do we want similar handling there?

sys/geom/mirror/g_mirror.c
985–991

I think that the metadata update error is certainly a more grave error as the affected member would have a mismatching configuration on disk. I haven't made up my mind about the synchronization error, so I'd prefer to keep the status quo for it.

markj edited edge metadata.
This revision is now accepted and ready to land.Feb 7 2017, 5:57 PM

While I see rationale in this change, I think it is not completely correct. Setting G_MIRROR_BUMP_SYNCID really bumps syncid only on next write, that may not happen (or happen much later), while disks are already out of sync at the point of this error received.

In D9463#196505, @mav wrote:

While I see rationale in this change, I think it is not completely correct. Setting G_MIRROR_BUMP_SYNCID really bumps syncid only on next write, that may not happen (or happen much later), while disks are already out of sync at the point of this error received.

Oh, that's true. Thank you for pointing out the flaw. Do you have any ideas / suggestions on how that case should be handled?

In D9463#196688, @avg wrote:

Do you have any ideas/ suggestions on how that case should be handled?

My thoughts are now far from this, but just as a thought: I think it would be fine to bump the sync ID immediately on write error, while not necessary update the metadata immediately. Metadata on disks should already have DIRTY flag set while write operation is running, that should protect us in case system crash before we write updated ones.

In D9463#196690, @mav wrote:
In D9463#196688, @avg wrote:

Do you have any ideas/ suggestions on how that case should be handled?

My thoughts are now far from this, but just as a thought: I think it would be fine to bump the sync ID immediately on write error, while not necessary update the metadata immediately. Metadata on disks should already have DIRTY flag set while write operation is running, that should protect us in case system crash before we write updated ones.

If we crash before writing the updated metadata, I believe there's nothing preventing us from using the stale mirror as the master during synchronization, in which case the partially-failed write would be lost.

In D9463#196694, @markj wrote:
In D9463#196690, @mav wrote:

My thoughts are now far from this, but just as a thought: I think it would be fine to bump the sync ID immediately on write error, while not necessary update the metadata immediately. Metadata on disks should already have DIRTY flag set while write operation is running, that should protect us in case system crash before we write updated ones.

If we crash before writing the updated metadata, I believe there's nothing preventing us from using the stale mirror as the master during synchronization, in which case the partially-failed write would be lost.

I've explicitly mentioned DIRTY flag there. It should be set in metadata before first write sent to any disk, and it is what should force re-synchronization after unexpected reboot. If array get idle at some point after that, updated syncid in metadata will be flushed. If not -- possibly may force another flush after some time. I just don't want metadata to be flushed on every error, especially to departing disk, which is probably unavailable any more.

In D9463#196703, @mav wrote:
In D9463#196694, @markj wrote:
In D9463#196690, @mav wrote:

My thoughts are now far from this, but just as a thought: I think it would be fine to bump the sync ID immediately on write error, while not necessary update the metadata immediately. Metadata on disks should already have DIRTY flag set while write operation is running, that should protect us in case system crash before we write updated ones.

If we crash before writing the updated metadata, I believe there's nothing preventing us from using the stale mirror as the master during synchronization, in which case the partially-failed write would be lost.

I've explicitly mentioned DIRTY flag there. It should be set in metadata before first write sent to any disk, and it is what should force re-synchronization after unexpected reboot. If array get idle at some point after that, updated syncid in metadata will be flushed. If not -- possibly may force another flush after some time. I just don't want metadata to be flushed on every error, especially to departing disk, which is probably unavailable any more.

Clearing DIRTY after a 5s idle is a default policy that can be changed. For instance, with cheap flash drives it's sometimes desirable to disable idle metadata updates entirely, as frequent overwrites will shorten the drive lifespan.

In D9463#196505, @mav wrote:

While I see rationale in this change, I think it is not completely correct. Setting G_MIRROR_BUMP_SYNCID really bumps syncid only on next write, that may not happen (or happen much later), while disks are already out of sync at the point of this error received.

Revisiting this review request.
What do you think about adding G_MIRROR_BUMP_SYNCID_IMMEDIATE?
The idea is to bump Sync ID in the same way as Gen ID, that is, as soon as possible. Then, we could set G_MIRROR_BUMP_SYNCID upon a failed read request and G_MIRROR_BUMP_SYNCID_IMMEDIATE for other request types.

immediately bump sync-id on a write failure

This revision now requires review to proceed.Sep 12 2017, 12:15 PM

This looks good to my eye

This revision is now accepted and ready to land.Sep 12 2017, 2:05 PM
markj added inline comments.
sys/geom/mirror/g_mirror.c
971

Suppose we had multiple I/Os queued to the disk, all of which are now returned with bio_error = ENXIO. Am I correct in thinking that this disk != NULL test ensures that we only bump the syncid for the first error?

989

Bikeshed: _IM seems like a slightly strange suffix to me. Maybe _NOW instead? Feel free to leave it as-is if you disagree.

use _NOW suffix instead of more cryptic _IM

This revision now requires review to proceed.Sep 15 2017, 1:16 PM
sys/geom/mirror/g_mirror.c
971

To be honest, I am not entirely sure... I patched the code without doing a full analysis of it.
It seems that g_mirror_destroy_diskg_mirror_disconnect_consumerg_mirror_kill_consumer would indeed reset cp->private to NULL. And g_mirror_update_disk(G_MIRROR_DISK_STATE_DISCONNECTED) calls g_mirror_destroy_disk. But that happens asynchronously via an event queue (g_mirror_event_send), so there can be multiple events accumulated before the first one is processed.

Thus, I think that this check is more generic. Basically, nothing to do if the disk is already gone.

But there should be only a single syncid bump indeed. That's because g_mirror_worker first calls g_mirror_update_disk and then g_mirror_update_device for any disk event. So, when the first of the queued G_MIRROR_DISK_STATE_DISCONNECTED events is processed two things happen:

  • disk is reset to NULL ensuring no future additional events for the disk
  • G_MIRROR_BUMP_SYNCID_NOW is cleared in the *sc*

As NULL disk ensures that G_MIRROR_BUMP_SYNCID_NOW won't be set again in sc, there will a single syncid bump for all the queued events.

sys/geom/mirror/g_mirror.c
971

Right, ok. I went through the same analysis, so I believe this is correct.

This revision was automatically updated to reflect the committed changes.