Fix gmirror BIO_FLUSH and sync collision queuing.

Authored by markj on Wed, Dec 20, 7:50 PM.



This revision aims to address a few BIO ordering bugs:

BIO_FLUSH requests are dispatched directly to the mirrors from
g_mirror_start() rather than being queued for processing by the gmirror
worker thread. This means that a BIO_FLUSH may "jump ahead" of a
preceding BIO_WRITE, which is undesirable. Address this by handling
BIO_FLUSH requests the same as BIO_{READ,WRITE,DELETE}. We take care to
avoid breaking mirrors if a BIO_FLUSH comes back with EOPNOTSUPP.

When a BIO_WRITE or BIO_DELETE request overlaps with an offset that's
actively being synchronized (i.e., we're currently reading at that
offset in order to copy its contents to a synchronizing mirror, or we're
currently writing to that offset as part of a block copy), the request
is placed in the regular_delayed queue. Once the colliding
synchronization BIO completes, BIOs from the delayed queue are put back
on the head of the work queue. However, this allows writes to be
reordered when synchronizing. Fix this by freezing all I/O requests
(including reads) when a collision occurs.

The regular_delayed queue may also reorder reads and writes to the same
offset, since reads were never placed in the regular_delayed queue. Fix this
the same as above: once a BIO_WRITE or BIO_DELETE collides with a
synchronization BIO, all subsequent regular I/Os are placed in the
regular_delayed queue until that initial collision is over, at which
point all BIOs in regular_delayed are moved back to the main I/O queue.

Test Plan

Peter Holm ran stress2 scenarios with this patch applied.

In-tree tests pass.

Diff Detail

rS FreeBSD src repository
markj created this revision. Wed, Dec 20, 7:50 PM
markj edited the summary of this revision. Wed, Dec 20, 7:53 PM
markj edited the test plan for this revision.
imp added a subscriber: imp. Wed, Dec 20, 11:12 PM

On my first reading I like it. I don't have the time to give it a very close reading, though.

1980 ↗(On Diff #36825)

I love these comments. It always takes me time to recall what each of these cases was for...

cem added a subscriber: cem. Wed, Dec 20, 11:57 PM
cem added a reviewer: cem. Fri, Dec 22, 2:36 PM

I've only looked at the first half of this patch. Most of my comments aren't requests for any change, but clarifying questions for me, as someone not very familiar with gmirror. Gotta run, will look more later.

920–921 ↗(On Diff #36825)

Should we flag disks broken for read errors? Or at least, we could track this with two different flags, if it's useful. Edit: Oh, I see, this is a straightforward extraction of some code below into a subroutine. Well, feel free to ignore this functional change question then.

930–936 ↗(On Diff #36825)

What do these different bumps do, and why do we want different ones for each of these cases? Might be nice to comment for folks unfamiliar with gmirror (me :-)). (Edit: I see this is just a straightforward extraction into subroutine of earlier code, but it'd still be nice to document while you're here :-).)

977 ↗(On Diff #36825)

Nice! More failpoints is a good thing.

988–996 ↗(On Diff #36825)

(Trivial style nit: this could be extracted as a subroutine.)

1018–1020 ↗(On Diff #36825)

Why do we have a different completion process for the very special case of a gmirror with a single disk? It seems like we shouldn't special-case it.

1040–1041 ↗(On Diff #36825)

That's an interesting interpretation of "some requests succeeded." Are we starting to kick out the broken mirrors for WRITE/DELETE at this point? I suppose for disks that don't handle FLUSH it may be ok to keep them?

markj added inline comments. Fri, Dec 22, 3:57 PM
920–921 ↗(On Diff #36825)

I'm not sure yet. In OneFS 7 or earlier we actually added a state, "unreadable," for this case. We attempt to re-synchronize unreadable mirrors rather than kick them out, in the hope that overwriting the bad sector(s) will allow reads to succeed again. I'm undecided as to whether that change should be upstreamed; it's untested AFAIK and has probably rotted somewhat in the BSD10/11 merges. There's an internal bug assigned to me to investigate further and sync with upstream.

930–936 ↗(On Diff #36825)

The BUGS section of the gmirror man page says, "There should be a section with an implementation description." :) I started writing one but it's not ready yet.

The bump flags just tell the gmirror worker thread to increment a pair of generation counters associated with the gmirror: the genid (bumped when a mirror is kicked out as a result of errors) and the syncid (bumped when the gmirror is known to have stale mirrors). ENXIO is handled with a syncid bump: see r323612.

977 ↗(On Diff #36825)

Yup, they've been pretty handy for writing tests.

1018–1020 ↗(On Diff #36825)

What's happening here is that a read from one of the mirrors failed. If there's only one active mirror left, it means that we can't read at all and so we return the error to upper layers. If there's more than one, we effect a retry from a different mirror by requeuing the read BIO.

1040–1041 ↗(On Diff #36825)

Yup, g_mirror_regular_request_error() kicked out the mirror that returned an error. Note that I specifically short-circuit the case where BIO_FLUSH came back with EOPNOTSUPP - we don't want to kick the mirrors out in that case at least.

cem added inline comments. Tue, Dec 26, 5:44 PM
1018–1020 ↗(On Diff #36825)

Oh, ok.

Does something enforce that the bio actually goes to a different disk the 2nd time around?

markj added inline comments. Tue, Dec 26, 8:23 PM
1018–1020 ↗(On Diff #36825)

Yes. g_mirror_regular_request_error() will kick out the disk that returned the error (or at least flag it in some way), and we'll only select a healthy mirror when the retry request comes around.

imp added a comment. Wed, Dec 27, 6:37 PM

A question: Does it make sense to 'resilver' those sectors that return EIO rather than simply kicking the disk to the curb, and only kick it to the curb if the resilver fails to read back correctly?

markj added a comment. Thu, Dec 28, 2:13 AM
In D13559#285371, @imp wrote:

A question: Does it make sense to 'resilver' those sectors that return EIO rather than simply kicking the disk to the curb, and only kick it to the curb if the resilver fails to read back correctly?

Yes, I think so. The gmirror mods in OneFS already do this in a kind of half-baked way (I didn't write them :)); it's related to the "unreadable" state I mentioned in another comment in this review. I'm planning to rework that for upstream, along with a mechanism to proactively scan for such sectors. In particular, I'd like to add a read-balancing algorithm which attempts to read the requested sector(s) from all mirrors rather than just one, and be able to temporarily switch to that for the purpose of reading back the whole gmirror with dd(1) or so.

markj added a comment. Tue, Jan 2, 3:49 PM

I plan to commit this in the next day or two if there are no objections.

imp accepted this revision. Tue, Jan 2, 4:42 PM

I'm OK with this. It's an improvement and clearly better w/o any new issues I can spot.

I'm a little uneasy still about gmirror, but to erase that discomfort is too high a bar for me to set for this review :)

This revision is now accepted and ready to land. Tue, Jan 2, 4:42 PM
markj added a comment. Tue, Jan 2, 5:03 PM
In D13559#287445, @imp wrote:

I'm a little uneasy still about gmirror, but to erase that discomfort is too high a bar for me to set for this review :)

Hm, are there any specific issues you can describe here? I plan to continue addressing problems in gmirror.

This revision was automatically updated to reflect the committed changes.