Page MenuHomeFreeBSD

gmirror: Evaluate mirror components against newest metadata copy
ClosedPublic

Authored by cem on Nov 21 2018, 6:42 AM.
Tags
None
Referenced Files
Unknown Object (File)
Wed, Dec 25, 12:27 AM
Unknown Object (File)
Tue, Dec 24, 11:19 AM
Unknown Object (File)
Tue, Dec 24, 5:06 AM
Unknown Object (File)
Fri, Dec 20, 4:56 PM
Unknown Object (File)
Fri, Dec 20, 4:36 PM
Unknown Object (File)
Fri, Dec 20, 4:33 PM
Unknown Object (File)
Fri, Dec 20, 4:23 PM
Unknown Object (File)
Fri, Dec 20, 4:21 PM

Details

Summary

If we happen to taste a stale mirror component first, don't reject valid,
newer components that have differing metadata from the stale component
(during STARTING). Instead, update our view of the most recent metadata and
remove any stale disks as new components are tasted.

This could have manifested as a stale mirror component with mismatched
'md_all' causing g_mirror_check_metadata to reject valid mirrors.

Like mediasize beforehand, remove all checks from g_mirror_check_metadata
that reflect mirror metadata that can change over the lifetime of the
mirror. None of them are reason to bounce a good mirror with newer
generation than existing components.

Additionally, add a knob, kern.geom.mirror.launch_mirror_before_timeout, to
force gmirrors to wait out the full timeout (kern.geom.mirror.timeout)
before transitioning from STARTING to RUNNING. This is a kludge to help
ensure all mirror components are tasted before RUNNING a gmirror.

When we are instructed to forget mirror components, bump the generation id
to avoid confusion with such stale components later.

PR: 232835, (related-but-does-not-fix: 232671)
Submitted by: Cindy Yang <cyang AT isilon.com>
Sponsored by: Dell EMC Isilon

Diff Detail

Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 20916
Build 20281: arc lint + arc unit

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
sys/geom/mirror/g_mirror.c
2558

I believe this reintroduces the bug fixed in r306743.

3117

Why? Recomputing it is cheap.

cem marked 8 inline comments as done.Nov 25 2018, 8:24 PM
cem added inline comments.
sys/geom/mirror/g_mirror.c
2558

I don't think it does — the kick-out of stale components, and marking for SYNCID bump were just relocated from this function to g_mirror_add_disk.

3117

It just makes the code clearer; it's not about the computation cost. It could be replaced with some more complicated expression (LIST_FIRST(&sc->sc_disks)->d_genid) in g_mirror_refresh_device but that wouldn't really improve the code, as far as I can tell.

3538

Will fix

3546

Yeah, me too (I'm not the original author of the patch). I tried to avoid making too stylistic changes (other than minimal style(9) stuff) to the Isilon patch when uploading to avoid adding conflict when we merge this back. I'm happy to make the change if you want, though.

3559

I agree and suggested it in the internal code review ;-). It just didn't happen. I'll make the change.

3564

It can't be changed during the lifetime of a gmirror; the rest of these can.

sys/geom/mirror/g_mirror.c
2558

Suppose g_launch_mirror_before_timeout is true. The g_mirror_event_send() call in g_mirror_add_disk() will start the mirror once a second mirror is found. The event_send is synchronous, so the mirror's state will be updated to RUNNING before we perform the genid check on the last discovered component. In particular, in a mirror with two components, we'll always end up adding the broken component to the mirror, so the bug being introduced is actually larger than the one from r306743.

3564

How can mediasize change?

Ideally this would come with a regression test. There are some in tests/sys/geom/class/mirror/; they use fail points to trigger errors. (They need to be converted to gnop at some point: don't run the tests on a system that actually uses gmirror. :/)

cem marked 4 inline comments as done.Nov 25 2018, 10:20 PM

Ideally this would come with a regression test.

Agreed. I am happy to hold off on committing this until such a test has been created.

There are some in tests/sys/geom/class/mirror/; they use fail points to trigger errors. (They need to be converted to gnop at some point: don't run the tests on a system that actually uses gmirror. :/)

Ah, I was kind of hoping they compiled GEOM into a userspace simulation instead. I need to figure out how to run a bhyve test VM for this.

sys/geom/mirror/g_mirror.c
2558

Oh, I see. I think the sc->sc_state == G_MIRROR_DEVICE_STATE_RUNNING condition (line 2950~2967 or so in the current incarnation of the diff) needs to be removed from g_mirror_add_disk to prevent us from adding the broken m2 component after initially discovering m1 in the r306743 scenario.

For the ordering where m2 is tasted before m1, we either need to run the kickout loop before event_send, or g_mirror_ndisks could be modified not to count stale components. My concern with the former option would be the mirror having zero components (ephemerally). I'm not sure there's any reason to be concerned about the latter option. I suppose a 3rd option would be to just temporarily bump sc_ndisks while the event_send and kickout loop runs in g_mirror_add_disk, and decrement it afterwards.

3564

g_mirror_ctl_resize (r258357). It's why the mediasize check was disabled in g_mirror_check_metadata (although I don't understand why it wasn't just deleted).

  • Fix indentation
  • Common subroutine for softc field initialization from md
cem marked an inline comment as done.Nov 26 2018, 5:48 PM
cem added inline comments.
sys/sys/param.h
63 ↗(On Diff #51128)

Ah fuck, I rebased my patch series wrong. Ignore this.

cem marked 4 inline comments as done.Nov 26 2018, 7:52 PM
cem added inline comments.
sys/geom/mirror/g_mirror.c
2558

FWIW, I should mention that I introduced this bug (deleting the "Remove all disks without the biggest genid" from g_mirror_device_update) in my own modification to Cindy's patch — it's not present in the patch she submitted. (Blame where blame is due 🙂.)

And I should also add that a 4th option would be to just restore the kickout loop to g_mirror_update_device, perhaps via common subroutine.

sys/geom/mirror/g_mirror.c
2508

Extra #if

3117

I mean, you can just loop over the disks each time, which is what we do in the INVARIANTS block anyway. Keeping it in the softc is just more state to keep track of, but I don't feel strongly about it.

3544

Extra newline.

3546

I'd prefer to have the suggested change; in these situations I'd just do the merge back myself while the change is still fresh, rather than waiting for the tip of the merge branch to catch up.

cem planned changes to this revision.Nov 27 2018, 5:21 AM
cem marked 8 inline comments as done.
cem added inline comments.
sys/geom/mirror/g_mirror.c
2508

Sure. I meant this bit to be a discussion point.

If we got here, we have some number of disks, all with the same genid. The one with the latest syncid says sc_ndisks is N. If there are other disk(s) with the same genid and sc_ndisks != N, do we want to kick them out? That was the existing behavior (via g_mirror_check_metadata).

I guess preserving the existing behavior for now makes some sense (smaller bite-size change) even if it isn't the right long-term behavior. I'm happy to drop the extraneous #if and comment(s).

2558

I'll move the kick-out logic back to g_mirror_device_update, but locate it prior to the auto-start logic.

cem marked an inline comment as done.
  • Restore stale component kick-out to g_mirror_device_update, but relocate it *prior* to determining whether or not we have quorum to RUN the mirrorset.
  • Perform sanity check assertions re: accumulated sc_genid and sc_syncid at the same time.
  • For now, continue to kill mismatched md_all components *of the quorum genid*. Newer genid md_all components will override mirrorset configuration at taste time.
  • Skip stale mirrors during g_mirror_add_disk prior to attempting to refresh from them.
  • Remove stale component kick-out from g_mirror_add_disk. As Markj@ points out, g_mirror_add_disk synchronously invokes g_mirror_disk_update and g_mirror_device_update via g_mirror_event_send() in the middle of g_mirror_add_disk. So we might as well centralize back in g_mirror_device_update.
  • Simplify early exit logic in g_mirror_refresh_device(), per markj@.
sys/geom/mirror/g_mirror.c
2508

I think the existing behaviour was reasonable. At least, I can't think of a scenario where it'd be harmful.

But now we're being kind of inconsistent - shouldn't we do the same if, e.g. the balance algorithm is different?

2509

Why is the bump necessary? Shouldn't we be able to assume that the inconsistent disk already has a lower syncid?

sys/geom/mirror/g_mirror_ctl.c
985–994

What's the point of this part of the change? If g_mirror_ndisks(sc, -1) != sc->sc_ndisks, then we started with a missing disk, in which case we must have bumped the syncid on the remaining disks, so we will not use the lost disk's metadata if it reappears.

sys/geom/mirror/g_mirror.c
2509

I added this bit b/c during testing I saw syncid did not get automatically bumped if a disk failed to add to a running mirror during the early part of a rebalance. If the node got rebooted at this very instant, the valid components and disconnected component would have the same genid and syncid.

The 'forget' ioctl allows us to force gmirror to increment the genid from the userspace. The ioctl anyway updates sc->ndisks when a mirror is degraded, so it makes sense to update genid as well.

cem marked 3 inline comments as done.Nov 28 2018, 3:08 AM
cem added inline comments.
sys/geom/mirror/g_mirror.c
2508

Sure, I guess we could reject mismatched balance and slice here as well. (And mediasize?)

2509

I don't see any reason to assume the inconsistent disk has lower syncid, and the bump is pretty harmless (one-time write). If we can assume the inconsistent disk has lower syncid, we might as well KASSERT it.

sys/geom/mirror/g_mirror_ctl.c
985–994

I don't know the exact motivation for this change. Cindy?

g_mirror_ndisks(sc, -1) != sc->sc_ndisks can sometimes occur at runtime, when a disk disappears (STATE_DISCONNECTED) or is manually removed (g_mirror_ctl_remove). But it isn't clear to me if that can race with this function or not. If it is an invariant, maybe we should MPASS(sc->sc_state == G_MIRROR_DEVICE_STATE_STARTING);.

sys/geom/mirror/g_mirror_ctl.c
985–994

Argh ... my previous reply was misaligned. Here you go:

I added this bit b/c during testing I saw syncid did not get automatically bumped if a disk failed to add to a running mirror during the early part of a rebalance. If the node got rebooted at this very instant, the valid components and disconnected component would have the same genid and syncid. The 'forget' ioctl allows us to force gmirror to increment the genid from the userspace. The ioctl anyway updates sc->ndisks when a mirror is degraded, so it makes sense to update genid as well.

sys/geom/mirror/g_mirror.c
2509

They have contradictory metadata. It shouldn't be possible for them to have the same syncid and genid unless something other than gmirror was messing with the metadata block. At least, if they do, then there is a bug elsewhere in gmirror.

sys/geom/mirror/g_mirror_ctl.c
985–994

Sorry for being picky, but gmirror has historically been pretty tricky to reason about, so I really want to make sure I understand the intent of each change.

Can you elaborate on what "failed to add" means? In general, any time a mirror is in a degraded state, the remaining components should have a larger genid or syncid than the missing components. If a component is kicked out while the mirror is running, we should bump the genid. If the mirror is started with one or more components missing, we bump the syncid (upon the first write to the mirror).

Unrelated to that, the condition in the if statement is kind of weird. How can we have a running mirror with 0 active disks? Why do we not just always bump the genid?

The 'forget' ioctl allows us to force gmirror to increment the genid from the userspace.

Currently it does that only if the mirror is degraded, so "force" isn't really true.

The ioctl anyway updates sc->ndisks when a mirror is degraded, so it makes sense to update genid as well.

The "insert" verb also updates sc->ndisks but doesn't update genid. I agree that there is no harm in updating the genid here, but I don't see the point of it given that this change also adds code to kick out mirrors with inconsistent md_all fields.

sys/geom/mirror/g_mirror_ctl.c
985–994

No problem. Thanks for reviewing this patch.

The "failed to add" scenario actually came out of a field issue. On a factory fresh node, during early boot a couple of SEDs dropped out while gmirror tried to replicate/sync the root device from the primary copy onto the secondary copy. It happened twice in a row and finally the 3rd attempt succeeded and the root mirror transitioned to RUNNING.

The data I saw was that one of the disconnected components had the same genid as the that of the valid ones. I think you're correct w.r.t. using the highest syncid as a tie breaker to pick out the correct component.

All we are interested is that genid gets incremented when the mirror is degraded. As our userspace daemon runs the rebalance routine as often as it can, forget() is called to proactively update genid as needed. This is a nice trick for gmirror to outright reject old component on the next boot based on genid comparison.

I think the g_mirror_ndisks(sc, G_MIRROR_DISK_STATE_ACTIVE) > 0 is me being paranoid and trying to avoid hitting the assert in g_mirror_bump_genid().

cem marked 3 inline comments as done.Nov 28 2018, 9:44 PM
cem added inline comments.
sys/geom/mirror/g_mirror.c
2509

I guess all of the metadata is encoded in a single disk block and locks are held such that the genid bump should happen simultaneously with metadata updates. Operative word being "should;" it wouldn't surprise me if there are bugs remaining in gmirror :-).

I guess we should move this kick-out loop above auto-start, like the genid one (and add the other metadata fields that can change at runtime). Should we actually kick out disks with inconsistent metadata if they have a lower syncid, or just dirty them?

cem marked an inline comment as done.Dec 5 2018, 1:10 AM
cem added inline comments.
sys/geom/mirror/g_mirror_ctl.c
985–994

Should I drop this section, at least for the initial upstream revision?

Move metadata kickout loop above quorum check.

Testing notes: I ran the existing regression suite and did not observe any new
regressions. I still need to add some tests.

sys/geom/mirror/g_mirror_ctl.c
985–994

Sounds good to me.

markj added inline comments.
sys/geom/mirror/g_mirror.c
66

"before launching mirrors"

sys/geom/mirror/g_mirror_ctl.c
985–994

Thanks, I'd also prefer to drop it for this revision, as it's logically somewhat separate.

This revision is now accepted and ready to land.Dec 5 2018, 5:25 PM
cem marked 3 inline comments as done.
  • gmirrors -> mirrors in sysctl description
  • drop g_mirror_ctl.c change for now
This revision now requires review to proceed.Dec 5 2018, 6:28 PM
markj added a subscriber: pho.

@pho has a number of gmirror stress tests. Peter, do you mind giving this patch a try?

This revision is now accepted and ready to land.Dec 5 2018, 7:51 PM

@pho has a number of gmirror stress tests. Peter, do you mind giving this patch a try?

I have started the tests.

In D18062#392769, @pho wrote:

I have started the tests.

Thanks, Peter!

Add a basic test case showing we evict stale mirror components (on the basis of
genid, anyway).

This revision now requires review to proceed.Dec 6 2018, 2:00 AM
cem marked an inline comment as done.Dec 6 2018, 2:23 AM
cem added inline comments.
tests/sys/geom/class/mirror/component_selection.sh
100–110 ↗(On Diff #51635)

For reasons I'm not sure of, in the kyua environment this times out; it seems the sub-shell process is stuck in ttyread for some reason. In atf-sh, of course, it runs fine.

tests/sys/geom/class/mirror/component_selection.sh
43 ↗(On Diff #51635)

What is the purpose of this sleep? Sleeps in tests are very dangerous. You should figure out how to eliminate them.

cem marked an inline comment as done.Dec 6 2018, 2:51 AM
cem added inline comments.
tests/sys/geom/class/mirror/component_selection.sh
43 ↗(On Diff #51635)

Everything geom does is asynchronous. The sleep gives it a chance to settle in the background, even if it is logging at a high debug level to a relatively slow console device.

I agree that the sleep represents a hacky way to wait for an asynchronous event, and may timeout before that event actually fires, but I don't really follow how that makes them very dangerous.

(Tiny sleeps are commonplace in existing geom tests and I don't think eliminating the need for them is within thee scope of this revision.)

I'm much more concerned with why cleanup() times out in kyua, but not atf-sh.

Fix brain-o in cleanup leading to tty-read stall / kyua failure.

Now the test passes in kyua.

I have not observed any problems with D18062.51638.diff
LGTM.

Test code:

  • Remove extraneous sleeps
  • Check that generation ids compare in expectated order
  • Add additional tests that mirror components are evicted during STARTING, prior to 'gmirror status', via geom reference counts.
  • Use gmirror label -h and insert -h instead of gnop -o 512

(The test continues to pass.)

tests/sys/geom/class/mirror/component_selection.sh
64 ↗(On Diff #51681)

Reducing the value of the sleeps only makes it more likely that the test will fail someday, and no one will know why. Better to remove all of the sleeps and replace them with short polling loops.

cem marked an inline comment as done.Dec 6 2018, 10:04 PM
cem added inline comments.
tests/sys/geom/class/mirror/component_selection.sh
64 ↗(On Diff #51681)

I think that's out of scope for this change.

tests/sys/geom/class/mirror/component_selection.sh
64 ↗(On Diff #51681)

How is it out of scope? This is new code we're talking about, not a bug fix to existing code.

cem marked an inline comment as done.Dec 6 2018, 10:20 PM
cem added inline comments.
tests/sys/geom/class/mirror/component_selection.sh
64 ↗(On Diff #51681)

The test code is adjacent to a bugfix to existing code.

Several other tests in this component use short sleeps to yield / allow geom to taste/settle. Adding some polling mechanism expands the scope of work required. The point of the test is to demonstrate the bug is fixed, not to be glorious code in its own right.

I can leave the test disabled if that's really your preference... you might also want to disable 6_test.sh and sync_error.sh.

This revision was not accepted when it landed; it landed in state Needs Review.Dec 6 2018, 11:56 PM
This revision was automatically updated to reflect the committed changes.