Eliminate sleeps in the geom ATF tests
Needs RevisionPublic
Actions

Authored by asomers on Dec 17 2018, 4:51 PM.

Details

Reviewers

cem

Summary

Eliminate sleeps in the geom ATF tests

Replace sleeps with polling loops in all geom ATF tests (but not the TAP
tests). This also fixes a teardown race in the ggate tests. "ggatec
destroy" apparently returns before the consumer releases the underlying
device, causing the "mdconfig -d" command to occasionally fail during
cleanup.

Diff Detail

Repository

rS FreeBSD src repository - subversion

Lint

Lint Passed

Unit

No Test Coverage

Build Status

Buildable 21614
Build 20910: arc lint + arc unit

Event Timeline

asomers created this revision.Dec 17 2018, 4:51 PM

Herald added a subscriber: imp. · View Herald TranscriptDec 17 2018, 4:51 PM

Harbormaster completed remote builds in B21614: Diff 52114.Dec 17 2018, 4:51 PM

I have some qualms with the API of the generic interface. I don't think most things care about the time step, and it might even make sense to scale it by something like architecture (as a proxy for CPU performance); it should be left out of the API. Absolute timeout should be optional or in a separate interface. And the condition is needlessly complicated, at least in the gmirror case changed (double negative).

In gmirror, I think this change introduces two regressions (I didn't look at other geom classes), so I'd ask for those to be fixed before committing.

tests/sys/geom/class/geom_subr.sh
78	I don't think most things need to care about the poll interval.
84	Just use `if`
tests/sys/geom/class/mirror/conf.sh
17	1000 is a regression — the previous code was correct, i.e., no timeout (aside from any global test timeout). I don't like the additional complexity added in testing this condition, either.
tests/sys/geom/class/mirror/sync_error.sh
37	I think this may introduce a race condition. This sleep is accounting for the transition to SYNCHRONIZING; syncwait is accounting for the transition from SYNCHRONIZING.

This revision now requires changes to proceed.Dec 17 2018, 6:26 PM

asomers added inline comments.Dec 17 2018, 9:20 PM

tests/sys/geom/class/mirror/conf.sh
17	I think it would be better to use a reasonable timeout for `syncwait`, just like most `wait_for` callers do. That will result in a better error message on failure than using the global test timeout. However, `syncwait` is also used by TAP tests. I don't think the 1000s wait is a problem, because it's longer than the default test timeout. Would you rather I update all of the TAP tests to use a reasonable timeout?
tests/sys/geom/class/mirror/sync_error.sh
37	Then what, pray tell, should the status be before it becomes `SYNCHRONIZING`?

cem added inline comments.Dec 17 2018, 10:14 PM

tests/sys/geom/class/mirror/conf.sh
17	I think it would be better to use a reasonable timeout for syncwait The problem with that statement that there is no "reasonable timeout" for mirror synchronization. I don't think the 1000s wait is a problem, because it's longer than the default test timeout. I don't think we should hardcode assumptions about the current, default test timeout.
tests/sys/geom/class/mirror/sync_error.sh
37	Empirically, "COMPLETE".

asomers added inline comments.Dec 17 2018, 10:40 PM

tests/sys/geom/class/mirror/conf.sh
17	I don't think we should hardcode assumptions about the current, default test timeout. Then how about `INT_MAX` instead of 1000?
tests/sys/geom/class/mirror/sync_error.sh
37	Wait, the status is `COMPLETE` _before_ it changes to `SYNCHRONIZING`? That sounds like a bug in gmirror. Is there a generation count or some other way for us to tell that the rebuild hasn't started yet? ZFS does this by using the `DEGRADED` pool status when a disk is missing.

cem added inline comments.Dec 17 2018, 10:56 PM

tests/sys/geom/class/mirror/conf.sh
17	I think INT_MAX is a poor way to spell optional, especially in shell. I'm not sure how wide the `$((a+b))` calculation in wait_for is performed at, but it seems like INT_MAX might cause overflow there on (?)32-bit platforms.
tests/sys/geom/class/mirror/sync_error.sh
37	Look at the code in the test. The gmirror had 1/1 component before `$md2` was added to the array. `COMPLETE` is accurate — there is no missing disk. Yes, there is a mirrorset generation number in the kernel that is incremented when components are added. You'd have to investigate how to look at that value from userspace. Are you similarly unfamiliar with the other GEOM components you're changing in this revision? If so, that seems problematic for the rest of the revision.

ngie added inline comments.Dec 19 2018, 1:55 AM

tests/sys/geom/class/mirror/conf.sh
17	`-1` would be fine though, right?

ngie added inline comments.Dec 19 2018, 2:05 AM

tests/sys/geom/class/mirror/conf.sh
17	@cem: I think I see what you mean. @asomers's concern might be coming from a place of there being a problem with a global timeout being exceeded in TAP testcases, whereas your concern is over the complexity required in both cases? The right solution in this case would probably be to convert the TAP tests to ATF tests, since it's in an intermediary state (mix of ATF and TAP tests using the same underlying logic), reducing the need for having to worry about a global timeout. This intermediary state is causing an unnecessary split in terms of logic, forcing there to be unnecessary complexity (here and elsewhere) :/.
tests/sys/geom/class/mirror/sync_error.sh
37	@cem: I think @asomers addressing the practicality/determinant behavior of the sleep seems valid. What happens if the timeout (0.1) was incorrect, e.g., skewed or needed to be increased, such that the state that syncwait was expecting as initial state was not the expected state?