Page MenuHomeFreeBSD

Eliminate sleeps in the geom ATF tests
Needs RevisionPublic

Authored by asomers on Dec 17 2018, 4:51 PM.
Tags
None
Referenced Files
Unknown Object (File)
Dec 15 2024, 7:15 PM
Unknown Object (File)
Nov 24 2024, 10:53 AM
Unknown Object (File)
Nov 9 2024, 3:16 PM
Unknown Object (File)
Nov 3 2024, 2:38 AM
Unknown Object (File)
Oct 3 2024, 2:22 PM
Unknown Object (File)
Oct 2 2024, 12:34 PM
Unknown Object (File)
Sep 5 2024, 2:25 AM
Unknown Object (File)
Aug 6 2024, 9:24 PM
Subscribers

Details

Reviewers
cem
Summary

Eliminate sleeps in the geom ATF tests

Replace sleeps with polling loops in all geom ATF tests (but not the TAP
tests). This also fixes a teardown race in the ggate tests. "ggatec
destroy" apparently returns before the consumer releases the underlying
device, causing the "mdconfig -d" command to occasionally fail during
cleanup.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 21614
Build 20910: arc lint + arc unit

Event Timeline

cem requested changes to this revision.Dec 17 2018, 6:26 PM

I have some qualms with the API of the generic interface. I don't think most things care about the time step, and it might even make sense to scale it by something like architecture (as a proxy for CPU performance); it should be left out of the API. Absolute timeout should be optional or in a separate interface. And the condition is needlessly complicated, at least in the gmirror case changed (double negative).

In gmirror, I think this change introduces two regressions (I didn't look at other geom classes), so I'd ask for those to be fixed before committing.

tests/sys/geom/class/geom_subr.sh
78

I don't think most things need to care about the poll interval.

84

Just use if

tests/sys/geom/class/mirror/conf.sh
17

1000 is a regression โ€” the previous code was correct, i.e., no timeout (aside from any global test timeout).

I don't like the additional complexity added in testing this condition, either.

tests/sys/geom/class/mirror/sync_error.sh
37

I think this may introduce a race condition. This sleep is accounting for the transition to SYNCHRONIZING; syncwait is accounting for the transition from SYNCHRONIZING.

This revision now requires changes to proceed.Dec 17 2018, 6:26 PM
tests/sys/geom/class/mirror/conf.sh
17

I think it would be better to use a reasonable timeout for syncwait, just like most wait_for callers do. That will result in a better error message on failure than using the global test timeout. However, syncwait is also used by TAP tests. I don't think the 1000s wait is a problem, because it's longer than the default test timeout. Would you rather I update all of the TAP tests to use a reasonable timeout?

tests/sys/geom/class/mirror/sync_error.sh
37

Then what, pray tell, should the status be before it becomes SYNCHRONIZING?

tests/sys/geom/class/mirror/conf.sh
17

I think it would be better to use a reasonable timeout for syncwait

The problem with that statement that there is no "reasonable timeout" for mirror synchronization.

I don't think the 1000s wait is a problem, because it's longer than the default test timeout.

I don't think we should hardcode assumptions about the current, default test timeout.

tests/sys/geom/class/mirror/sync_error.sh
37

Empirically, "COMPLETE".

tests/sys/geom/class/mirror/conf.sh
17

I don't think we should hardcode assumptions about the current, default test timeout.

Then how about INT_MAX instead of 1000?

tests/sys/geom/class/mirror/sync_error.sh
37

Wait, the status is COMPLETE _before_ it changes to SYNCHRONIZING? That sounds like a bug in gmirror. Is there a generation count or some other way for us to tell that the rebuild hasn't started yet? ZFS does this by using the DEGRADED pool status when a disk is missing.

tests/sys/geom/class/mirror/conf.sh
17

I think INT_MAX is a poor way to spell optional, especially in shell.

I'm not sure how wide the $((a+b)) calculation in wait_for is performed at, but it seems like INT_MAX might cause overflow there on (?)32-bit platforms.

tests/sys/geom/class/mirror/sync_error.sh
37

Look at the code in the test. The gmirror had 1/1 component before $md2 was added to the array. COMPLETE is accurate โ€” there is no missing disk.

Yes, there is a mirrorset generation number in the kernel that is incremented when components are added. You'd have to investigate how to look at that value from userspace.

Are you similarly unfamiliar with the other GEOM components you're changing in this revision? If so, that seems problematic for the rest of the revision.

tests/sys/geom/class/mirror/conf.sh
17

-1 would be fine though, right?

tests/sys/geom/class/mirror/conf.sh
17

@cem: I think I see what you mean. @asomers's concern might be coming from a place of there being a problem with a global timeout being exceeded in TAP testcases, whereas your concern is over the complexity required in both cases?

The right solution in this case would probably be to convert the TAP tests to ATF tests, since it's in an intermediary state (mix of ATF and TAP tests using the same underlying logic), reducing the need for having to worry about a global timeout. This intermediary state is causing an unnecessary split in terms of logic, forcing there to be unnecessary complexity (here and elsewhere) :/.

tests/sys/geom/class/mirror/sync_error.sh
37

@cem: I think @asomers addressing the practicality/determinant behavior of the sleep seems valid. What happens if the timeout (0.1) was incorrect, e.g., skewed or needed to be increased, such that the state that syncwait was expecting as initial state was not the expected state?

tests/sys/geom/class/mirror/sync_error.sh
37

*potentially indeterminate