Page MenuHomeFreeBSD

syslogd/tests: Address races
ClosedPublic

Authored by markj on Mon, Jan 19, 3:41 PM.
Tags
None
Referenced Files
F143043310: D54779.id170246.diff
Sun, Jan 25, 11:22 AM
F143043304: D54779.id170044.diff
Sun, Jan 25, 11:22 AM
F143043299: D54779.id.diff
Sun, Jan 25, 11:22 AM
F143023435: D54779.id170218.diff
Sun, Jan 25, 8:23 AM
F143023427: D54779.id170084.diff
Sun, Jan 25, 8:23 AM
F143013086: D54779.diff
Sun, Jan 25, 6:42 AM
Unknown Object (File)
Sat, Jan 24, 10:33 AM
Unknown Object (File)
Sat, Jan 24, 7:07 AM
Subscribers

Details

Summary

I occasionally see failures in the syslogd test suite. The problem is
that the tests are racy: they often 1) send a message using logger(1),
then 2) immediately check whether the message was logged to a log file.
If the syslogd instance under test doesn't get a chance to run before
step 2 the test fails.

This change reworks things to avoid the race while minimizing the amount
of time sleeping.

  1. Each test uses a single logfile, so have them use a new common variable, SYSLOGD_LOGFILE, instead of something test-specific.
  2. In syslogd_start(), if the configuration references SYSLOGD_LOGFILE, wait for it to be created by syslogd before returning. Record its modification time.
  3. Add a helper syslogd_check_log(), to check for a given log entry in the last line of SYSLOGD_LOGFILE, instead of using atf_check directly.
  4. In syslogd_check_log(), wait for the modification time of the logfile to change relative to syslogd_startup or a previous syslogd_check_log() call, before checking for the desired log entry.

With this change, I was able to run the tests 1000 times in a loop with
4-way parallelism without seeing any test failures. Without the change
I usually get a failure within 10 loops.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped
Build Status
Buildable 69985
Build 66868: arc lint + arc unit

Event Timeline

Nice! I may steal a few ideas for some other similar tests.
I made a small observation, that can be ignored, I'm just thrilled with my recent discovery of atf_check -r.

usr.sbin/syslogd/tests/syslogd_test_common.sh
200

I wonder if almost the entirety of this function can be replaced with:

atf_check -r 50 test $(cat MTIME) != ${mtime}
This revision is now accepted and ready to land.Mon, Jan 19, 8:27 PM
usr.sbin/syslogd/tests/syslogd_test_common.sh
200

Oh huh. I think I can use that to get rid of MTIME entirely: when we're looking for logfile entries, just do something like:

atf_check -r 5 -o match:"${msg}" tail -n 1 "${SYSLOGD_LOGFILE}"

I'll give that a try.

usr.sbin/syslogd/tests/syslogd_test_common.sh
200

This almost works, but there's a small issue: if the command times out, as it will in some syslogd tests which are marked expected-to-fail, the test result is "broken" instead of "expected failure".

I think I'll just modify those test cases to use syslogd_check_log_nopoll() instead, as the -r trick makes this patch much simpler.

Follow jlduran's suggestion of using atf_check -r.

This revision now requires review to proceed.Mon, Jan 19, 11:58 PM

Very nice!

usr.sbin/syslogd/tests/syslogd_test_common.sh
200

I think that is a bug in atf-check. Will investigate.

This revision is now accepted and ready to land.Tue, Jan 20, 1:37 AM
usr.sbin/syslogd/tests/syslogd_test_common.sh
200

Yeah, I tend to agree. I think if the global test timeout elapses, then the test should be marked broken regardless of whether it's expected to fail, but individual atf_check timeouts should be handled the same as if atf_check itself had failed.

usr.sbin/syslogd/tests/syslogd_test_common.sh
200

More info on the "bug": If you use any timeout less than 5, it works as expected:

atf_check -r 4 -o match:"${msg}" tail -n 1 "${SYSLOGD_LOGFILE}"
usr.sbin/syslogd/tests/syslogd_test_common.sh
200

Ahh, good find, we have this set_common_atf_metadata() function which sets the timeout. syslogd/tests/Makefile already sets a timeout of 20s for each test, which I think is fine... what do you think of simply removing this 5s timeout?

usr.sbin/syslogd/tests/syslogd_test_common.sh
200

Ah, yes, I agree. Also, you should be able to remove syslogd_check_log_nopoll(), as syslogd_check_log() will fail as expected after 10 seconds (and not broken).

Remove the 5s timeout, switch away from using syslogd_check_log_nopoll()
in tests which are expected to fail.

This revision now requires review to proceed.Wed, Jan 21, 8:33 PM
This revision is now accepted and ready to land.Wed, Jan 21, 8:49 PM
This revision was automatically updated to reflect the committed changes.