Page MenuHomeFreeBSD

Fix a panic during boot while executing the vt_timer() callout
ClosedPublic

Authored by jtl on Feb 27 2017, 7:02 PM.
Tags
None
Referenced Files
Unknown Object (File)
Wed, Dec 11, 3:44 AM
Unknown Object (File)
Dec 2 2024, 4:03 PM
Unknown Object (File)
Oct 18 2024, 10:53 PM
Unknown Object (File)
Oct 16 2024, 5:34 PM
Unknown Object (File)
Oct 9 2024, 3:10 AM
Unknown Object (File)
Oct 1 2024, 1:10 PM
Unknown Object (File)
Sep 26 2024, 11:10 PM
Unknown Object (File)
Sep 21 2024, 6:09 AM
Subscribers

Details

Summary

With EARLY_AP_STARTUP enabled, we are seeing crashes in softclock_call_cc() during bootup. Debugging information shows that softclock_call_cc() is trying to execute the vt_consdev.vd_timer callout, and the callout structure contains a NULL c_func.

This appears to be due to a race between vt_upgrade() running callout_reset() and vt_resume_flush_timer() calling callout_schedule().

Fix the race by ensuring that vd_timer_armed is always set before attempting to (re)schedule the callout.

Test Plan

I modified the code to add a delay in order to make it more likely we would hit the race condition. With that code and without this proposed patch, I saw crashes. With that code and with this proposed patch, I did not see crashes.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

jtl retitled this revision from to Fix a panic during boot while executing the vt_timer() callout.
jtl updated this object.
jtl edited the test plan for this revision. (Show Details)
jtl added reviewers: ed, ray, emaste.

By the time I see multiple atomics operations stacked like this, I start to wonder: isn't there a way to guarantee this by simply using locking APIs? We already call VT_LOCK()/VT_UNLOCK(), it seems. Maybe this is just missing in one of the other places?

sys/dev/vt/vt_core.c
2623 ↗(On Diff #25766)

What are the races? Might not be a bad idea to point to the other side of the race.

In D9828#202716, @ed wrote:

By the time I see multiple atomics operations stacked like this, I start to wonder: isn't there a way to guarantee this by simply using locking APIs? We already call VT_LOCK()/VT_UNLOCK(), it seems. Maybe this is just missing in one of the other places?

I just opened bug 217408 for this. I have more analysis there.

In short, the other end of the race may hold a spin lock when calling vt_resume_flush_timer(), so it can't acquire the vd lock (a regular mutex) at that point.

We don't actually need atomics here. But, we do care about order of operations, so I used them for the memory barriers they provide. We need to ensure that vd_timer_armed is 1 before we set VDF_ASYNC. We may be able to do away with the second atomic. I guess it isn't really critical VDF_ASYNC is set before we call vt_timer(). (And, even though it might be somewhat unusual, vt_timer() may well fire before callout_reset() returns in the case the thread gets pre-empted. But, as I said, it may actually not be critical that VDF_ASYNC is set before vt_timer() runs. So, maybe that was an overreaction on my part.)

sys/dev/vt/vt_core.c
2623 ↗(On Diff #25766)

ACK. I'll try to improve the comment.

jtl edited edge metadata.

Enhance the comment. Move an atomic operation to non-atomic.

I had a long discussion with @ed in the bug. You can see that for context. I think the summary is that I still think we should commit this code, even if @ed thinks a more robust fix requires a larger change. I would like to do that, but do also support @ed making a further bug fix in the future.

sys/dev/vt/vt_core.c
2627 ↗(On Diff #25772)

perhaps add a reference to the PR?

sys/dev/vt/vt_core.c
2627 ↗(On Diff #25772)

As we discussed today the dev summit, I think this change is still necessary even in the single-core case. So, it would still be necessary even with the change @ed proposed in the PR. Therefore, since this change shouldn't be temporary, it is not necessary to reference the PR in the comment.

This revision was automatically updated to reflect the committed changes.