
snd_dummy: Make callout stopping more robust
ClosedPublic

Authored by christos on May 16 2025, 8:53 PM.

Details

Summary

If the callout gets rescheduled during detach, we might access freed
pcm_channel resources in dummy_chan_io(), which will cause a panic
similar to this:

panic: ASan: Invalid access, 8-byte read at 0xfffffe00479f65d8, UMAUseAfterFree(fd)
cpuid = 1
time = 1747433047
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0xa5/frame 0xfffffe0046a8d730
kdb_backtrace() at kdb_backtrace+0xc6/frame 0xfffffe0046a8d890
vpanic() at vpanic+0x226/frame 0xfffffe0046a8da30
panic() at panic+0xb5/frame 0xfffffe0046a8db00
kasan_code_name() at kasan_code_name/frame 0xfffffe0046a8dbd0
mtx_lock_flags() at mtx_lock_flags+0xd3/frame 0xfffffe0046a8dcc0
chn_intr() at chn_intr+0x3d/frame 0xfffffe0046a8dce0
dummy_chan_io() at dummy_chan_io+0x9c/frame 0xfffffe0046a8dd10
softclock_call_cc() at softclock_call_cc+0x2bb/frame 0xfffffe0046a8de80
softclock_thread() at softclock_thread+0x162/frame 0xfffffe0046a8def0
fork_exit() at fork_exit+0xa3/frame 0xfffffe0046a8df30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0046a8df30

--- trap 0, rip = 0, rsp = 0, rbp = 0 ---

KDB: enter: panic
[ thread pid 2 tid 100030 ]
Stopped at kdb_enter+0x34: movq $0,0x1f09af1(%rip)
db>

To fix this, run callout_stop() in a loop until the callout is
successfully stopped, and introduce a barrier flag to prevent it from
being rescheduled. Also call callout_drain() before pcm_unregister().

Sponsored by: The FreeBSD Foundation
MFC after: 1 day
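
For orientation, here is a rough sketch of the sequence the summary describes, assuming the softc fields discussed in the review (sc->lock, sc->callout, and the new sc->stopped flag). It is illustrative only, not the literal patch; the locking around the stop loop is the main subject of the review below.

	/*
	 * Rough sketch of the sequence described in the summary (illustrative,
	 * not the literal patch): raise a barrier flag so dummy_chan_io() stops
	 * rescheduling itself, cancel the callout, and drain it before
	 * pcm_unregister() tears down the channels.
	 */
	static int
	dummy_detach(device_t dev)
	{
		struct dummy_softc *sc = device_get_softc(dev);
		int err;

		for (;;) {
			snd_mtxlock(sc->lock);
			sc->stopped = true;
			err = callout_stop(&sc->callout);
			snd_mtxunlock(sc->lock);	/* let a running handler finish */
			if (err != 0)
				break;
		}
		callout_drain(&sc->callout);		/* wait out any in-flight handler */

		err = pcm_unregister(dev);
		if (err != 0)
			return (err);
		/* ... destroy the softc mutex and free remaining resources ... */
		return (0);
	}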

Diff Detail

Repository: rG FreeBSD src repository

Event Timeline

christos created this revision.
This revision is now accepted and ready to land. May 16 2025, 9:56 PM

I ran the test (i.e., load snd_dummy, start virtual_oss, kill it, and then unload snd_dummy) in a loop for ~1 hour without crashes. Before the patch it would crash within the first few minutes.

Please add rrs to the reviewers.

I'm not sure about this change. I wonder if a race elsewhere is causing dummy_chan_trigger() to schedule the timeout after dummy_detach() has finished?

sys/dev/sound/dummy.c:203

If this flag is true, then we are racing with dummy_detach(). That is, dummy_detach() has set sc->stopped = true and then released the lock, after which it destroys the mutex.

So, if sc->stopped is true here, then there is a race condition somewhere else, and that needs to be fixed.

sys/dev/sound/dummy.c:360

Why do you need this loop? Isn't the flag enough to ensure that the callout isn't rescheduled?

Actually, wouldn't this be an infinite loop? If callout_stop() returns 0, then dummy_chan_io() is currently executing. Here we're holding the softc lock, which means that dummy_chan_io() dropped the lock to call chn_intr(). Then it'll try to reacquire the lock, which means that it'll block because dummy_detach() holds it, and then we're effectively deadlocked.
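
In sketch form, the interleaving being described (a hypothetical simplification, not the actual driver code):

	/*
	 * softclock thread (dummy_chan_io)        detach thread (this patch)
	 * --------------------------------        --------------------------
	 * snd_mtxlock(sc->lock);
	 * ...
	 * snd_mtxunlock(sc->lock);
	 * chn_intr(...);                          snd_mtxlock(sc->lock);
	 *                                         sc->stopped = true;
	 *                                         while (callout_stop(&sc->callout) == 0)
	 *                                                 ;  <- spins: the handler has not finished
	 * snd_mtxlock(sc->lock);  <- blocks: detach still holds sc->lock
	 *
	 * Neither side can proceed: callout_stop() keeps returning 0 while
	 * dummy_chan_io() is running, and dummy_chan_io() cannot finish until
	 * the detach thread releases the lock.
	 */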

sys/dev/sound/dummy.c:360

hi! this is all from my suggestions to christos. here goes.

  • i suggested the flag because it's the cleanest way to ensure that any entry point that schedules the callout can be wrapped with a guard
  • without the flag, some other code running in parallel (especially ioctls, since those aren't in any way guarded against overlapping with the driver shutdown/detach/suspend routines) could be waiting to acquire the lock, and once that loop completes, it may immediately schedule the callout.

Aha, and after reading the code, christos missed a couple things, and you're right about the deadlock!

  • it shouldn't deadlock normally if the callout didn't drop the lock. The above loop is much more relevant if you're doing callouts that don't grab a lock before being called. However, he /is/ releasing the lock, doing some work, and re-acquiring the lock. In this case he needs to re-check the flag, and in his loop he needs to drop and re-acquire the lock.

What do you think?
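
A sketch of the handler-side guard that suggestion implies (names and bodies are illustrative, not the final patch, and the real driver may manage the lock differently): re-check sc->stopped after every reacquisition of the lock, and only reschedule if it is still clear.

	/*
	 * Hypothetical handler-side guard along the lines suggested above
	 * (illustrative only).
	 */
	static void
	dummy_chan_io(void *arg)
	{
		struct dummy_softc *sc = arg;

		snd_mtxlock(sc->lock);
		if (sc->stopped) {			/* barrier raised by dummy_detach() */
			snd_mtxunlock(sc->lock);
			return;
		}

		/* ... per-channel work; the lock is dropped around chn_intr() ... */
		snd_mtxunlock(sc->lock);
		/* chn_intr(ch->channel); */
		snd_mtxlock(sc->lock);

		/* Re-check: detach may have raised the flag while the lock was dropped. */
		if (!sc->stopped)
			callout_schedule(&sc->callout, 1);
		snd_mtxunlock(sc->lock);
	}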

sys/dev/sound/dummy.c:360

> (especially ioctls, since those aren't in any way guarded against overlapping with the driver shutdown/detach/suspend routines)

err, they definitely should be? Detach should destroy associated character devices and block waiting for threads to drain before proceeding. This is what pcm_unregister()->dsp_destroy_dev() does. Otherwise we'd have all sorts of race conditions.

Having a flag is fine, but callout_drain() should already handle the possibility of the callout rearming itself. So I suspect there is a problem elsewhere (or even possibly in the callout code, though I somewhat doubt that at this point).

sys/dev/sound/dummy.c:360

I agree that ioctls should be guarded, and in the past I've done that with a flag like the one above.

The last time I checked, ioctls weren't actually blocked. Once you deregister the cdev any /further/ calls fail, but any /current/ calls are still in flight. And since drivers have a habit of doing stuff like unlock;stuff;lock, and doing stuff outside of the driver lock, the ioctl and other paths are frequently running at inopportune times during driver teardown.

(And honestly, outside of USB and some of the virtual ethernet/block drivers, driver teardown is not exactly the most stress-tested codepath, as I keep discovering. :-) )

sys/dev/sound/dummy.c:360

> The last time I checked, ioctls weren't actually blocked. Once you deregister the cdev any /further/ calls fail, but any /current/ calls are still in flight. And since drivers have a habit of doing stuff like unlock;stuff;lock, and doing stuff outside of the driver lock, the ioctl and other paths are frequently running at inopportune times during driver teardown.

This scenario is probably what causes the race here. It's a bit "random" to reproduce this panic, so I suppose at some point we hit some narrow window where dummy_chan_trigger() is externally called during detach, and as a result we start rescheduling the callout.

I'm not sure there is some other race here in snd_dummy(4): the only place where the callout can be rescheduled is dummy_chan_trigger(), and the last time that gets called is from pcm_unregister() when the channels are destroyed, where it essentially amounts to a callout_stop(), so on our part there is no rescheduling that I can see. And the fact that the panic backtrace shows an access to a freed channel lock (i.e., after pcm_unregister() has returned) makes me think @adrian is right.

sys/dev/sound/dummy.c:360

Maybe this is better?

	for (;;) {
		snd_mtxlock(sc->lock);
		sc->stopped = true;
		err = callout_stop(&sc->callout);
		snd_mtxunlock(sc->lock);
		if (err != 0)
			break;
	}
sys/dev/sound/dummy.c:360

> The last time I checked, ioctls weren't actually blocked. Once you deregister the cdev any /further/ calls fail, but any /current/ calls are still in flight.

Fortunately, we have the sources, so we don't need to guess. ;)

In destroy_dev() -> destroy_devl(), since 2006 there has been a loop: while (dev->si_threadcount != 0) sleep(). si_threadcount is incremented by devvn_refthread() at the beginning of devfs_ioctl(), before the device's d_ioctl implementation is called, and in incrementing the count it atomically checks for device destruction. So it seems pretty clear that pcm_unregister() will drain dsp_ioctl() callers, or at least that devfs isn't the culprit here.

> It's a bit "random" to reproduce this panic

How exactly does one reproduce it?

sys/dev/sound/dummy.c:360

> How exactly does one reproduce it?

You can just load snd_dummy, write to it, and hot-unload it in an endless loop, and at some point (it could take a while) it should panic. The way I found it was with this:

#!/bin/sh

while true; do
        kldload snd_dummy
        virtual_oss -S -C 2 -c 2 -r 48000 -b 16 -s 1024 -f /dev/dsp0 -d dsp -t vdsp.ctl &
        sleep 2
        killall virtual_oss
        sleep 2
        kldunload snd_dummy
done

The reason why this can panic in the first place is not really relevant here, but I'm actually exploiting a different bug (either in virtual_oss or sound(4)) which I'm still trying to figure out. Basically, even if you kill virtual_oss, the primary channels keep being used somehow, so there might be a small window where the callout gets rescheduled during detach, hence the panic I've attached in the commit message.

The bug D50488 solves could be a reason why the callout was being rescheduled.

> The bug D50488 solves could be a reason why the callout was being rescheduled.

It could be, but I still think it's worth being more defensive in making sure we clean things up / stop things right.

>> The bug D50488 solves could be a reason why the callout was being rescheduled.
>
> It could be, but I still think it's worth being more defensive in making sure we clean things up / stop things right.

No objection to that. :)

>> The bug D50488 solves could be a reason why the callout was being rescheduled.
>
> It could be, but I still think it's worth being more defensive in making sure we clean things up / stop things right.

Again, the change in the patch introduces a new bug: if callout_stop() returns 0, then the callout thread is blocked on the softc lock, which means that it's waiting for the callout_stop() caller to release the lock, which means that the loop while (callout_stop(&sc->callout) == 0) ; never terminates.

>>> The bug D50488 solves could be a reason why the callout was being rescheduled.
>>
>> It could be, but I still think it's worth being more defensive in making sure we clean things up / stop things right.
>
> Again, the change in the patch introduces a new bug: if callout_stop() returns 0, then the callout thread is blocked on the softc lock, which means that it's waiting for the callout_stop() caller to release the lock, which means that the loop while (callout_stop(&sc->callout) == 0) ; never terminates.

oh i thought christos fixed that! Yeah we should fix that before landing.

Sorry, I've had so many diffs to review lately ;-)

>>> The bug D50488 solves could be a reason why the callout was being rescheduled.
>>
>> It could be, but I still think it's worth being more defensive in making sure we clean things up / stop things right.
>
> Again, the change in the patch introduces a new bug: if callout_stop() returns 0, then the callout thread is blocked on the softc lock, which means that it's waiting for the callout_stop() caller to release the lock, which means that the loop while (callout_stop(&sc->callout) == 0) ; never terminates.

I've posted a possible fix below which I've been running with for a while and seems to not cause any problems. Updating the diff.

Unlock after callout_stop() in the loop.

This revision now requires review to proceed. May 25 2025, 3:58 PM

>>>> The bug D50488 solves could be a reason why the callout was being rescheduled.
>>>
>>> It could be, but I still think it's worth being more defensive in making sure we clean things up / stop things right.
>>
>> Again, the change in the patch introduces a new bug: if callout_stop() returns 0, then the callout thread is blocked on the softc lock, which means that it's waiting for the callout_stop() caller to release the lock, which means that the loop while (callout_stop(&sc->callout) == 0) ; never terminates.
>
> I've posted a possible fix below which I've been running with for a while and seems to not cause any problems. Updating the diff.

Ok, but what's the purpose of the loop? What race does it solve?

If you have a flag which ensures that the callout doesn't get rescheduled, and all consumers atomically check the flag before rescheduling the callout, then just

snd_mtxlock(sc->lock);
sc->stopped = true;
snd_mtxunlock(sc->lock);
callout_drain(&sc->callout);

should be sufficient.

This revision is now accepted and ready to land. May 28 2025, 3:20 PM