
snd_dummy: Make callout stopping more robust
ClosedPublic

Authored by christos on May 16 2025, 8:53 PM.

Details

Summary

If the callout gets rescheduled during detach, we might access freed
pcm_channel resources in dummy_chan_io(), which will cause a panic
similar to this:

panic: ASan: Invalid access, 8-byte read at 0xfffffe00479f65d8, UMAUseAfterFree(fd)
cpuid = 1
time = 1747433047
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0xa5/frame 0xfffffe0046a8d730
kdb_backtrace() at kdb_backtrace+0xc6/frame 0xfffffe0046a8d890
vpanic() at vpanic+0x226/frame 0xfffffe0046a8da30
panic() at panic+0xb5/frame 0xfffffe0046a8db00
kasan_code_name() at kasan_code_name/frame 0xfffffe0046a8dbd0
mtx_lock_flags() at mtx_lock_flags+0xd3/frame 0xfffffe0046a8dcc0
chn_intr() at chn_intr+0x3d/frame 0xfffffe0046a8dce0
dummy_chan_io() at dummy_chan_io+0x9c/frame 0xfffffe0046a8dd10
softclock_call_cc() at softclock_call_cc+0x2bb/frame 0xfffffe0046a8de80
softclock_thread() at softclock_thread+0x162/frame 0xfffffe0046a8def0
fork_exit() at fork_exit+0xa3/frame 0xfffffe0046a8df30
fork_trampoline() at fork_trampoline+0xe/frame 0xfffffe0046a8df30

--- trap 0, rip = 0, rsp = 0, rbp = 0 ---

KDB: enter: panic
[ thread pid 2 tid 100030 ]
Stopped at kdb_enter+0x34: movq $0,0x1f09af1(%rip)
db>

To fix this, run callout_stop() in a loop until the callout is
successfully stopped, and introduce a barrier flag to prevent it from
being rescheduled. Also call callout_drain() before pcm_unregister().

Sponsored by: The FreeBSD Foundation
MFC after: 1 day
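
For orientation, here is a rough sketch of the sequence the summary describes, assuming the softc fields discussed in the review (sc->lock, sc->callout, and the new sc->stopped flag). It is illustrative only, not the literal patch; the locking around the stop loop is the main subject of the review below.

	/*
	 * Rough sketch of the sequence described in the summary (illustrative,
	 * not the literal patch): raise a barrier flag so dummy_chan_io() stops
	 * rescheduling itself, cancel the callout, and drain it before
	 * pcm_unregister() tears down the channels.
	 */
	static int
	dummy_detach(device_t dev)
	{
		struct dummy_softc *sc = device_get_softc(dev);
		int err;

		for (;;) {
			snd_mtxlock(sc->lock);
			sc->stopped = true;
			err = callout_stop(&sc->callout);
			snd_mtxunlock(sc->lock);	/* let a running handler finish */
			if (err != 0)
				break;
		}
		callout_drain(&sc->callout);		/* wait out any in-flight handler */

		err = pcm_unregister(dev);
		if (err != 0)
			return (err);
		/* ... destroy the softc mutex and free remaining resources ... */
		return (0);
	}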

Diff Detail

Repository: rG FreeBSD src repository

Event Timeline

christos created this revision.
This revision is now accepted and ready to land. May 16 2025, 9:56 PM

I ran the test (i.e., load snd_dummy, start virtual_oss, kill it, and then unload snd_dummy) in a loop for ~1 hour without crashes. Before the patch it would crash within the first few minutes.

Please add rrs to the reviewers.

I'm not sure about this change. I wonder if a race elsewhere is causing dummy_chan_trigger() to schedule the timeout after dummy_detach() has finished?

sys/dev/sound/dummy.c:203

If this flag is true, then we are racing with dummy_detach(). That is, dummy_detach() has set sc->stopped = true and then released the lock, after which it destroys the mutex.

So, if sc->stopped is true here, then there is a race condition somewhere else, and that needs to be fixed.

sys/dev/sound/dummy.c:360

Why do you need this loop? Isn't the flag enough to ensure that the callout isn't rescheduled?

Actually, wouldn't this be an infinite loop? If callout_stop() returns 0, then dummy_chan_io() is currently executing. Here we're holding the softc lock, which means that dummy_chan_io() dropped the lock to call chn_intr(). Then it'll try to reacquire the lock, which means that it'll block because dummy_detach() holds it, and then we're effectively deadlocked.
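
In sketch form, the interleaving being described (a hypothetical simplification, not the actual driver code):

	/*
	 * softclock thread (dummy_chan_io)        detach thread (this patch)
	 * --------------------------------        --------------------------
	 * snd_mtxlock(sc->lock);
	 * ...
	 * snd_mtxunlock(sc->lock);
	 * chn_intr(...);                          snd_mtxlock(sc->lock);
	 *                                         sc->stopped = true;
	 *                                         while (callout_stop(&sc->callout) == 0)
	 *                                                 ;  <- spins: the handler has not finished
	 * snd_mtxlock(sc->lock);  <- blocks: detach still holds sc->lock
	 *
	 * Neither side can proceed: callout_stop() keeps returning 0 while
	 * dummy_chan_io() is running, and dummy_chan_io() cannot finish until
	 * the detach thread releases the lock.
	 */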

sys/dev/sound/dummy.c:360

hi! this is all from my suggestions to christos. here goes.

  • i suggested the flag because it's the cleanest way to ensure that any entry point that schedules the callout can be wrapped with a guard
  • without the flag, some other code running in parallel (especially ioctls, since those aren't in any way guarded against overlapping with the driver shutdown/detach/suspend routines) could be waiting to acquire the lock, and once that loop completes, it may immediately schedule the callout.

Aha, and after reading the code, christos missed a couple things, and you're right about the deadlock!

  • it shouldn't deadlock normally if the callout didn't drop the lock. The above loop is much more relevant if you're doing callouts that don't grab a lock before being called. However, he /is/ releasing the lock, doing some work, and re-acquiring the lock. In this case he needs to re-check the flag, and in his loop he needs to drop and re-acquire the lock.

What do you think?
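
A sketch of the handler-side guard that suggestion implies (names and bodies are illustrative, not the final patch, and the real driver may manage the lock differently): re-check sc->stopped after every reacquisition of the lock, and only reschedule if it is still clear.

	/*
	 * Hypothetical handler-side guard along the lines suggested above
	 * (illustrative only).
	 */
	static void
	dummy_chan_io(void *arg)
	{
		struct dummy_softc *sc = arg;

		snd_mtxlock(sc->lock);
		if (sc->stopped) {			/* barrier raised by dummy_detach() */
			snd_mtxunlock(sc->lock);
			return;
		}

		/* ... per-channel work; the lock is dropped around chn_intr() ... */
		snd_mtxunlock(sc->lock);
		/* chn_intr(ch->channel); */
		snd_mtxlock(sc->lock);

		/* Re-check: detach may have raised the flag while the lock was dropped. */
		if (!sc->stopped)
			callout_schedule(&sc->callout, 1);
		snd_mtxunlock(sc->lock);
	}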

sys/dev/sound/dummy.c:360

> (especially ioctls, since those aren't in any way guarded against overlapping with the driver shutdown/detach/suspend routines)

err, they definitely should be? Detach should destroy associated character devices and block waiting for threads to drain before proceeding. This is what pcm_unregister()->dsp_destroy_dev() does. Otherwise we'd have all sorts of race conditions.

Having a flag is fine, but callout_drain() should already handle the possibility of the callout rearming itself. So I suspect there is a problem elsewhere (or even possibly in the callout code, though I somewhat doubt that at this point).

sys/dev/sound/dummy.c:360

I agree that ioctls should be guarded, and in the past I've done that with a flag like the one above.

The last time I checked, ioctls weren't actually blocked. Once you deregister the cdev any /further/ calls fail, but any /current/ calls are still in flight. And since drivers have a habit of doing stuff like unlock;stuff;lock, and doing stuff outside of the driver lock, the ioctl and other paths are frequently running at inopportune times during driver teardown.

(And honestly, outside of USB and some of the virtual ethernet/block drivers, driver teardown is not exactly the most stress-tested codepath, as I keep discovering. :-) )

sys/dev/sound/dummy.c:360

> The last time I checked, ioctls weren't actually blocked. Once you deregister the cdev any /further/ calls fail, but any /current/ calls are still in flight. And since drivers have a habit of doing stuff like unlock;stuff;lock, and doing stuff outside of the driver lock, the ioctl and other paths are frequently running at inopportune times during driver teardown.

This scenario is probably what causes the race here. It's a bit "random" to reproduce this panic, so I suppose at some point we hit some narrow window where dummy_chan_trigger() is externally called during detach, and as a result we start rescheduling the callout.

I'm not sure there is some other race here in snd_dummy(4): the only place where the callout can be rescheduled is dummy_chan_trigger(), and the last time that gets called is from pcm_unregister() when the channels are destroyed, where it essentially amounts to a callout_stop(), so on our part there is no rescheduling that I can see. And the fact that the panic backtrace shows an access to a freed channel lock (i.e., after pcm_unregister() has returned) makes me think @adrian is right.

sys/dev/sound/dummy.c:360

Maybe this is better?

	for (;;) {
		snd_mtxlock(sc->lock);
		sc->stopped = true;
		err = callout_stop(&sc->callout);
		snd_mtxunlock(sc->lock);
		if (err != 0)
			break;
	}
sys/dev/sound/dummy.c:360

> The last time I checked, ioctls weren't actually blocked. Once you deregister the cdev any /further/ calls fail, but any /current/ calls are still in flight.

Fortunately, we have the sources, so we don't need to guess. ;)

In destroy_dev() -> destroy_devl(), since 2006 there has been a loop: while (dev->si_threadcount != 0) sleep(). si_threadcount is incremented by devvn_refthread() at the beginning of devfs_ioctl(), before the device's d_ioctl implementation is called, and in incrementing the count it atomically checks for device destruction. So it seems pretty clear that pcm_unregister() will drain dsp_ioctl() callers, or at least that devfs isn't the culprit here.

> It's a bit "random" to reproduce this panic

How exactly does one reproduce it?

sys/dev/sound/dummy.c:360

> How exactly does one reproduce it?

You can just load snd_dummy, write to it, and hot-unload it in an endless loop, and at some point (it could take a while) it should panic. The way I found it was with this:

#!/bin/sh

while true; do
        kldload snd_dummy
        virtual_oss -S -C 2 -c 2 -r 48000 -b 16 -s 1024 -f /dev/dsp0 -d dsp -t vdsp.ctl &
        sleep 2
        killall virtual_oss
        sleep 2
        kldunload snd_dummy
done

The reason why this can panic in the first place is not really relevant here, but I'm actually exploiting a different bug (either in virtual_oss or sound(4)) which I'm still trying to figure out. Basically, even if you kill virtual_oss, the primary channels keep being used somehow, so there might be a small window where the callout gets rescheduled during detach, hence the panic I've attached in the commit message.

The bug D50488 solves could be a reason why the callout was being rescheduled.

> The bug D50488 solves could be a reason why the callout was being rescheduled.

It could be, but I still think it's worth being more defensive in making sure we clean things up / stop things right.

>> The bug D50488 solves could be a reason why the callout was being rescheduled.
>
> It could be, but I still think it's worth being more defensive in making sure we clean things up / stop things right.

No objection to that. :)

>> The bug D50488 solves could be a reason why the callout was being rescheduled.
>
> It could be, but I still think it's worth being more defensive in making sure we clean things up / stop things right.

Again, the change in the patch introduces a new bug: if callout_stop() returns 0, then the callout thread is blocked on the softc lock, which means that it's waiting for the callout_stop() caller to release the lock, which means that the loop while (callout_stop(&sc->callout) == 0) ; never terminates.

>>> The bug D50488 solves could be a reason why the callout was being rescheduled.
>>
>> It could be, but I still think it's worth being more defensive in making sure we clean things up / stop things right.
>
> Again, the change in the patch introduces a new bug: if callout_stop() returns 0, then the callout thread is blocked on the softc lock, which means that it's waiting for the callout_stop() caller to release the lock, which means that the loop while (callout_stop(&sc->callout) == 0) ; never terminates.

oh i thought christos fixed that! Yeah we should fix that before landing.

Sorry, I've had so many diffs to review lately ;-)

>>> The bug D50488 solves could be a reason why the callout was being rescheduled.
>>
>> It could be, but I still think it's worth being more defensive in making sure we clean things up / stop things right.
>
> Again, the change in the patch introduces a new bug: if callout_stop() returns 0, then the callout thread is blocked on the softc lock, which means that it's waiting for the callout_stop() caller to release the lock, which means that the loop while (callout_stop(&sc->callout) == 0) ; never terminates.

I've posted a possible fix below which I've been running with for a while and seems to not cause any problems. Updating the diff.

Unlock after callout_stop() in the loop.

This revision now requires review to proceed. May 25 2025, 3:58 PM

>>>> The bug D50488 solves could be a reason why the callout was being rescheduled.
>>>
>>> It could be, but I still think it's worth being more defensive in making sure we clean things up / stop things right.
>>
>> Again, the change in the patch introduces a new bug: if callout_stop() returns 0, then the callout thread is blocked on the softc lock, which means that it's waiting for the callout_stop() caller to release the lock, which means that the loop while (callout_stop(&sc->callout) == 0) ; never terminates.
>
> I've posted a possible fix below which I've been running with for a while and seems to not cause any problems. Updating the diff.

Ok, but what's the purpose of the loop? What race does it solve?

If you have a flag which ensures that the callout doesn't get rescheduled, and all consumers atomically check the flag before rescheduling the callout, then just

snd_mtxlock(sc->lock);
sc->stopped = true;
snd_mtxunlock(sc->lock);
callout_drain(&sc->callout);

should be sufficient.

This revision is now accepted and ready to land. May 28 2025, 3:20 PM