Page MenuHomeFreeBSD

sound: Fix hot-unload panics
Needs ReviewPublic

Authored by christos on Tue, Nov 5, 9:27 PM.
Tags
None
Referenced Files
Unknown Object (File)
Sun, Nov 17, 4:02 AM
Unknown Object (File)
Sat, Nov 16, 2:35 PM
Unknown Object (File)
Fri, Nov 15, 7:12 PM
Unknown Object (File)
Thu, Nov 14, 6:51 AM
Unknown Object (File)
Mon, Nov 11, 12:48 PM
Unknown Object (File)
Mon, Nov 11, 12:27 PM
Unknown Object (File)
Thu, Nov 7, 8:32 PM
Unknown Object (File)
Wed, Nov 6, 6:31 AM
Subscribers

Details

Summary

This patch fixes multiple different panic scenarios occuring during
hot-unload:

  1. The channel is unlocked in chn_read()/chn_write() for uiomove(9) and in the meantime we enter pcm_killchans() and free it. By the time we have returned from userland and try to lock it back, the channel will have been freed.
  2. The parent channel has been freed in pcm_killchans(), but at the same time, some yet-unstopped vchan's chn_read()/chn_write() calls chn_start(), which eventually calls vchan_trigger(), which references the freed parent.
  3. PCM_WAIT() panics because it references a freed PCM lock.

For scenarios 1 and 2, refactor pcm_killchans() to first make sure all
channels have been stopped, and then proceed to free them one by one, as
opposed to freeing the first free channel until all channels have been
freed. This change makes the code more robust, but might introduce some
performance overhead when many channels are allocated, since we
continuously loop through the channel list until all of them are
stopped, and then we loop one last time to free them.

For scenario 3, restructure the code so that we can use destroy_dev(9)
instead of destroy_dev_sched(9) in dsp_destroy_dev(). Because
destroy_dev(9) blocks until all references to the device have went away,
we ensure that the PCM cv and lock will be freed safely.

While here, move the delete_unrhdr(9) calls to pcm_killchans() and
re-order some lines.

Sponsored by: The FreeBSD Foundation
MFC after: 2 days

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped
Build Status
Buildable 60638
Build 57522: arc lint + arc unit

Event Timeline

@dev_submerge.ch Can you also test whether this patch along with all the other patches applied (D47459, D47460, D47461, D47463) fix the hot-unload (or any other) panics?

@dev_submerge.ch Can you also test whether this patch along with all the other patches applied (D47459, D47460, D47461, D47463) fix the hot-unload (or any other) panics?

Seems good, for me the panics we had are not reproducible anymore. Didn't have time yet to look at the code, though.

@dev_submerge.ch Can you also test whether this patch along with all the other patches applied (D47459, D47460, D47461, D47463) fix the hot-unload (or any other) panics?

Seems good, for me the panics we had are not reproducible anymore. Didn't have time yet to look at the code, though.

I also ran stress2, as well as a simple script that spawns lots of channels and hot-unloads abruptly, in an infinite loop, for more than an hour, without any problems. :-)

The refactored implementation of pcm_killchans() relies on the SD_F_REGISTERED flag (unset) to prevent addition of new channels, right? Should we make this explicit via assert or a comment?

The refactored implementation of pcm_killchans() relies on the SD_F_REGISTERED flag (unset) to prevent addition of new channels, right? Should we make this explicit via assert or a comment?

While it wouldn't do any harm to assert that !PCM_REGISTERED, I do not really think it's necessary, since pcm_killchans() is only called from pcm_unregister() once we clear the flag.

The refactored implementation of pcm_killchans() relies on the SD_F_REGISTERED flag (unset) to prevent addition of new channels, right? Should we make this explicit via assert or a comment?

While it wouldn't do any harm to assert that !PCM_REGISTERED, I do not really think it's necessary, since pcm_killchans() is only called from pcm_unregister() once we clear the flag.

Yes, my point was that we should make this requirement explicit to prevent anybody from using pcm_killchans() in another context. The algorithm may fail randomly if new channels can be added, and its correctness isn't immediately obvious, at least it was not to me.

Assert that we enter pcm_killchans() with !PCM_REGISTERED.