Page MenuHomeFreeBSD

sound: Fix hot-unload panics
ClosedPublic

Authored by christos on Nov 5 2024, 9:27 PM.
Tags
None
Referenced Files
Unknown Object (File)
Fri, Dec 20, 9:47 AM
Unknown Object (File)
Sun, Dec 8, 11:43 PM
Unknown Object (File)
Fri, Nov 29, 2:33 PM
Unknown Object (File)
Nov 26 2024, 5:27 PM
Unknown Object (File)
Nov 25 2024, 2:38 AM
Unknown Object (File)
Nov 17 2024, 4:02 AM
Unknown Object (File)
Nov 16 2024, 2:35 PM
Unknown Object (File)
Nov 15 2024, 7:12 PM
Subscribers

Details

Summary

This patch fixes multiple different panic scenarios occuring during
hot-unload:

  1. The channel is unlocked in chn_read()/chn_write() for uiomove(9) and in the meantime we enter pcm_killchans() and free it. By the time we have returned from userland and try to lock it back, the channel will have been freed.
  2. The parent channel has been freed in pcm_killchans(), but at the same time, some yet-unstopped vchan's chn_read()/chn_write() calls chn_start(), which eventually calls vchan_trigger(), which references the freed parent.
  3. PCM_WAIT() panics because it references a freed PCM lock.

For scenarios 1 and 2, refactor pcm_killchans() to first make sure all
channels have been stopped, and then proceed to free them one by one, as
opposed to freeing the first free channel until all channels have been
freed. This change makes the code more robust, but might introduce some
performance overhead when many channels are allocated, since we
continuously loop through the channel list until all of them are
stopped, and then we loop one last time to free them.

For scenario 3, restructure the code so that we can use destroy_dev(9)
instead of destroy_dev_sched(9) in dsp_destroy_dev(). Because
destroy_dev(9) blocks until all references to the device have went away,
we ensure that the PCM cv and lock will be freed safely.

While here, move the delete_unrhdr(9) calls to pcm_killchans() and
re-order some lines.

Sponsored by: The FreeBSD Foundation
MFC after: 2 days

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped
Build Status
Buildable 60415
Build 57299: arc lint + arc unit

Event Timeline

@dev_submerge.ch Can you also test whether this patch along with all the other patches applied (D47459, D47460, D47461, D47463) fix the hot-unload (or any other) panics?

@dev_submerge.ch Can you also test whether this patch along with all the other patches applied (D47459, D47460, D47461, D47463) fix the hot-unload (or any other) panics?

Seems good, for me the panics we had are not reproducible anymore. Didn't have time yet to look at the code, though.

@dev_submerge.ch Can you also test whether this patch along with all the other patches applied (D47459, D47460, D47461, D47463) fix the hot-unload (or any other) panics?

Seems good, for me the panics we had are not reproducible anymore. Didn't have time yet to look at the code, though.

I also ran stress2, as well as a simple script that spawns lots of channels and hot-unloads abruptly, in an infinite loop, for more than an hour, without any problems. :-)

The refactored implementation of pcm_killchans() relies on the SD_F_REGISTERED flag (unset) to prevent addition of new channels, right? Should we make this explicit via assert or a comment?

The refactored implementation of pcm_killchans() relies on the SD_F_REGISTERED flag (unset) to prevent addition of new channels, right? Should we make this explicit via assert or a comment?

While it wouldn't do any harm to assert that !PCM_REGISTERED, I do not really think it's necessary, since pcm_killchans() is only called from pcm_unregister() once we clear the flag.

The refactored implementation of pcm_killchans() relies on the SD_F_REGISTERED flag (unset) to prevent addition of new channels, right? Should we make this explicit via assert or a comment?

While it wouldn't do any harm to assert that !PCM_REGISTERED, I do not really think it's necessary, since pcm_killchans() is only called from pcm_unregister() once we clear the flag.

Yes, my point was that we should make this requirement explicit to prevent anybody from using pcm_killchans() in another context. The algorithm may fail randomly if new channels can be added, and its correctness isn't immediately obvious, at least it was not to me.

Assert that we enter pcm_killchans() with !PCM_REGISTERED.

Any more comments on this one?

sys/dev/sound/pcm/sound.c
235

Do we need a break here? This makes the shutdown of the channels fully sequential. Is this required or would it make sense to shutdown all the channels in parallel?

sys/dev/sound/pcm/sound.c
235

It can probably go away. I will test it now.

christos marked an inline comment as done.

Remove break from pcm_killchans() loop. Gives us a performance boost. Thanks
Florian!

This revision is now accepted and ready to land.Nov 24 2024, 7:46 PM
This revision was automatically updated to reflect the committed changes.