Page MenuHomeFreeBSD

usb: Refactor usb_proc_msg
Needs RevisionPublic

Authored by aokblast on Tue, May 12, 2:30 PM.
Tags
Referenced Files
Unknown Object (File)
Thu, May 28, 12:48 PM
Unknown Object (File)
Thu, May 28, 4:30 AM
Unknown Object (File)
Wed, May 27, 9:58 PM
Unknown Object (File)
Wed, May 27, 11:31 AM
Unknown Object (File)
Wed, May 27, 1:14 AM
Unknown Object (File)
Mon, May 25, 7:50 PM
Unknown Object (File)
Mon, May 25, 7:45 PM
Unknown Object (File)
Mon, May 25, 6:54 PM
Subscribers

Details

Summary

Previously, the USB async process implemented its own worker and event handling. This design allowed all USB code to hold the mutex while events were being processed. However, the async process is not used directly: transfer, enumeration, and initilization are all wrapped internally, so driver developers do not need to manage it themselves.

This change decouples the mutex from the USB sync thread, reducing the size of critical sections. The tradeoff is that any custom async thread implementation must now handle locking explicitly.

The implementation is also clearer by using taskqueue.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 73044
Build 69927: arc lint + arc unit

Event Timeline

  • draft:
  • fix: AI fix
  • fix by me

Sorry, on the list to give this a closer look today- it looked nice on the surface and cleans up one annoyance I've had in the past with how this stuff works, but I want to be sure that we don't accidentally regress the serial bits.

Sorry, on the list to give this a closer look today- it looked nice on the surface and cleans up one annoyance I've had in the past with how this stuff works, but I want to be sure that we don't accidentally regress the serial bits.

Thanks. I haven't test on umb (maybe @khorben can help), ucom, and ugold device a I have none of them. I can wait further to find more testers.

sys/dev/usb/serial/usb_serial.c
627

This seems racey- what's stopping the proc from going away as soon as the lock is dropped and leading to an immediate NULL dereference? Actually, it's not even clear that the lock is necessary for the proc to be freed in the first place, which might be a larger problem.

631–632

I would just drop the error local entirely, I don't see this becoming any more complicated.

sys/dev/usb/usb_process.c
131

Reading`taskqueue(9)`, this seems like it might result in deadlock? See the massive paragraph about taskqueue_block()

200

Ditto above re: deadlock potential

sys/dev/usb/serial/usb_serial.c
627

All of the caller of this function is tty related function. And usb_proc_free is called after ucom_detach_tty. I need to double check again if it really happens.

sys/dev/usb/usb_process.c
131

Oh you are right. I will fix it. Thanks!

adrian requested changes to this revision.Sat, May 23, 4:44 PM
adrian added a subscriber: adrian.

So I'm starting to understand what you're trying to do here, however just a few things:

  • this code gets used or was used in some bootloader builds, right? So we need to make sure it runs in its standalone mode versus in the kernel
  • I'm a fan of pushing serialising items into a taskqueue to ensure they get run in order, but you definitely need to make sure you pay attention to locking and draining needs
  • I suggest thinking about what locks are and are not supposed to be held across schedule, drain, etc and put explicit lock assertions there
  • the moment I see UNLOCK; xxx ; LOCK patterns I am immediately worried. We already have a whole class of problems with USB consumers because of this (there's a bug about USB ethernet NICs getting link flaps and it's due to unlock;X;relock in USB allowing other work to make forward progress, messing things up!) so you really REALLY need to avoid that.
  • the point of bus lock ; mwait / signal ; unlock in the original code is rooted in making sure it's not racy with other things. You've changed it to do the signaling /outside/ of a lock, which can introduce timing races where something happens between your unlock and your signaling call.

So I'm going to request a whole big rethink about this. I like that you're trying to do work ordering using a taskqueue for work, but I'm very hesitant to see this happen given the common "racy" anti-patterns being introduced to work around locking issues (unlock; work; relock as a big example.)

This revision now requires changes to proceed.Sat, May 23, 4:44 PM

So I'm starting to understand what you're trying to do here, however just a few things:

  • this code gets used or was used in some bootloader builds, right? So we need to make sure it runs in its standalone mode versus in the kernel

Oh. I totally forget this. I will check if it works with our bootloader.

  • I'm a fan of pushing serialising items into a taskqueue to ensure they get run in order, but you definitely need to make sure you pay attention to locking and draining needs
  • I suggest thinking about what locks are and are not supposed to be held across schedule, drain, etc and put explicit lock assertions there

You are right. I will start to add more asserts tomorrow.

  • the moment I see UNLOCK; xxx ; LOCK patterns I am immediately worried. We already have a whole class of problems with USB consumers because of this (there's a bug about USB ethernet NICs getting link flaps and it's due to unlock;X;relock in USB allowing other work to make forward progress, messing things up!) so you really REALLY need to avoid that.

UNLCOK; xxx; LOCK; only happens when we need to mwait something. Actually, the original mwait implementation do this implicitly as we are waiting something that also requires this lock.

  • the point of bus lock ; mwait / signal ; unlock in the original code is rooted in making sure it's not racy with other things. You've changed it to do the signaling /outside/ of a lock, which can introduce timing races where something happens between your unlock and your signaling call.

I don't unlock the mtx when we only use signal. As you said, it breaks the synchronozation. Maybe I falsely unlock something when we only need to signal. I will carefully check again tomorrow.

So I'm going to request a whole big rethink about this. I like that you're trying to do work ordering using a taskqueue for work, but I'm very hesitant to see this happen given the common "racy" anti-patterns being introduced to work around locking issues (unlock; work; relock as a big example.)

My strategy for these modifications is that we initially held the lock implicitly, but now we need to acquire it explicitly.

For callbacks:

If the prologue already performs an unlock operation, I remove the unlock and do not add a new lock.
If the prologue does not unlock, I explicitly add the lock back.
If the epilogue performs a lock operation, I remove the lock; otherwise, I add an unlock.

For callers (e.g. msignal, mwait):

If only msignal is used, we do nothing — no unlock and no lock — which preserves the original ordering.
If msignal is followed by mwait, we unlock only if required. Since mwait immediately follows msignal, this should not break existing behavior.

The original implementation also had a limitation: it only accepted one (or two) types of events. As a result, a hotplug event had to enumerate the entire bus and fetch the port status for the bus and all child devices, which held the enumeration lock for a long time.

With this change, we can set up per-hub events and enumerate only the hub that triggered the hotplug interrupt.

ugh yeah you're right, our mwait() implementation does the same thing and has the same issues, doesn't it? I've come to increasingly not like that; code really needs to treat exiting the mwait() as "all state may have changed", and it's not really suitable for serialising stuff like usb state / transfers.

Want to take a look at the USB stack and bootloader stuff in particular? I'll go look at the unlock;XX;relock bit some more and see if there's something else which can be done or whether it needs to just be clearly commented in the function and whatever it is calling.

sys/dev/usb/controller/usb_controller.c
577–579

Hm, looks like these are ok to call inside the lock; but they (usb_bus_suspend/resume) are not asserting that they're being called with the lock.

sys/dev/usb/net/if_umb.c
983

ok so here's a locking change example i want to wrap my head around. why'd you remove the lock here now? eg can you guarantee that its not racing with another thread? usb_proc_is_gone() is checking the tq isn't false, but whats to prevent another thread tearing it down between the call to usb_proc_is_gone() and usb_proc_msignal() ? or usb_proc_mwait() afterwards?