Page MenuHomeFreeBSD

rtwn: change the USB TX transfers to only do one pending transfer per endpoint
ClosedPublic

Authored by adrian on Nov 12 2024, 5:00 AM.
Referenced Files
Unknown Object (File)
Dec 26 2024, 7:13 AM
Unknown Object (File)
Dec 26 2024, 7:01 AM
Unknown Object (File)
Dec 26 2024, 6:29 AM
Unknown Object (File)
Dec 26 2024, 5:29 AM
Unknown Object (File)
Dec 26 2024, 12:30 AM
Unknown Object (File)
Dec 25 2024, 10:32 PM
Unknown Object (File)
Dec 25 2024, 5:35 PM
Unknown Object (File)
Dec 25 2024, 9:16 AM
Subscribers

Details

Summary

I found I was getting constant device timeouts when doing anything
more complicated than a single SSH on laptop with RTL8811AU.

After digging into it, i found a variety of fun situations, including
traffic stalls that would recover w/ a shorter (1 second) USB transfer
timeout. However, the big one is a straight up hang of any TX endpoint
until the NIC was reset. The RX side kept going just fine; only the
TX endpoints would hang.

Reproducing it was easy - just start up a couple of traffic streams
on different WME AC's - eg a best effort + bulk transfer, like
browsing the web and doing an ssh clone - throw in a ping -i 0.1
to your gateway, and it would very quickly hit device timeouts every
couple of seconds.

I put everything into a single TX EP and the hangs went away.

So after some MORE digging, I found that this driver isn't checking
if the transfers are going into the correct EPs for the packet
WME access category / 802.11 TID; and would frequently be able
to schedule multiple transfers into the same endpoint.

My /guess/ was that the firmware isn't happy with one or both of the
above, and so I solved both.

  • drop the USB transfer timeout to 1 second, not 5 seconds - that way we'll either get a 1 second traffic pause and USB transfer failure, or a 5 second device timeout. Having both the TX timeout and the USB transfer timeout made recovery from a USB transfer timeout (without a NIC reset) almost impossible.
  • enforce one transfer per endpoint;
  • separate pending/active buffer tracking per endpoint;
  • each endpoint now has its own TX callback to make sure the queue / end point ID is known;
  • and only frames from a given endpoint pending queue is going into the active queue and into that endpoint.

This seems (fingers crossed) to have fixed the traffic hangs.

Locally tested:

  • rtl8812AU, 11n STA mode

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

emaste added inline comments.
sys/dev/rtwn/usb/rtwn_usb_tx.c
205

Worth a XXX or TODO comment to make this more discoverable?

bz requested changes to this revision.Nov 12 2024, 2:13 PM
bz added a subscriber: bz.

The #ifdef bits probably need to be addressed; the whitespace and punctuation could be optional.

sys/dev/rtwn/usb/rtwn_usb_attach.c
163–164

I'd possibly group the "inactive" bits together here.

sys/dev/rtwn/usb/rtwn_usb_tx.c
105

That's going to give warning/errors if D4054 gets defined; #ifdef missing here and initialization preferably still down there before the loop then.

201

sentences end in .

This revision now requires changes to proceed.Nov 12 2024, 2:13 PM
sys/dev/rtwn/usb/rtwn_usb_tx.c
105

also style says A for loop may declare and initialize its counting variable which will avoid the issue entirely for i.

We don't have anything that actually defines D4054 though, right? It's really just a placeholder for "update this if D4054 lands"?

note - I've tested this on a variety of other rtwn NICs now - RTL8821AU, RTL8188EU, RTL8192CU - they worked fine. But RTL8192EU isn't working; TX seems broken. I don't think it's this patch, but I'm going to test on older laptops and also earlier freebsd version snapshots.

So stay tuned, I really do want to make sure this works on all the realtek USB devices we currently have support for before I land this!

hm, looks like it may be a USB controller problem, hmph. Stay tuned.

adrian marked 3 inline comments as done and an inline comment as not done.Nov 14 2024, 1:08 AM
adrian added inline comments.
sys/dev/rtwn/usb/rtwn_usb_attach.c
163–164

what do you mean by group here?

sys/dev/rtwn/usb/rtwn_usb_tx.c
205

sure but I think I'm gonna tackle this one next, it's a common problem in the net80211 drivers

sys/dev/rtwn/usb/rtwn_usb_attach.c
163–164

what do you mean by group here?

The blank line splits awkwardly: the STAILQ_INIT for uc_tx_inactive and the loop with the STAILQ_INSERT_HEAD; I'd remove the blank line or move it above the STAILQ_INIT(..tx_inactive) given there seem to be two "blocks" of logic now.

Add a device_printf() to log when USB transfer errors happen, so it's easier to figure out what happened

hm, looks like it may be a USB controller problem, hmph. Stay tuned.

I do have a patch for this in my tree. Check PR/247528 for the fix.

oh it's THIS bug, ha! ok, lemme go re-review D38854 and see if we can figure out how to get it in cleanly.

oh it's THIS bug, ha! ok, lemme go re-review D38854 and see if we can figure out how to get it in cleanly.

XHCI vs EHCI.

address spacing comment from bz@

ok, please re-review this!

I'm still seeing some traffic hangs, so there's still something subtle going on, but it's at least not constant crashing anymore!

ok, i found the rest of the issue - in the EP setup, the EP array has all four endpoints to be configured, but then the init code manually walks the TX bulk endpoints and assigns them to the four endpoints.
In the case of the RTL8811AU it supports 4 TX bulk endpoints, but BK/BE are mapped to the same physical USB endpoint.
So, if I did parallel traffic between VO, VI and one of BK/BE, it was fine. If I did parallel traffic with BK and BE, traffic stalls.
If I then mapped BK and BE to different endpoints? It works fine.
But I can't DO that because there are other places in usb_tx_ep.c that setup some TX queue mapping bits in the hardware that I don't quite understand, and they all definitely only handle 1, 2 or 3 endpoints.
Also - the traffic stalls come back on the RTL8192EU because it has three TX bulk endpoints, so there's remapping still going on there.

The /real/ solution will be to only use a specific config, where we are guaranteed to have a single endpoint for each. However, this code in this diff is still a pre-requisite so all of the xfers line up correctly with their endpoints.

add local wme2qid fix to ensure endpoints are uniquely used.

ok, this diff now fixes both of the incorrect transfer / EP configuration issues, and everything seems fine now on the RTL NICs I've tested.

oops, remove duplicate code

This revision was not accepted when it landed; it landed in state Needs Review.Nov 21 2024, 1:57 AM
This revision was automatically updated to reflect the committed changes.