
vtnet: Fix an LOR in the input path
Accepted · Public

Authored by markj on Jul 11 2024, 2:50 PM.
Details

Reviewers
bryanv
zlei
gallatin
Group Reviewers
network
Summary

if_vtnet holds a queue lock when processing incoming packets. It sends
them up the stack without dropping the lock, which is incorrect. For
example, witness generates warnings when if_vtnet receives ND6 packets:

 may sleep with the following non-sleepable locks held:
exclusive sleep mutex vtnet1-rx0 (vtnet1-rx0) r = 0 (0xfffff80001790000) locked @ /root/freebsd/sys/dev/virtio/network/if_vtnet.c:2202
stack backtrace:
#0 0xffffffff80bd6785 at witness_debugger+0x65
#1 0xffffffff80bd78e3 at witness_warn+0x413
#2 0xffffffff80d8a504 at in6_update_ifa+0xcb4
#3 0xffffffff80db7030 at in6_ifadd+0x1e0
#4 0xffffffff80db3734 at nd6_ra_input+0x1024
#5 0xffffffff80d8430b at icmp6_input+0x5ab
#6 0xffffffff80d9db68 at ip6_input+0xc78
#7 0xffffffff80cbb21d at netisr_dispatch_src+0xad
#8 0xffffffff80c9db0a at ether_demux+0x16a
#9 0xffffffff80c9f15a at ether_nh_input+0x34a
#10 0xffffffff80cbb21d at netisr_dispatch_src+0xad
#11 0xffffffff80c9df69 at ether_input+0xd9
#12 0xffffffff80986240 at vtnet_rxq_eof+0x790
#13 0xffffffff809859fc at vtnet_rx_vq_process+0x9c
#14 0xffffffff80b19dc6 at ithread_loop+0x266
#15 0xffffffff80b161f2 at fork_exit+0x82
#16 0xffffffff8105322e at fork_trampoline+0xe

Simply drop the lock before entering the network stack. We could batch
this by linking a small number of packets using m_nextpkt; ether_input()
handles such batches.
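
For illustration only, a minimal sketch of the first part of that (this is not the committed diff; the helper name is hypothetical, while VTNET_RXQ_UNLOCK/VTNET_RXQ_LOCK and if_input() are the driver's existing queue-lock macros and the generic ifnet input routine):

    /*
     * Sketch only: hand one received packet to the stack with the rx queue
     * lock dropped, so nothing above ether_input() (e.g. nd6_ra_input())
     * runs while holding a non-sleepable mutex.
     */
    static void
    vtnet_rxq_input_one(struct vtnet_rxq *rxq, if_t ifp, struct mbuf *m)
    {
            VTNET_RXQ_UNLOCK(rxq);
            if_input(ifp, m);       /* the packet is now owned by the stack */
            VTNET_RXQ_LOCK(rxq);
    }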

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

markj requested review of this revision. Jul 11 2024, 2:50 PM
zlei added a subscriber: zlei.

I think this is certainly the right fix.

This revision is now accepted and ready to land. Jul 14 2024, 3:31 PM

Added Drew since he has been working on this driver recently.

Is this safe? I think so, but I confess that I don't know the low level details in this driver very well.

When I read a network driver, I view a lock around rx processing as an indicator that there is room for improvement in the design. The reason the lock seems to exist is to serialize rx ring servicing between the ithread (the normal path) and the taskqueue (which is woken if we continually find more packets to process... or maybe if interrupts don't work?). I don't really understand the code at the bottom of vtnet_rx_vq_process(). It seems like interrupts should be disabled when switching to the taskqueue, and enabled when returning from it. It probably has something to do with the "race" mentioned in that code.

sys/dev/virtio/network/if_vtnet.c:2061

This should probably move up above the LRO section.

Is this safe? I think so, but I confess that I don't know the low level details in this driver very well.

I believe so; from what I see, the lock exists to synchronize with the taskqueue and to protect some non-atomic counters.

When I read a network driver, I view a lock around rx processing as an indicator that there is room for improvement in the design. The reason the lock seems to exist is to serialize rx ring servicing between the ithread (the normal path) and the taskqueue (which is woken if we continually find more packets to process... or maybe if interrupts don't work?). I don't really understand the code at the bottom of vtnet_rx_vq_process(). It seems like interrupts should be disabled when switching to the taskqueue, and enabled when returning from it. It probably has something to do with the "race" mentioned in that code.

I don't quite understand that either. The comment above the definition of VTNET_INTR_DISABLE_RETRIES suggests to me that the idea is:

  • vtnet_rxq_eof() returns 0, so more == 0, i.e., there were no more descriptors to process.
  • vtnet_rxq_enable_intr() returns 1, meaning that we found some completed descriptors on the queue when enabling interrupts.
  • We should call vtnet_rxq_eof() again to collect those newly completed descriptors instead of deferring.
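
Putting those together, a rough sketch of what I think the bottom of vtnet_rx_vq_process() is doing (paraphrased from the description above, not quoted from the driver; vtnrx_tq and vtnrx_intrtask follow the driver's naming, but the real code differs in details):

        int more, tries;

        tries = 0;
    again:
        more = vtnet_rxq_eof(rxq);      /* drain completed rx descriptors */
        if (more == 0 && vtnet_rxq_enable_intr(rxq) != 0) {
                /*
                 * Descriptors completed while interrupts were being
                 * re-enabled: disable them again and collect the new work
                 * now, up to VTNET_INTR_DISABLE_RETRIES times.
                 */
                vtnet_rxq_disable_intr(rxq);
                if (++tries < VTNET_INTR_DISABLE_RETRIES)
                        goto again;
                more = 1;
        }
        if (more != 0) {
                /* Still backlogged: defer further processing to the taskqueue. */
                taskqueue_enqueue(rxq->vtnrx_tq, &rxq->vtnrx_intrtask);
        }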

I'm not sure I understand the purpose of the taskqueue at all though. Why can't we handle all packets in the ithread context?

sys/dev/virtio/network/if_vtnet.c:2061

Hmm, logically it should, yes, but then vtnet_lro_rx() is updating state in vtnrx_lro without any synchronization between the taskqueue and ithread.

I'm not sure I understand the purpose of the taskqueue at all though. Why can't we handle all packets in the ithread context?

Indeed, removing the rx taskqueue would simplify things greatly, thereby removing the need for an rx lock entirely.

@bryanv: Can you explain why we need the rx taskqueue?

I don't think we need the taskqueue. It's probably just a design copied from the Intel drivers, and I don't think it makes much sense for those either. The other thing that can be nice to do though when making this change is to instead build a temporary list of packets linked via m_nextpkt (mbufq works for this) and pass an entire batch to if_input. This lets you avoid dropping the lock as often.
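
For illustration, roughly how that batching could look in the rx loop (a sketch only: VTNET_RX_BATCH and vtnet_rxq_next_pkt() are hypothetical names; mbufq links packets through the field that overlays m_nextpkt, which is why the flushed chain can be handed straight to if_input()/ether_input()):

    #define VTNET_RX_BATCH  32      /* arbitrary batch size */

        struct mbufq mq;
        struct mbuf *m;

        mbufq_init(&mq, VTNET_RX_BATCH);
        while ((m = vtnet_rxq_next_pkt(rxq)) != NULL) { /* hypothetical: next parsed rx packet */
                if (mbufq_enqueue(&mq, m) != 0) {
                        /* Batch is full: flush it to the stack first. */
                        VTNET_RXQ_UNLOCK(rxq);
                        if_input(ifp, mbufq_flush(&mq));        /* m_nextpkt-linked chain */
                        VTNET_RXQ_LOCK(rxq);
                        (void)mbufq_enqueue(&mq, m);            /* cannot fail: queue now empty */
                }
        }
        if (mbufq_len(&mq) > 0) {
                /* Flush the final partial batch. */
                VTNET_RXQ_UNLOCK(rxq);
                if_input(ifp, mbufq_flush(&mq));
                VTNET_RXQ_LOCK(rxq);
        }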

In D45950#1073880, @jhb wrote:

I don't think we need the taskqueue. It's probably just a design copied from the Intel drivers, and I don't think it makes much sense for those either. The other thing that can be nice to do though when making this change is to instead build a temporary list of packets linked via m_nextpkt (mbufq works for this) and pass an entire batch to if_input. This lets you avoid dropping the lock as often.

What I dislike about m_nextpkt is that you trade pointer chasing for lock frobbing. I tend to think an rx ring lock is a kludge that's almost never really needed. But if enough drivers are doing this, we might want to consider an mbuf array method for if_input.

Hmm, OTOH, passing up a batch might mean you can amortize other costs (e.g. i-cache) of going up the stack? I'm not sure though. I have a WIP branch, btw, that passes batches of related packets up through several layers, to the point that udp_input() is able to get a batch of packets for the same inpcb, here: https://github.com/freebsd/freebsd-src/compare/main...bsdjhb:freebsd:cxgbe_batching It's a bit old and I probably need to rebase it. I have tested that it works, but I have not been able to benchmark it to show whether it's useful. In some ways it is a more generalized version of LRO, but only for consecutive packets, whereas LRO can group batches together when packets from multiple connections are interleaved. Possibly one could instead do more of the classification/batching down in the driver layer, similar to LRO but for UDP as well (QUIC would be a UDP use case, I think).