
vmxnet3: skip zero-length descriptor in the middle of a packet
ClosedPublic

Authored by avg on Nov 30 2021, 1:56 PM.

Details

Summary

Passing such descriptors up to iflib is obviously wasteful.
But the main concern is that we may overrun the iri_frags array because of
them. That has been observed in practice.

Also, assert that the number of fragments / descriptors / segments is
less than IFLIB_MAX_RX_SEGS.

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

avg requested review of this revision. Nov 30 2021, 1:56 PM
avg added inline comments.
sys/dev/vmware/vmxnet3/if_vmx.c
1616

I am not sure if this really needs to be accounted separately, but I wanted additional visibility while testing the change.
Also, not sure about the name...

gallatin added a subscriber: gallatin.
gallatin added inline comments.
sys/dev/vmware/vmxnet3/if_vmx.c
1607

I assume rxcd->len != 0 is by far the common case, so I'd annotate with __predict_true() to help the compiler (and the reader).

This revision is now accepted and ready to land. Nov 30 2021, 2:10 PM

add __predict_true for the normal case

This revision now requires review to proceed. Nov 30 2021, 2:17 PM
avg marked an inline comment as done. Nov 30 2021, 2:18 PM
avg added inline comments.
sys/dev/vmware/vmxnet3/if_vmx.c
1607

Thank you for the suggestion!

This revision is now accepted and ready to land. Nov 30 2021, 5:00 PM

We can only guess at what is really going on in the virtual device, but this seems like a correct defensive approach to what has been observed.

Given the data at https://people.freebsd.org/~avg/vmxnet3-fragment-overrun.txt (thanks for collecting this!), I was concerned that the virtual device consuming free list entries for the zero-length fragments, while we hide that information from iflib, might break free list maintenance. I reviewed those paths, and it's fine: iflib tracks and replaces packet mbufs sent to the stack via a bitmap, provides the corresponding free list indexes to the driver during refill, and during refill the vmx driver correctly identifies and handles gaps in those provided indexes.

With respect to nrxsg_max, with this fix we are still hoping that the virtual device will not give us more than that many non-zero length fragments as part of a single packet (the one thing this episode seems to indicate is that the virtual device is willing to consume more than nrxsg_max free list entries for a single packet, which was not believed to be possible before). The additional KASSERT now checks this for INVARIANTS builds. If we were willing to pay the runtime cost of checking nfrags against IFLIB_MAX_RX_SEGS in non-INVARIANTS builds, that error condition could be handled by setting rxcd->error and breaking out of the loop before incrementing nfrags past IFLIB_MAX_RX_SEGS. Not sure it's worth it without first seeing proof of such a broken virtual device implementation.

I think the name for the new counter (and its corresponding sysctl), "vxcr_zero_length_after_sop", could be better, as it can be read to exclude a zero-length fragment that has sop set, even though it does count those. Calling it vxcr_zero_length_frag might be better. Then we'd have vxcr_zero_length counting zero-length packets and vxcr_zero_length_frag counting zero-length fragments within non-zero-length packets. The extent to which that holds strictly true depends on some assumptions the driver still makes about the fragment structure of packets. Not going to hold up the fix over this.

There remain assumptions about the fragment structure of packets in this driver. vmxnet3_isc_rxd_available() essentially assumes that the queue will never fill with packets that end in a zero-length fragment (if that happened, receive processing would stall), and vmxnet3_isc_rxd_pkt_get() effectively assumes that a packet starting with a zero-length fragment will consist of only one fragment.

avg marked an inline comment as done. Dec 1 2021, 8:02 AM

@pkelsey , thank you very much for the very detailed review and feedback.
It will certainly be useful if another problem (or a variation of the current one) strikes us.

I thought a bit about gracefully handling too many (non-empty) fragments for a packet.
My only concern is what happens if we abort after seeing SOP and before reaching EOP.
Might that confuse the next call to vmxnet3_isc_rxd_available / vmxnet3_isc_rxd_pkt_get?

better name for the new counter and its sysctl

This revision now requires review to proceed. Dec 1 2021, 8:07 AM
In D33189#750472, @avg wrote:

I thought a bit about gracefully handling too many (non-empty) fragments for a packet.
My only concern is what happens if we abort after seeing SOP and before reaching EOP.
Might that confuse the next call to vmxnet3_isc_rxd_available / vmxnet3_isc_rxd_pkt_get?

You are right, I overlooked the need to still advance cqidx to eop in that case (the new-packet detection code does not scan for sop; it assumes the next descriptor is always sop when it begins its work). I was focused on the free list management implications for fragments that had already been identified and recorded before IFLIB_MAX_RX_SEGS was reached, as well as for fragments that would not be recorded because an error handling path skipped them. I think those observations are still sound: following the rxcd->error processing for the fragments recorded to that point would be fine, as would skipping over any other fragments not yet recorded.

This revision is now accepted and ready to land. Dec 1 2021, 8:35 PM