bnxt: Add SR-IOV support for Stratus 100G NIC
ClosedPublic

Authored by bhargava.marreddy_broadcom.com on Nov 27 2017, 2:25 PM.

Details

Summary
  1. Added support for Short-HWRM since it is required for stratus to support SR-IOV
  2. Since few BRCM A0 NICs can not handle < 52 byte Tx pkts, implemented padding logic.
  3. Taken care of other minor issues while claiming VFs.
Test Plan

Tested VF-to-VF, VF-to-PF and VF-to-external cases, Jumbo, VLAN and other basic functionality.

Diff Detail

Repository
rS FreeBSD src repository
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.
  1. Added support for Short-HWRM since it is required for stratus to support SR-IOV
  2. Since few BRCM A0 NICs can not handle < 52 byte Tx pkts, implemented padding logic.
  3. Taken care of other minor issues while claiming VFs.
bhargava.marreddy_broadcom.com edited the test plan for this revision. (Show Details)
shurd added a comment.Nov 27 2017, 9:17 PM

It's not clear how this change "Add[s] SR-IOV support for Stratus 100G NIC". A better description should be added.

It's also unclear what "BRCM A0 NICs" refers to.

How does the 52 octet limit interact with encapsulation? If a VLAN is used for example, does the limit change?

bnxt.h
239 ↗(On Diff #35854)

It sounds like this varies by device, and is not a fixed constant for bnxt devices in general.

bnxt_txrx.c
119 ↗(On Diff #35854)

Which NICs? A0 of what chip? Why should all chips have this limit enforce if it only affects a few A0s?

bnxt_txrx.c
119 ↗(On Diff #35854)

Stephen,

Linux driver is doing it for all NICs, I'll check with firmware team and get back.
Meanwhile, do you think this is right way to pad tx pkts? Please let me know if any API present for padding.

Thanks

shurd added inline comments.Nov 30 2017, 9:56 PM
bnxt_txrx.c
119 ↗(On Diff #35854)

Well, it certainly looks wrong... I don't think there's a guarantee that the address has extra bytes available, and nothing is zeroing those bytes.

Further, since the minimum Ethernet frame size is 64, and an Ethernet packet contains a frame, I don't even know how you're generating traffic that would require padding... it seems like those would be invalid Ethernet packets regardless.

How are you testing this padding?

How are you testing this padding?

<Chenna> ARP response pkt is less than 60 bytes, right? I'm able to repro using ARP response pkt only.

shurd added a comment.Dec 1 2017, 5:41 AM

How are you testing this padding?

<Chenna> ARP response pkt is less than 60 bytes, right? I'm able to repro using ARP response pkt only.

The ARP packet itself is 28 bytes, yes... but the Ethernet packet which contains it will pad the payload out to 46 bytes (42 if there's a single VLAN tag), making the Ethernet frame 64 bytes and the Ethernet packet will be 72 bytes. Are you saying the Ethernet *payload* has a minimum limit of 52 bytes on some controllers? Or is it that there's a minimum frame size of 70 bytes?

That is to say, will it work with a 48 byte payload if there's a VLAN tag, and a 44 byte payload if it's double-tagged (which keeps the Ethernet frame at 70 bytes)? This seems most likely.

Regardless, removing the ability to send a minimal, valid Ethernet frame from all future hardware that is capable of it because some flaky A0 chips require a work-around doesn't seem like a very good idea.

If we do need this workaround, it likely needs to be added in iflib so it can append to the mbuf before being passed to the encap function, since the encap function doesn't get mbufs.

This is somewhat strange though as historically, the hardware would pad short TX frames, so the driver would be free to send runt frames and have them fixed by hardware. There were some bge devices that we had to pad out to the Ethernet minimum if checksum offload was enabled due to checksum offload issues, but since you see it in ARP, that's clearly not the case here. I would be very interested in further detail about this issue with "some A0 chips" if they're available (errata, etc).

I wasn't planning on working on a "quirk" abstraction for iflib for another month or two, so having some idea of how many chips this impacts in the wild would be very useful, and limiting the workaround to only chips that require it is very desirable.

shurd added a comment.Dec 1 2017, 5:57 AM

How are you testing this padding?

<Chenna> ARP response pkt is less than 60 bytes, right? I'm able to repro using ARP response pkt only.

The ARP packet itself is 28 bytes, yes... but the Ethernet packet which contains it will pad the payload out to 46 bytes (42 if there's a single VLAN tag), making the Ethernet frame 64 bytes and the Ethernet packet will be 72 bytes. Are you saying the Ethernet *payload* has a minimum limit of 52 bytes on some controllers? Or is it that there's a minimum frame size of 70 bytes?

...

This is somewhat strange though as historically, the hardware would pad short TX frames, so the driver would be free to send runt frames and have them fixed by hardware. There were some bge devices that we had to pad out to the Ethernet minimum if checksum offload was enabled due to checksum offload issues, but since you see it in ARP, that's clearly not the case here. I would be very interested in further detail about this issue with "some A0 chips" if they're available (errata, etc).

I just thought of another possibility... the hardware may only be able to pad frames that are 52 bytes or larger, but it still pads them out to 64 bytes... so basically the "padding offload" feature is broken on these chips. This would mean that legal minimum sized frames can be transmitted etc, just that they need special preparation. This still means an iflib quirk thing, and additional processing when transmitting small packets (where it hurts the worst), but at least all legal frames can still be sent. This seems to match up with your patch too.

Since small packet performance is fairly critical, you absolutely don't want to pay this price for every small packet if the hardware will properly pad outside of a list of A0 chips.

shurd added a comment.Dec 1 2017, 6:57 PM

I just thought of another possibility... the hardware may only be able to pad frames that are 52 bytes or larger, but it still pads them out to 64 bytes... so basically the "padding offload" feature is broken on these chips. This would mean that legal minimum sized frames can be transmitted etc, just that they need special preparation. This still means an iflib quirk thing, and additional processing when transmitting small packets (where it hurts the worst), but at least all legal frames can still be sent. This seems to match up with your patch too.

I've added a new flag that should pad frames to arbitrary sizes as D13324. Please try this out and report back in that review. Setting scctx->isc_min_frame_size to zero for non-broken hardware should prevent the worst of the performance issues for now. When per-instance quirks are added, I'll move the flag into there.

D13324 as is should be padding all frames for all bnxt devices, which you say is the same behaviour as Linux.

bnxt: Removed Tx padding logic from earlier patch

since iflib supporting it as part of D13324.
shurd added a comment.Dec 4 2017, 6:58 PM

Looks good, but I don't really like the title. I'm thinking of a commit message more like the following:

Add support for short HWRM commands

Short HWRM commands are required to suppport VFs on the Stratus 100G devices.
While we're here, remove nvram sysctls from VFs, and fix some big endian issues.

It would have been nice if this had been three separate reviews (endian issues, nvram sysctls, short commands). It's fairly easy to split up when I commit, but all changes need to be approved before any changes are committed when it's a combined patch like this... just something to keep in mind for future reviews.

if_bnxt.c
887 ↗(On Diff #36180)

Since the data this is populated with isn't collected for VFs anymore, this should be in a if (BNXT_PF(softc)) { block along with the allocation/freeing of softc->nvm_info->nvm_ctx in bnt_sysctl.c as well.

We don't want a bunch of useless data in the sysctl output.

This revision was automatically updated to reflect the committed changes.
shurd reopened this revision.Dec 5 2017, 9:02 PM

Wrong URL pasted in D13324 commit. :-(

This revision was not accepted when it landed; it landed in state Needs Review.Dec 19 2017, 9:07 PM
Closed by commit rS327000: Support short HWRM commands (authored by shurd, committed by ). · Explain Why
This revision was automatically updated to reflect the committed changes.