Page MenuHomeFreeBSD

vxlan(4): Support for stateless NIC hardware offloads with VXLAN encapsulated traffic.
Needs ReviewPublic

Authored by np on Wed, Jul 29, 12:56 AM.

Details

Summary

This lets a VXLAN pseudo-interface take advantage of hardware
checksumming (tx and rx), TSO, and RSS if the NIC is capable of
performing these operations on inner VXLAN traffic.

A VXLAN interface inherits the capabilities of its vxlandev interface
if one is specified or of the interface that hosts the vxlanlocal
address. If other interfaces will carry traffic for that VXLAN then
they must have the same hardware capabilities.

On transmit, if_vxlan verifies that the outbound interface has the
required capabilities and then translates the CSUM_ flags to their inner
equivalents. This tells the hardware ifnet that it needs to operate on
the inner frame and not the outer VXLAN headers.

An event is generated when a VXLAN ifnet starts. This allows hardware
drivers to configure their devices to expect VXLAN traffic on the
specified incoming port.

On receive, the hardware does RSS and checksum verification on the inner
frame. if_vxlan now does a direct netisr dispatch to take full
advantage of RSS. It is not very clear why it didn't do this already.

According to RFC 7348 a VTEP _must_ accept VXLAN frames with zero UDP
checksum but prior to RFCs 6935 and 6936 IPv6 mandated that a host
_must_ drop UDP/IPv6 frames with zero checksum. These changes add a
single port, meant for VXLAN use, for which UDP/IPv6 will allow zero
checksums. I believe this complies with 6935 and 6936 section 4.

All of these changes were developed and tested with cxgbe(4) hardware.
It should be relatively straightforward to support other stateless
FOO-over-UDP encapsulations with these changes.

Future work:

  • Rx: it should be possible to avoid the first trip up the protocol stack to get the frame to if_vxlan just so it can decapsulate and requeue for a second trip up the stack. The hardware NIC driver could directly call an if_vxlan receive routine for VXLAN traffic instead.
  • Rx: LRO. depends on what happens with the previous item. There will have to to be a mechanism to indicate that it's time for if_vxlan to flush its LRO state.
  • VXLAN over lagg over hw ifnet.

Diff Detail

Repository
rS FreeBSD src repository
Lint
Lint OK
Unit
No Unit Test Coverage
Build Status
Buildable 32733
Build 30175: arc lint + arc unit

Event Timeline

np created this revision.Wed, Jul 29, 12:56 AM
np requested review of this revision.Wed, Jul 29, 12:56 AM
kib retitled this revision from vxlan(4): Support for stateless NIC hardware offloads with VXLAN encapsulated traffic. to vxlan(4): Support for stateless NIC hardware offloads with VXLANencapsulated traffic..Wed, Jul 29, 7:38 AM
kib retitled this revision from vxlan(4): Support for stateless NIC hardware offloads with VXLANencapsulated traffic. to vxlan(4): Support for stateless NIC hardware offloads with VXLAN encapsulated traffic..
kib added a subscriber: menyy_mellanox.com.
hselasky added inline comments.Wed, Jul 29, 9:01 AM
sys/sys/mbuf.h
645

Should you use an enum instead of a bit-mask? Because only one bit will be set at a time!

May you please consider splitting the patch into functional groups?

  • VXLAN HW capabilities
  • Spelling errors
  • Logic cleanup "xxx == 0" vs "!xxx"
  • style cleanup
sys/dev/cxgbe/common/common.h
250

"bool" does not offer a definite storage size. You may consider

uint16_t rx_pkt_encap : 1;

instead.

sys/dev/cxgbe/common/t4_hw.c
9629–9630

Is this comment outdated?

np marked an inline comment as done.Wed, Jul 29, 3:08 PM

May you please consider splitting the patch into functional groups?

  • VXLAN HW capabilities
  • Spelling errors
  • Logic cleanup "xxx == 0" vs "!xxx"
  • style cleanup

I couldn't figure out a good way to post a series of dependent patches for review. Does phabricator support this?

I can split this into at least two logical parts: infrastructure changes (everything except sys/dev) and driver changes (sys/dev) before commit but it's hard to review one without the other so it's one big patch here. Even if I were to split them, which commit should refer to this differential? I think it should be the kernel parts because that's the one I'd really like the reviewers to look at.

sys/dev/cxgbe/common/t4_hw.c
9629–9630

Yes, it is. I'll get rid of it and mark it done.

sys/sys/mbuf.h
645

Some of these csum bits are mutually exclusive (like TCP or UDP, v4 or v6) but others are set together so an enum is not appropriate. For example, a TSO/IPv4 request will have CSUM_TSO, CSUM_IP_TCP and CSUM_IP all set at the same time. The "INNER" bits just mirror the outer bits.

np marked an inline comment as done.Wed, Jul 29, 3:10 PM
np marked an inline comment as not done.
np updated this revision to Diff 75137.Wed, Jul 29, 6:31 PM

Remove comment.

np marked an inline comment as done.Wed, Jul 29, 6:32 PM
jhb added a comment.Wed, Jul 29, 10:52 PM

You can link reviews in phabricator using "Edit related revisions" where "child" revisions depend on "parent" revisions. Once you link revisions you then get a "Stack" tab in the Revision Contents window that shows the related revisions and gives a summary of their state, etc. D25732 is a review I currently have open where I did this (all the GCC6 fixes are children of the knob to turn on a GCC6 tinderbox option).

Aren't we soon running out of flag bits in certain integer values?

How about using send tags for VXLAN offloading?

Is the VXLAN feature compatible with TLS HW offload?

Or will there be a conflict there?

Aren't we soon running out of flag bits in certain integer values?

Given that the inner flags mirror the outer ones, why not use a different variable inner_flags with the same set of flags?

np added a comment.Thu, Jul 30, 7:32 PM

Aren't we soon running out of flag bits in certain integer values?

We're already out of bits in csum_flags (after this revision). I think we can treat the caps as rx caps or tx caps depending on the direction of the mbuf (outbound/inbound) like we do for snd_tag/rcvif. That might buy us some more time. I would have liked to add an inner_csum_flags, and inner_csum_data to struct mbuf but increasing its size is not a popular proposition so I tried to make do with the available bits in csum_flags.

How about using send tags for VXLAN offloading?

Send tags are typically used for stateful offload. We could in theory have send tags that are not allocated from a NIC and are just universal stateless tags but why have half the CSUM_ bits in csum_flags and the rest in a send_tag? csum_flags is the natural place for new CSUM_ bits.

Is the VXLAN feature compatible with TLS HW offload?

Or will there be a conflict there?

This could be handled with something like what vlan does for RATELIMIT and KTLS. Vxlan does not have a tight coupling with the hw interface so it will not be as simple, but it is possible for an mbuf to have a TLS snd_tag and CSUM_INNER_* bits set at the same time.

np added a comment.Thu, Jul 30, 7:57 PM

Aren't we soon running out of flag bits in certain integer values?

Given that the inner flags mirror the outer ones, why not use a different variable inner_flags with the same set of flags?

That would work, but it would increase the size of struct mbuf.

In D25873#573627, @np wrote:

Aren't we soon running out of flag bits in certain integer values?

Given that the inner flags mirror the outer ones, why not use a different variable inner_flags with the same set of flags?

That would work, but it would increase the size of struct mbuf.

So let's move the vxnet specific flags into a mbuf_tag.

np added a comment.Thu, Jul 30, 8:31 PM
In D25873#573627, @np wrote:

Aren't we soon running out of flag bits in certain integer values?

Given that the inner flags mirror the outer ones, why not use a different variable inner_flags with the same set of flags?

That would work, but it would increase the size of struct mbuf.

So let's move the vxnet specific flags into a mbuf_tag.

The goal is to improve vxlan's performance. mbuf tags would add at least an alloc/free and extra access(es) to the tags (which are on a linked list) on the hot path. Why even consider any other location for the flags if there _are_ bits available in the proper place right now. We'll be out of them after this change, we aren't out already.

Looking at all the capability flags, I really wonder if one couldn't save some flag-bits by consolidating some of the seperated-out v4 and v6 flags into codepoints; I would assume that hardware supporting e.g. TSO6 but not TSO4 is quite uncommon (as with CSUM, TLS and WOL support), and some of those flags could be consolidated into codepoints (enum) rather than individual bits - if subsequently there is a dire need to conserve flag bits. IMHO bikeshedding around running low on flag bits in this diff is not strictly necessary.

kib added a comment.Thu, Jul 30, 10:29 PM
In D25873#573638, @np wrote:

The goal is to improve vxlan's performance. mbuf tags would add at least an alloc/free and extra access(es) to the tags (which are on a linked list) on the hot path. Why even consider any other location for the flags if there _are_ bits available in the proper place right now. We'll be out of them after this change, we aren't out already.

Where VNI is currently stored when packet is pushed up or down to driver, for stateless offload ?

np added a comment.Fri, Jul 31, 2:28 AM
In D25873#573657, @kib wrote:
In D25873#573638, @np wrote:

The goal is to improve vxlan's performance. mbuf tags would add at least an alloc/free and extra access(es) to the tags (which are on a linked list) on the hot path. Why even consider any other location for the flags if there _are_ bits available in the proper place right now. We'll be out of them after this change, we aren't out already.

Where VNI is currently stored when packet is pushed up or down to driver, for stateless offload ?

On transmit the VXLAN headers (including the VNI) are filled up by if_vxlan and prepended to the packet in vxlan_encap_header and are available to the driver in the mbuf's payload. On receive the driver does not strip off the VXLAN header and so the VNI is passed on to if_vxlan in the UDP payload and gets consumed in vxlan_rcv_udp_packet. This is how it has always been and there are no changes in this area in this revision.

np updated this revision to Diff 75331.Mon, Aug 3, 9:54 PM

Disable TSO when tx checksumming is disabled.
Add counters for checksums and TSO.