Page MenuHomeFreeBSD

vxlan(4): Support for stateless NIC hardware offloads with VXLAN encapsulated traffic.
ClosedPublic

Authored by np on Jul 29 2020, 12:56 AM.

Details

Summary

This lets a VXLAN pseudo-interface take advantage of hardware
checksumming (tx and rx), TSO, and RSS if the NIC is capable of
performing these operations on inner VXLAN traffic.

A VXLAN interface inherits the capabilities of its vxlandev interface
if one is specified or of the interface that hosts the vxlanlocal
address. If other interfaces will carry traffic for that VXLAN then
they must have the same hardware capabilities.

On transmit, if_vxlan verifies that the outbound interface has the
required capabilities and then translates the CSUM_ flags to their inner
equivalents. This tells the hardware ifnet that it needs to operate on
the inner frame and not the outer VXLAN headers.

An event is generated when a VXLAN ifnet starts. This allows hardware
drivers to configure their devices to expect VXLAN traffic on the
specified incoming port.

On receive, the hardware does RSS and checksum verification on the inner
frame. if_vxlan now does a direct netisr dispatch to take full
advantage of RSS. It is not very clear why it didn't do this already.

According to RFC 7348 a VTEP _must_ accept VXLAN frames with zero UDP
checksum but prior to RFCs 6935 and 6936 IPv6 mandated that a host
_must_ drop UDP/IPv6 frames with zero checksum. These changes add a
single port, meant for VXLAN use, for which UDP/IPv6 will allow zero
checksums. I believe this complies with 6935 and 6936 section 4.

All of these changes were developed and tested with cxgbe(4) hardware.
It should be relatively straightforward to support other stateless
FOO-over-UDP encapsulations with these changes.

Future work:

  • Rx: it should be possible to avoid the first trip up the protocol stack to get the frame to if_vxlan just so it can decapsulate and requeue for a second trip up the stack. The hardware NIC driver could directly call an if_vxlan receive routine for VXLAN traffic instead.
  • Rx: LRO. depends on what happens with the previous item. There will have to to be a mechanism to indicate that it's time for if_vxlan to flush its LRO state.
  • VXLAN over lagg over hw ifnet.

Diff Detail

Repository
rS FreeBSD src repository
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

np requested review of this revision.Jul 29 2020, 12:56 AM
kib retitled this revision from vxlan(4): Support for stateless NIC hardware offloads with VXLAN encapsulated traffic. to vxlan(4): Support for stateless NIC hardware offloads with VXLANencapsulated traffic..Jul 29 2020, 7:38 AM
kib retitled this revision from vxlan(4): Support for stateless NIC hardware offloads with VXLANencapsulated traffic. to vxlan(4): Support for stateless NIC hardware offloads with VXLAN encapsulated traffic..
kib added a subscriber: menyy_mellanox.com.
sys/sys/mbuf.h
645 ↗(On Diff #75099)

Should you use an enum instead of a bit-mask? Because only one bit will be set at a time!

May you please consider splitting the patch into functional groups?

  • VXLAN HW capabilities
  • Spelling errors
  • Logic cleanup "xxx == 0" vs "!xxx"
  • style cleanup
sys/dev/cxgbe/common/common.h
250 ↗(On Diff #75099)

"bool" does not offer a definite storage size. You may consider

uint16_t rx_pkt_encap : 1;

instead.

sys/dev/cxgbe/common/t4_hw.c
9631–9633 ↗(On Diff #75099)

Is this comment outdated?

np marked an inline comment as done.Jul 29 2020, 3:08 PM

May you please consider splitting the patch into functional groups?

  • VXLAN HW capabilities
  • Spelling errors
  • Logic cleanup "xxx == 0" vs "!xxx"
  • style cleanup

I couldn't figure out a good way to post a series of dependent patches for review. Does phabricator support this?

I can split this into at least two logical parts: infrastructure changes (everything except sys/dev) and driver changes (sys/dev) before commit but it's hard to review one without the other so it's one big patch here. Even if I were to split them, which commit should refer to this differential? I think it should be the kernel parts because that's the one I'd really like the reviewers to look at.

sys/dev/cxgbe/common/t4_hw.c
9631–9633 ↗(On Diff #75099)

Yes, it is. I'll get rid of it and mark it done.

sys/sys/mbuf.h
645 ↗(On Diff #75099)

Some of these csum bits are mutually exclusive (like TCP or UDP, v4 or v6) but others are set together so an enum is not appropriate. For example, a TSO/IPv4 request will have CSUM_TSO, CSUM_IP_TCP and CSUM_IP all set at the same time. The "INNER" bits just mirror the outer bits.

np marked an inline comment as done.Jul 29 2020, 3:10 PM
np marked an inline comment as not done.
np marked an inline comment as done.Jul 29 2020, 6:32 PM

You can link reviews in phabricator using "Edit related revisions" where "child" revisions depend on "parent" revisions. Once you link revisions you then get a "Stack" tab in the Revision Contents window that shows the related revisions and gives a summary of their state, etc. D25732 is a review I currently have open where I did this (all the GCC6 fixes are children of the knob to turn on a GCC6 tinderbox option).

Aren't we soon running out of flag bits in certain integer values?

How about using send tags for VXLAN offloading?

Is the VXLAN feature compatible with TLS HW offload?

Or will there be a conflict there?

Aren't we soon running out of flag bits in certain integer values?

Given that the inner flags mirror the outer ones, why not use a different variable inner_flags with the same set of flags?

Aren't we soon running out of flag bits in certain integer values?

We're already out of bits in csum_flags (after this revision). I think we can treat the caps as rx caps or tx caps depending on the direction of the mbuf (outbound/inbound) like we do for snd_tag/rcvif. That might buy us some more time. I would have liked to add an inner_csum_flags, and inner_csum_data to struct mbuf but increasing its size is not a popular proposition so I tried to make do with the available bits in csum_flags.

How about using send tags for VXLAN offloading?

Send tags are typically used for stateful offload. We could in theory have send tags that are not allocated from a NIC and are just universal stateless tags but why have half the CSUM_ bits in csum_flags and the rest in a send_tag? csum_flags is the natural place for new CSUM_ bits.

Is the VXLAN feature compatible with TLS HW offload?

Or will there be a conflict there?

This could be handled with something like what vlan does for RATELIMIT and KTLS. Vxlan does not have a tight coupling with the hw interface so it will not be as simple, but it is possible for an mbuf to have a TLS snd_tag and CSUM_INNER_* bits set at the same time.

Aren't we soon running out of flag bits in certain integer values?

Given that the inner flags mirror the outer ones, why not use a different variable inner_flags with the same set of flags?

That would work, but it would increase the size of struct mbuf.

In D25873#573627, @np wrote:

Aren't we soon running out of flag bits in certain integer values?

Given that the inner flags mirror the outer ones, why not use a different variable inner_flags with the same set of flags?

That would work, but it would increase the size of struct mbuf.

So let's move the vxnet specific flags into a mbuf_tag.

In D25873#573627, @np wrote:

Aren't we soon running out of flag bits in certain integer values?

Given that the inner flags mirror the outer ones, why not use a different variable inner_flags with the same set of flags?

That would work, but it would increase the size of struct mbuf.

So let's move the vxnet specific flags into a mbuf_tag.

The goal is to improve vxlan's performance. mbuf tags would add at least an alloc/free and extra access(es) to the tags (which are on a linked list) on the hot path. Why even consider any other location for the flags if there _are_ bits available in the proper place right now. We'll be out of them after this change, we aren't out already.

Looking at all the capability flags, I really wonder if one couldn't save some flag-bits by consolidating some of the seperated-out v4 and v6 flags into codepoints; I would assume that hardware supporting e.g. TSO6 but not TSO4 is quite uncommon (as with CSUM, TLS and WOL support), and some of those flags could be consolidated into codepoints (enum) rather than individual bits - if subsequently there is a dire need to conserve flag bits. IMHO bikeshedding around running low on flag bits in this diff is not strictly necessary.

In D25873#573638, @np wrote:

The goal is to improve vxlan's performance. mbuf tags would add at least an alloc/free and extra access(es) to the tags (which are on a linked list) on the hot path. Why even consider any other location for the flags if there _are_ bits available in the proper place right now. We'll be out of them after this change, we aren't out already.

Where VNI is currently stored when packet is pushed up or down to driver, for stateless offload ?

In D25873#573657, @kib wrote:
In D25873#573638, @np wrote:

The goal is to improve vxlan's performance. mbuf tags would add at least an alloc/free and extra access(es) to the tags (which are on a linked list) on the hot path. Why even consider any other location for the flags if there _are_ bits available in the proper place right now. We'll be out of them after this change, we aren't out already.

Where VNI is currently stored when packet is pushed up or down to driver, for stateless offload ?

On transmit the VXLAN headers (including the VNI) are filled up by if_vxlan and prepended to the packet in vxlan_encap_header and are available to the driver in the mbuf's payload. On receive the driver does not strip off the VXLAN header and so the VNI is passed on to if_vxlan in the UDP payload and gets consumed in vxlan_rcv_udp_packet. This is how it has always been and there are no changes in this area in this revision.

Disable TSO when tx checksumming is disabled.
Add counters for checksums and TSO.

Handle rxcsum and rxcsum6 correctly.

Tx counters that track hw assistance should consider all inner traffic
irrespective of the IP version of the outer encapsulation.

Do not set CSUM_L3_CALC or CSUM_L3_VALID for received IPv6 frames; it doesn't
have an L3 checksum. This also allows if_vxlan to distinguish between IPv4 and
IPv6 traffic without having to inspect the inner frame. It needs the IP version
to support independent control of rxcsum and rxcsum6 capabilities.

This patch naturally splits into many chunks, for instance RFC7348 port, inner checksums infra, events, driver bits. Do you plan to do more fine-grained commits that one big ci ?

sbin/ifconfig/ifconfig.8
601 ↗(On Diff #75903)

... inner checksum ... ?

603 ↗(On Diff #75903)

s/issued/configured/ ?

share/man/man4/vxlan.4
203 ↗(On Diff #75903)

.Er ENXIO

sys/dev/cxgbe/t4_main.c
11276 ↗(On Diff #75903)

Why the arg argument has void type instead of vxlan_evargs ?

sys/dev/cxgbe/t4_sge.c
2501 ↗(On Diff #75903)

Hmm, does it make sense to extract all that helpers into some common infrastructure header ?

sys/net/if_vxlan.c
1590 ↗(On Diff #75903)

Does this need a reason string ?

1593 ↗(On Diff #75903)

It should be if_printf.

1728 ↗(On Diff #75903)

There seems to be absent any synchronization between vxlan_start invocations, same for vxlan_stop. I see that you added refcount to maintain tcam entry liveness in the driver, but I suspect that we already have a problem at the higher level.

I believe that we want much more strict interface there, in the simplest case we can have global sx that protects all vxlan configuration actions.

2509 ↗(On Diff #75903)

Should this printf be rate-limited ? And again, perhaps if_printf for the nexthop ifnet ?

Perhaps we should add a counter for specific error, instead of potentially flooding printf ?

np marked 7 inline comments as done.Aug 19 2020, 8:38 PM
np added inline comments.
sbin/ifconfig/ifconfig.8
603 ↗(On Diff #75903)

The language was copy-pasted from the description of vlan capabilities. I'll fix both.

sys/dev/cxgbe/t4_main.c
11276 ↗(On Diff #75903)

It is invoked indirectly via t4_iterate and matches what t4_iterate deals with:

void t4_iterate(void (*)(struct adapter *, void *), void *);

sys/dev/cxgbe/t4_sge.c
2501 ↗(On Diff #75903)

They could be moved but are all needed in t4_sge.c only.

sys/net/if_vxlan.c
1590 ↗(On Diff #75903)

Yes, that's a good idea. I added if_printf's for all cases where we wanted to set the zero checksum port but couldn't.

1728 ↗(On Diff #75903)

I originally wanted to generate these events during vxlan create/destroy but the port can be changed when the vxlan interface is down. I don't think changing the VNI, addresses, or ports after create is a common use case so I'd even be ok with freezing them on create. start/stop could then be invoked in create/destroy and wouldn't need a lock. What do you (and others who may be reading this) think?

Otherwise I'll add a global sx.

2509 ↗(On Diff #75903)

This is leftover debug. There is value in displaying the caps that mismatched so I changed this to a rate limited if_printf.

np marked 4 inline comments as done.

incorporate some feedback from kib@

In D25873#579567, @kib wrote:

This patch naturally splits into many chunks, for instance RFC7348 port, inner checksums infra, events, driver bits. Do you plan to do more fine-grained commits that one big ci ?

I'll definitely split the kernel and driver parts before commit but wasn't planning to split the kernel parts any further.

sys/dev/cxgbe/t4_sge.c
2501 ↗(On Diff #75903)

I mean, this seems to be useful for other drivers.

sys/net/if_vxlan.c
1728 ↗(On Diff #75903)

I do not follow. Are you suggesting that there would be no events to driver on up/down of the vxlan child interface ? I am not sure we can handle it (we might be able, but I am not sure).

Unsynchronized events could cause driver to see paradoxical updates, like your vxlan_refcounter underflow.

sys/net/if_vxlan.c
1728 ↗(On Diff #75903)

The events will become vxlan_create/destroy rather that vxlan_start/stop and will be generated exactly once per vxlan interface rather than on every up and down. create/destroy do not race like start/stop so we wouldn't need a lock. But this requires the port to be immutable after create and that's not the case today.

The driver would do exactly what it does in the start/stop events right now.

And the refcount underflow check is for the case where a VXLAN is created/started before the NIC driver is kldloaded. It will miss the create/start event but will see the destroy/stop and get confused. The real fix would be to have a count of VXLAN interfaces somewhere and have the driver seed its initial refcount with this value. But if_vxlan itself is a loadable module so this counter would have to be in the kernel proper, or else there would have to be a query function in if_vxlan that is called indirectly (like the vlan functions). I thought all this was too complicated for an unusual case. If users make sure all NIC drivers are loaded before creating VXLANs then there is no problem with missed events.

In D25873#580051, @np wrote:

And the refcount underflow check is for the case where a VXLAN is created/started before the NIC driver is kldloaded. It will miss the create/start event but will see the destroy/stop and get confused. The real fix would be to have a count of VXLAN interfaces somewhere and have the driver seed its initial refcount with this value. But if_vxlan itself is a loadable module so this counter would have to be in the kernel proper, or else there would have to be a query function in if_vxlan that is called indirectly (like the vlan functions). I thought all this was too complicated for an unusual case. If users make sure all NIC drivers are loaded before creating VXLANs then there is no problem with missed events.

I mean that without a lock, sequence of CREATE/DESTROY/CREATE/DESTROY can be seen by the driver as CREATE/DESTROY/DESTROY/CREATE.

That said, mce(4) creates flow tables ('populates TCAM' is probably the closest) on interface up. We cannot easily do that when vxlan is created if upper interface is down.

In D25873#580052, @kib wrote:

That said, mce(4) creates flow tables ('populates TCAM' is probably the closest) on interface up. We cannot easily do that when vxlan is created if upper interface is down.

For vlans, we also update flow table and modify vport context when the physical interface is up. Since the only information we need to know about vlans is its number (pcp is inline), it is easy. If we need to remember sockaddrs, it is doable but somewhat more cumbersome. Would we need more information, driver starts mirroring the database from vxlan module.

Add a global sx and hold it around the VXLAN start/stop events.

sys/net/if_vxlan.c
1747 ↗(On Diff #76383)

I do not quite understand what this locking achieve. It serializes calls into driver, but driver can do that internally as well.

We discussed that all ups and downs for vxlan should be mutually excluded to avoid reordering of events as seen by driver.

Serial all VXLAN start/stops (even across different VXLAN interfaces).

Thanks for the review, kib@.

Is anyone else interested in reviewing this? I'll commit this in about 3 days from now (Sep 4) unless someone tells me they're planning to review this and want me to wait.

This revision was not accepted when it landed; it landed in state Needs Review.Sep 18 2020, 1:38 AM
This revision was automatically updated to reflect the committed changes.