Details

Reviewers

kib
rrs

Group Reviewers

transport

Commits

rGe31579b8558d: Add TCP LRO support for VLAN and VxLAN.
rGa9b66dbd9169: Allow the tcp_lro_flush_all() function to be called when the control
rG9ca874cf740e: Add TCP LRO support for VLAN and VxLAN.

Summary

Speedup header data comparison. Don't compare field by field, but instead use unsigned long.
Make smaller functions doing one thing at a time instead of big functions with lots of repeated code.
Try to refactor the TCP ACK compression code to be more readable. Predict number of ACKs which needs compression.
Use sbintime() for all time-keeping.
Try to shrink the size of the LRO entry, because it is frequently zeroed.

MFC after: 1 week
Sponsored by: Mellanox Technologies // NVIDIA Networking

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Skipped

Unit

Tests Skipped

Event Timeline

• hselasky created this revision.Apr 2 2021, 10:37 PM

Herald added a reviewer: transport. · View Herald TranscriptApr 2 2021, 10:37 PM

Herald added a subscriber: melifaro. · View Herald Transcript

• hselasky requested review of this revision.Apr 2 2021, 10:37 PM

Minor bugfix.

Fix UDP checksumming, when there is no UDP checksum. Optimise other checksumming.

Test OK: VXLAN
Test OK: NO-VLAN
Test OK: VLAN
TODO: PRIO-TAGGED

• hselasky added a reviewer: kib.Apr 3 2021, 10:40 AM

Add fixes for BBR/RACK.

Incorporate a fix for bad assert / panic.

gallatin added a subscriber: gallatin.Apr 7 2021, 3:03 PM

I may be reading this wrong, but for mbuf compression, you now seem to be appending a chain of packets, then coalescing them. The original design choice of coalescing as the new packet was encountered was intentional. The idea is that you want to operate on and free the incoming mbuf while it is hot in cache. By chaining mbufs and dealing with them later, they're likely to be cold in cache, so you're likely encounter additional cache misses on every mbuf in the chain as you copy out data and free it.

Hi Drew,

For sorted LRO the mbufs are still hot in the CPU cache. Remember we are flushing all the entries after each new flowid value!

--HPS

In D29564#664209, @hselasky wrote:

Hi Drew,

For sorted LRO the mbufs are still hot in the CPU cache. Remember we are flushing all the entries after each new flowid value!

--HPS

Remember that very few drivers use sorted LRO. Iflib based drivers do not, which is the vast majority of network traffic in FreeBSD.

(I tried to add sorted LRO to iflib, but wound up causing a small packet forwarding regression, so I abandoned the effort)

The advantage with my approach is:

Lookup the INP only once.
Less error handling.

Won't the same problem happen looking up INP's from 1000's of connections, that the CPU runs out of cache?

Then it is better to only process packets for one INP at a time?

@gallatin: Who can test this patch with non-sorted-LRO and TCPHPTS, to verify performance improvements or losses?

Incorporate changes from Randall.
Some minor style and spelling fixes.

Include more changes from Randall.

Fix a typo in comment.

Add minor fix from Randall: Don't try to compress TCP packets with data payload.

Isn't it a wrong layering?

Why not add lro hooks to the input processing code in vlan and vxlan interfaces, instead of coding incomplete understanding of the encapsulation formats directly into TCP stack? Let drivers handle all details of decapsulation and then TCP would see inner packets. I do not think that we gain _a lot_ by coalescing encapsulated packets, while this definitely breaks layering and might also break some features like firewalls (not sure about this).

Isn't it a wrong layering?

No, its an acceleration feature like TSO is.

LRO is disabled for routing, when IP forwarding is enabled.

Why not add lro hooks to the input processing code in vlan and vxlan interfaces, instead of coding incomplete understanding of the encapsulation formats directly into TCP stack? Let drivers handle all details of decapsulation and then TCP would see inner packets. I do not think that we gain _a lot_ by coalescing encapsulated packets, while this definitely breaks layering and might also break some features like firewalls (not sure about this).

Because that requires more CPU. Then you have to sort the incoming data multiple times, instead of only once. Also how would you implement lookahead of multiple packets for the same destination?

LRO is not for all encapsulation types. When encryption is used, LRO is not practical, except for TLS over TCP.

--HPS

I have done a complete set of regression tests in the NF lab on this patch. They
now all pass 100%. We have also take this patch to live NF machines and
done some performance analysis on 100G machines running 100% TLS.
The overall result is that you lose about 1% CPU in the old path and
you gain 2% CPU in the compressed ack path. So overall this is
a slight win or push. This is more than acceptable considering the new flexibility it
gives us!

This revision is now accepted and ready to land.Apr 19 2021, 11:35 AM

In D29564#669391, @kib wrote:

Isn't it a wrong layering?

Why not add lro hooks to the input processing code in vlan and vxlan interfaces, instead of coding incomplete understanding of the encapsulation formats directly into TCP stack? Let drivers handle all details of decapsulation and then TCP would see inner packets. I do not think that we gain _a lot_ by coalescing encapsulated packets, while this definitely breaks layering and might also break some features like firewalls (not sure about this).

I don't know the details well enough to form a strong option on this but I am wondering if it would be possible to generalize in the future to a dispatch table or somehow invert control back into the tunnel implementations. In particular I think this would be useful for other lite IP tunnels like GENEVE, gif(4), gre(4), and maybe the unfinished vpc(4) stack. My concern, if any, would be stated as let's not build ourselves into a corner for just VXLAN. If there is an opening for future tunnels, that is good.