Page MenuHomeFreeBSD

Optimise LRO for RSS
ClosedPublic

Authored by hselasky on Jan 13 2016, 7:47 AM.
Tags
None
Referenced Files
Unknown Object (File)
Tue, Jan 21, 8:44 PM
Unknown Object (File)
Sun, Jan 19, 8:19 PM
Unknown Object (File)
Fri, Jan 17, 11:03 PM
Unknown Object (File)
Fri, Jan 17, 9:47 PM
Unknown Object (File)
Fri, Jan 17, 9:17 PM
Unknown Object (File)
Mon, Jan 13, 5:47 AM
Unknown Object (File)
Sat, Jan 11, 7:15 PM
Unknown Object (File)
Sat, Jan 11, 5:28 PM

Details

Summary

Add optimizing LRO wrapper:

  • Add optimizing LRO wrapper which pre-sorts all incoming packets according to the hash type and flowid. This prevents exhaustion of the LRO entries due to too many connections at the same time. Testing using a larger number of higher bandwidth TCP connections showed that the incoming ACK packet aggregation rate increased from ~1.3:1 to almost 3:1. Another test showed that for a number of TCP connections greater than 16 per hardware receive ring, where 8 TCP connections was the LRO active entry limit, there was a significant improvement in throughput due to being able to fully aggregate more than 8 TCP stream. For very few very high bandwidth TCP streams, the optimizing LRO wrapper will add CPU usage instead of reducing CPU usage. This is expected. Network drivers which want to use the optimizing LRO wrapper needs to call "tcp_lro_queue_mbuf()" instead of "tcp_lro_rx()" and "tcp_lro_flush_all()" instead of "tcp_lro_flush()". Further the LRO control structure must be initialized using "tcp_lro_init_args()" passing a non-zero number into the "lro_mbufs" argument.
  • Make LRO statistics 64-bit. Previously 32-bit integers were used for statistics which can be prone to wrap-around. Fix this while at it and update all SYSCTL's which expose LRO statistics.
  • Ensure all data is freed when destroying a LRO control structures, especially leftover LRO entries.
  • Reduce number of memory allocations needed when setting up a LRO control structure by precomputing the total amount of memory needed.
  • Add own memory allocation counter for LRO.
  • Bump the FreeBSD version to force recompilation of all KLDs due to change of the LRO control structure size.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

hselasky retitled this revision from to Optimise LRO for RSS.
hselasky updated this object.
hselasky edited the test plan for this revision. (Show Details)
hselasky added reviewers: gallatin, scottl.
hselasky set the repository for this revision to rS FreeBSD src repository - subversion.
hselasky edited edge metadata.

Fix minor style nit.

hselasky edited edge metadata.

Rebase patch to latest 11-current.

hselasky edited edge metadata.

Fix all build issues for AMD64 due to changing sizeof LRO statistics variables.

Initial testing shows that for 64 simulaneous TCP connections on the same RX ring utilizing 100% CPU in both testcases, the throughput roughly doubles:

Before RSS LRO: 4-5GBit/s
With MLX5EN modified to use RSS LRO: 10.4 GBit/s

Tested using iperf on a 100GBit/s link.

hselasky edited edge metadata.

Optimised away one "if". Suggested by Mark Bloch.

hselasky edited edge metadata.

Fix whitespace: Space to tabs conversion.

sbruno edited edge metadata.

Unblocking "IntelNetworking" reviewers for the changes to ixl(4).

Look good to me. It also refactors out most commonly used lro_flush pattern in many drivers, i.e. when rx_pkts > rx_pkts_max, flush all active LRO entries.

sys/netinet/tcp_lro.c
753–754 ↗(On Diff #12262)

Or just KASSERT these?

sys/netinet/tcp_lro.c
753–754 ↗(On Diff #12262)

I was thinking that if drivers for some reason call this code before or after init, that it wouldn't panic. KASSERT() would work fine for MLX5EN, but I'm not sure how other network drivers are tearing down their RX parts.

rrs added a reviewer: rrs.
rrs added a subscriber: rrs.

This looks great though I am not sure all the names make sense.. it might be better
to rename some of the things so its not confusing with respect to the old way that the
LRO code worked.

sys/netinet/tcp_lro.c
753–754 ↗(On Diff #12262)

If you KASSERT then you still need the free for non-invariant case otherwise we would
see mbuf leaks if the condition happened.

This revision is now accepted and ready to land.Jan 13 2016, 4:33 PM

@rrs Do you have suggestions for better function names?

BTW: I think this patch has a positive effect on the CPU cache utilization. The PCBGROUP feature selects bucket based on the flowid, and when mbufs are sorted according to the flowids, the likelyhood of the next bucket selected already being in the CPU cache increases. Further if the PCBGROUP code sorted the entries in the the lists by "flowid", the lookup time can be reduced further, at best into a single-pass for the LISTS's involved, iff the last looked up entry can be cached/stored somehow ...

General comments:

  • First, this is an awesome idea. In practice, with machines with tens of thousands of connections, this increases our aggregation rate from ~1.3:1 to almost 3:1. Just fantastic. Thanks for thinking out of the box!
  • Naming: I understand why you named things the way you did (tcp_lro_rss_flush, tcp_lro_rss_rx), but perhaps the code would be a bit more understandable with different names. The names caused me substantial confusion, and also caused confusion when I was walking another person through this review. I think avoiding duplicating the _rx and _flush names would be helpful to eliminate confusion between the new functions, and the old ones that the wrap. Perhaps tcp_lro_rss_rx() could be something like tcp_lro_store_mb() ? And tcp_lro_rss_flush could be something like tcp_lro_rss_submit()?
  • It would be interesting to see stats on RSS hash collisions (eg, the max number of active aggregations). It may be that we could reduce the number of LRO aggregations down to 2. One that is almost always used, and one for the rare case of a collision.

I'll wait and see a bit if there are more comments popping in and then generate a new patch with updated function names:

tcp_lro_rss_rx -> tcp_lro_put_mbuf
tcp_lro_rss_init -> tcp_lro_init_mbuf
tcp_lro_rss_flush -> tcp_lro_flush_mbufs
tcp_lro_rss_free -> tcp_lro_free (merge into one function only)

hselasky edited edge metadata.

Updated function naming.

This revision now requires review to proceed.Jan 14 2016, 6:05 AM
hselasky edited edge metadata.

Add some predict false macros.

Add better malloc pool description.

gallatin edited edge metadata.
This revision is now accepted and ready to land.Jan 19 2016, 2:16 PM
gnn added a reviewer: gnn.
This revision was automatically updated to reflect the committed changes.