
Add batched packet processing support.
Needs ReviewPublic

Authored by melifaro on Dec 24 2020, 2:05 PM.

Details

Reviewers
olivier
Group Reviewers
transport
Summary
NOTE: this is an early version, posted to gather high-level feedback and get a sense of the rate of improvement.

Diff Detail

Repository
rS FreeBSD src repository - subversion

Event Timeline

melifaro added a reviewer: olivier.
sys/net/netisr_internal.h
50–51

Missing a

typedef void *netisr_handler_m_t;

here?

Without it, I hit this error during a buildworld:

--- all_subdir_usr.bin/netstat ---
In file included from /usr/src/usr.bin/netstat/netisr.c:44:
/usr/src/amd64.amd64/tmp/usr/include/net/netisr_internal.h:64:2: error: unknown type name 'netisr_handler_m_t'; did you mean 'netisr_handler_t'?
    netisr_handler_m_t *np_handler_m;/* Protocol handler. */
    ^                                                                                                                                   
/usr/src/amd64.amd64/tmp/usr/include/net/netisr_internal.h:50:15: note: 'netisr_handler_t' declared here
    typedef void *netisr_handler_t;
                            ^

Fix netisr_handler_m_t definition for userland.

sys/netinet/ip_fastfwd.c
451

This printf needs to be commented out :-)

Small device:

x 7f4e724829e (2020/12/27): inet packets-per-second forwarded
+ 7f4e724829e (2020/12/27) with D27755: inet packets-per-second forwarded
+--------------------------------------------------------------------------+
|+  +       +   +          +        x               xx               x    x|
|                                         |__________M___A______________|  |
| |_________A__________|                                                   |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5        897815      910126.5        903309      904485.3     4890.7439
+   5        886306        894945        889997      889957.2     3447.4139
Difference at 95.0% confidence
        -14528.1 +/- 6170.78
        -1.60623% +/- 0.674941%
        (Student's t, pooled s = 4231.08)

Big device:

x 7f4e724829e (2020/12/27): inet packets-per-second forwarded
+ 7f4e724829e (2020/12/27) with D27755: inet packets-per-second forwarded
+--------------------------------------------------------------------------+
|      +                                                                   |
|+  +  +   +                                                      x  x xx x|
|                                                                  |__AM_| |
| |___AM__|                                                                |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5      33097916      33161594      33139005      33133205     23817.267
+   5      32595089      32676140      32638309      32634321     29675.115
Difference at 95.0% confidence
        -498884 +/- 39241
        -1.50569% +/- 0.117739%
        (Student's t, pooled s = 26906.1)

Flamegraphs here

Add batching for some tcp_lro-based paths.

Thank you for spending cycles on testing it!

Small device:

Is it AMD_GX-412TC_4Cores-Intel_i210AT?

Flamegraphs here

So, there was no chance of seeing any improvement here, as all traffic goes through tcp_lro_..(), which passes the packet to if_input() if it's not a local TCP packet and LRO is not on. I kinda missed that code path.

The good thing is that the batching code costs only ~1.5% when used on individual packets.

I've updated the code for cxgbe/mellanox (though it's compile-tested only). Any chance you could re-test with these?

Intel Xeon E5-2697Av4 (16Cores, 32 threads) with Mellanox ConnectX-4 MCX416A-CCAT (100GBase-SR4):

x 7f4e724829e (2020/12/27): inet packets-per-second forwarded
+ 7f4e724829e (2020/12/27) with D27755(v81242): inet packets-per-second forwarded
+--------------------------------------------------------------------------+
|+                      +         +          ++                       xxx x|
|                                                                     |_A| |
|          |__________________A___M_____________|                          |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5      33128914      33143191      33133842      33134431     5609.5414
+   5      32854209      33032823      32983818      32968514     73310.696
Difference at 95.0% confidence
	-165917 +/- 75824.5
	-0.50074% +/- 0.228832%
	(Student's t, pooled s = 51990)

Flamegraphs here

Interesting. Okay, I guess I need to think about it more.
Thank you for spending time on the tests!

Might it be better to just make a new if_input method which takes an array of mbufs? List processing for a batch of packets seems like a recipe for cache misses..

You're most probably right. I need to get a local system where I can actually look at pcm llc data and experiment with various options.

There is already a mechanism drivers can use to build an array of packets, which is used by LRO. E.g., tcp_lro_queue_mbuf(). It was quite helpful, not so much for the batching, but because it sorts the array to put packets from the same TCP connection adjacent to each other..

Yep, I looked into that one. Arrays are a bit tricky, as we have to split the initial array into IPv4/IPv6 (and later MPLS), preferably keeping these details away from the driver.
It's more a matter of experimentation what the best approach would be w.r.t. the array implementation, what a reasonable batch size would be, and so on.
In fact, I'd love to end up with a vectorized approach, similar to Cisco's VPP.