Diff Detail
Repository: rS FreeBSD src repository - subversion
Lint: Skipped
Unit Tests: Skipped
Build Status: Buildable 35666, Build 32556: arc lint + arc unit

Event Timeline
sys/net/netisr_internal.h, line 50:

Missing typedef void *netisr_handler_m_t; here? Without it, I hit this error during a buildworld:

--- all_subdir_usr.bin/netstat ---
In file included from /usr/src/usr.bin/netstat/netisr.c:44:
/usr/src/amd64.amd64/tmp/usr/include/net/netisr_internal.h:64:2: error: unknown type name 'netisr_handler_m_t'; did you mean 'netisr_handler_t'?
        netisr_handler_m_t *np_handler_m;       /* Protocol handler. */
        ^
/usr/src/amd64.amd64/tmp/usr/include/net/netisr_internal.h:50:15: note: 'netisr_handler_t' declared here
typedef void *netisr_handler_t;
              ^
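The suggested fix, mirrored from the declarations quoted in the error output, would look roughly like this. This is only an illustrative sketch; the authoritative declaration belongs in D27755 itself, and the struct name here is invented for the demo:

```c
#include <stddef.h>

/* The existing declaration, as quoted in the compiler note above. */
typedef void *netisr_handler_t;

/* The reporter's suggested addition: a batch ("_m") counterpart.
 * The real declaration in D27755 may differ; this mirrors the quote. */
typedef void *netisr_handler_m_t;

/* With the typedef present, the struct field from the error compiles.
 * (struct name is hypothetical, for illustration only.) */
struct netisr_proto_demo {
	netisr_handler_m_t *np_handler_m;	/* Protocol handler. */
};
```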
Small device:
x 7f4e724829e (2020/12/27): inet packets-per-second forwarded
+ 7f4e724829e (2020/12/27) with D27755: inet packets-per-second forwarded
[ministat distribution plot omitted; column alignment lost in extraction]
    N        Min        Max     Median        Avg     Stddev
x   5     897815   910126.5     903309   904485.3  4890.7439
+   5     886306     894945     889997   889957.2  3447.4139
Difference at 95.0% confidence
	-14528.1 +/- 6170.78
	-1.60623% +/- 0.674941%
	(Student's t, pooled s = 4231.08)
Big device:
x 7f4e724829e (2020/12/27): inet packets-per-second forwarded
+ 7f4e724829e (2020/12/27) with D27755: inet packets-per-second forwarded
[ministat distribution plot omitted; column alignment lost in extraction]
    N        Min        Max     Median        Avg     Stddev
x   5   33097916   33161594   33139005   33133205  23817.267
+   5   32595089   32676140   32638309   32634321  29675.115
Difference at 95.0% confidence
	-498884 +/- 39241
	-1.50569% +/- 0.117739%
	(Student's t, pooled s = 26906.1)
Thank you for spending cycles on testing it!
Small device:
Is it AMD_GX-412TC_4Cores-Intel_i210AT?
So there was no chance of any improvement here, as all traffic goes through tcp_lro_..(), which passes the packet to if_input() if it is not a local TCP packet and LRO is not enabled. I had missed that code path.
The good thing is that the batching code only costs ~1.5% when it ends up handling individual packets.
I've updated the code for cxgbe/mellanox (though it is compile-tested only). Any chance you could re-test on these?
Intel Xeon E5-2697Av4 (16Cores, 32 threads) with Mellanox ConnectX-4 MCX416A-CCAT (100GBase-SR4):
x 7f4e724829e (2020/12/27): inet packets-per-second forwarded
+ 7f4e724829e (2020/12/27) with D27755(v81242): inet packets-per-second forwarded
[ministat distribution plot omitted; column alignment lost in extraction]
    N        Min        Max     Median        Avg     Stddev
x   5   33128914   33143191   33133842   33134431  5609.5414
+   5   32854209   33032823   32983818   32968514  73310.696
Difference at 95.0% confidence
	-165917 +/- 75824.5
	-0.50074% +/- 0.228832%
	(Student's t, pooled s = 51990)
Interesting. Okay, I guess I need to think about it more.
Thank you for spending time on doing the tests!
Might it be better to just add a new if_input method that takes an array of mbufs? List processing for a batch of packets seems like a recipe for cache misses.
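The array-based if_input idea could look roughly like the sketch below. Everything here is hypothetical: the name if_input_m is invented, and the struct ifnet / struct mbuf stand-ins are toy versions of the real kernel types. The point is only the shape of the API, where the consumer iterates a contiguous array instead of chasing m_nextpkt pointers:

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins; the real struct ifnet / struct mbuf live in the kernel. */
struct mbuf { int len; };
struct ifnet { unsigned long ipackets; unsigned long ibytes; };

/* Hypothetical batch input method: the driver hands the stack a
 * contiguous array of packets, so processing is a sequential scan,
 * which is friendlier to hardware prefetchers than a linked list. */
static void
if_input_m(struct ifnet *ifp, struct mbuf **pkts, int count)
{
	for (int i = 0; i < count; i++) {
		ifp->ipackets++;
		ifp->ibytes += (unsigned long)pkts[i]->len;
	}
}

/* Feed a small batch and return the byte total for inspection. */
static unsigned long
demo_batch_input(void)
{
	struct ifnet ifp = { 0, 0 };
	struct mbuf p0 = { 60 }, p1 = { 1500 }, p2 = { 128 };
	struct mbuf *batch[] = { &p0, &p1, &p2 };

	if_input_m(&ifp, batch, 3);
	assert(ifp.ipackets == 3);
	return (ifp.ibytes);
}
```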
You're most probably right - I need to get a local system where I can actually look into pcm llc data & experiment with various options.
There is already a mechanism drivers can use to build an array of packets, which is used by LRO, e.g. tcp_lro_queue_mbuf(). It was quite helpful, not so much for the batching itself, but because it sorts the array to put packets from the same TCP connection adjacent to each other.
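The effect of that sorting can be sketched with a toy flow key and qsort(3). This is not the actual tcp_lro_queue_mbuf() code (which keys on real connection tuples); it only shows why the technique helps: after sorting, all packets of one connection are adjacent, so per-connection state is touched once per run rather than once per packet:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy packet: only the connection ("flow") it belongs to matters here. */
struct pkt { unsigned int flow_id; };

static int
cmp_flow(const void *a, const void *b)
{
	const struct pkt *pa = a, *pb = b;

	/* Branchless three-way compare, safe against overflow. */
	return ((pa->flow_id > pb->flow_id) - (pa->flow_id < pb->flow_id));
}

/* Sort a queued batch so packets of the same connection become
 * adjacent, mimicking what tcp_lro_queue_mbuf()'s sort achieves. */
static void
sort_batch(struct pkt *q, size_t n)
{
	qsort(q, n, sizeof(q[0]), cmp_flow);
}
```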
Yep, I looked into that one. Arrays are a bit tricky, as we have to split the initial array into IPv4/IPv6 (and later MPLS), preferably keeping these details away from the driver.
It's largely a matter of experimentation: what the best approach to the array implementation is, what a reasonable batch size would be, and so on.
In fact, I'd love to end up with a vectorized approach, similar to Cisco's VPP.
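The per-protocol split described above could look roughly like this. Everything here is a hypothetical sketch, not code from D27755: the names, the BATCH_MAX limit, and the two-way v4/v6 split are all invented for illustration. The design point is that the driver hands over one opaque array, and the protocol classification stays inside the stack:

```c
#include <assert.h>

#define BATCH_MAX 32	/* arbitrary demo limit, not from the patch */

/* Toy stand-ins; real code would classify by the frame's ethertype. */
enum pkt_proto { PROTO_IPV4, PROTO_IPV6 };
struct pkt { enum pkt_proto proto; };

struct proto_batch {
	struct pkt *pkts[BATCH_MAX];
	int count;
};

/* Split one mixed batch from the driver into per-protocol sub-batches,
 * so each protocol handler still sees a contiguous array of its own
 * packets while the driver never learns about the v4/v6 distinction. */
static void
split_batch(struct pkt **in, int n, struct proto_batch *v4,
    struct proto_batch *v6)
{
	v4->count = 0;
	v6->count = 0;
	for (int i = 0; i < n; i++) {
		struct proto_batch *dst =
		    (in[i]->proto == PROTO_IPV4) ? v4 : v6;
		dst->pkts[dst->count++] = in[i];
	}
}
```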