Diff Detail
Repository: rS FreeBSD src repository - subversion
Lint: Skipped
Unit Tests: Skipped
Build Status: Buildable 35666, Build 32556: arc lint + arc unit

Event Timeline
sys/net/netisr_internal.h, line 50:

Missing typedef void *netisr_handler_m_t; here? Without it, I hit this error during a buildworld:

--- all_subdir_usr.bin/netstat ---
In file included from /usr/src/usr.bin/netstat/netisr.c:44:
/usr/src/amd64.amd64/tmp/usr/include/net/netisr_internal.h:64:2: error: unknown type name 'netisr_handler_m_t'; did you mean 'netisr_handler_t'?
        netisr_handler_m_t *np_handler_m;       /* Protocol handler. */
        ^
/usr/src/amd64.amd64/tmp/usr/include/net/netisr_internal.h:50:15: note: 'netisr_handler_t' declared here
typedef void *netisr_handler_t;
              ^
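The suggested fix, mirrored from the declarations quoted in the error output, would look roughly like this. This is only an illustrative sketch; the authoritative declaration belongs in D27755 itself, and the struct name here is invented for the demo:

```c
#include <stddef.h>

/* The existing declaration, as quoted in the compiler note above. */
typedef void *netisr_handler_t;

/* The reporter's suggested addition: a batch ("_m") counterpart.
 * The real declaration in D27755 may differ; this mirrors the quote. */
typedef void *netisr_handler_m_t;

/* With the typedef present, the struct field from the error compiles.
 * (struct name is hypothetical, for illustration only.) */
struct netisr_proto_demo {
	netisr_handler_m_t *np_handler_m;	/* Protocol handler. */
};
```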
Small device:
x 7f4e724829e (2020/12/27): inet packets-per-second forwarded
+ 7f4e724829e (2020/12/27) with D27755: inet packets-per-second forwarded
[ministat distribution plot omitted; column alignment lost in extraction]
    N        Min        Max     Median        Avg     Stddev
x   5     897815   910126.5     903309   904485.3  4890.7439
+   5     886306     894945     889997   889957.2  3447.4139
Difference at 95.0% confidence
	-14528.1 +/- 6170.78
	-1.60623% +/- 0.674941%
	(Student's t, pooled s = 4231.08)
Big device:
x 7f4e724829e (2020/12/27): inet packets-per-second forwarded
+ 7f4e724829e (2020/12/27) with D27755: inet packets-per-second forwarded
[ministat distribution plot omitted; column alignment lost in extraction]
    N        Min        Max     Median        Avg     Stddev
x   5   33097916   33161594   33139005   33133205  23817.267
+   5   32595089   32676140   32638309   32634321  29675.115
Difference at 95.0% confidence
	-498884 +/- 39241
	-1.50569% +/- 0.117739%
	(Student's t, pooled s = 26906.1)
Thank you for spending cycles on testing it!
Small device:
Is it AMD_GX-412TC_4Cores-Intel_i210AT?
So there was no chance of any improvement here, as all traffic goes through tcp_lro_..(), which passes the packet to if_input() if it is not a local TCP packet and LRO is not enabled. I had missed that code path.
The good thing is that the batching code only costs ~1.5% when it ends up handling individual packets.
I've updated the code for cxgbe/mellanox (though it is compile-tested only). Any chance you could re-test on these?
Intel Xeon E5-2697Av4 (16Cores, 32 threads) with Mellanox ConnectX-4 MCX416A-CCAT (100GBase-SR4):
x 7f4e724829e (2020/12/27): inet packets-per-second forwarded
+ 7f4e724829e (2020/12/27) with D27755(v81242): inet packets-per-second forwarded
[ministat distribution plot omitted; column alignment lost in extraction]
    N        Min        Max     Median        Avg     Stddev
x   5   33128914   33143191   33133842   33134431  5609.5414
+   5   32854209   33032823   32983818   32968514  73310.696
Difference at 95.0% confidence
	-165917 +/- 75824.5
	-0.50074% +/- 0.228832%
	(Student's t, pooled s = 51990)
Interesting. Okay, I guess I need to think about it more.
Thank you for spending time on doing the tests!
Might it be better to just add a new if_input method that takes an array of mbufs? List processing for a batch of packets seems like a recipe for cache misses.
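The array-based if_input idea could look roughly like the sketch below. Everything here is hypothetical: the name if_input_m is invented, and the struct ifnet / struct mbuf stand-ins are toy versions of the real kernel types. The point is only the shape of the API, where the consumer iterates a contiguous array instead of chasing m_nextpkt pointers:

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins; the real struct ifnet / struct mbuf live in the kernel. */
struct mbuf { int len; };
struct ifnet { unsigned long ipackets; unsigned long ibytes; };

/* Hypothetical batch input method: the driver hands the stack a
 * contiguous array of packets, so processing is a sequential scan,
 * which is friendlier to hardware prefetchers than a linked list. */
static void
if_input_m(struct ifnet *ifp, struct mbuf **pkts, int count)
{
	for (int i = 0; i < count; i++) {
		ifp->ipackets++;
		ifp->ibytes += (unsigned long)pkts[i]->len;
	}
}

/* Feed a small batch and return the byte total for inspection. */
static unsigned long
demo_batch_input(void)
{
	struct ifnet ifp = { 0, 0 };
	struct mbuf p0 = { 60 }, p1 = { 1500 }, p2 = { 128 };
	struct mbuf *batch[] = { &p0, &p1, &p2 };

	if_input_m(&ifp, batch, 3);
	assert(ifp.ipackets == 3);
	return (ifp.ibytes);
}
```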
You're most probably right - I need to get a local system where I can actually look into pcm llc data & experiment with various options.
There is already a mechanism drivers can use to build an array of packets, which is used by LRO, e.g. tcp_lro_queue_mbuf(). It was quite helpful, not so much for the batching itself, but because it sorts the array to put packets from the same TCP connection adjacent to each other.
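The effect of that sorting can be sketched with a toy flow key and qsort(3). This is not the actual tcp_lro_queue_mbuf() code (which keys on real connection tuples); it only shows why the technique helps: after sorting, all packets of one connection are adjacent, so per-connection state is touched once per run rather than once per packet:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy packet: only the connection ("flow") it belongs to matters here. */
struct pkt { unsigned int flow_id; };

static int
cmp_flow(const void *a, const void *b)
{
	const struct pkt *pa = a, *pb = b;

	/* Branchless three-way compare, safe against overflow. */
	return ((pa->flow_id > pb->flow_id) - (pa->flow_id < pb->flow_id));
}

/* Sort a queued batch so packets of the same connection become
 * adjacent, mimicking what tcp_lro_queue_mbuf()'s sort achieves. */
static void
sort_batch(struct pkt *q, size_t n)
{
	qsort(q, n, sizeof(q[0]), cmp_flow);
}
```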
Yep, I looked into that one. Arrays are a bit tricky, as we have to split the initial array into IPv4/IPv6 (and later MPLS), preferably keeping these details away from the driver.
It's largely a matter of experimentation: what the best approach to the array implementation is, what a reasonable batch size would be, and so on.
In fact, I'd love to end up with a vectorized approach, similar to Cisco's VPP.
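The per-protocol split described above could look roughly like this. Everything here is a hypothetical sketch, not code from D27755: the names, the BATCH_MAX limit, and the two-way v4/v6 split are all invented for illustration. The design point is that the driver hands over one opaque array, and the protocol classification stays inside the stack:

```c
#include <assert.h>

#define BATCH_MAX 32	/* arbitrary demo limit, not from the patch */

/* Toy stand-ins; real code would classify by the frame's ethertype. */
enum pkt_proto { PROTO_IPV4, PROTO_IPV6 };
struct pkt { enum pkt_proto proto; };

struct proto_batch {
	struct pkt *pkts[BATCH_MAX];
	int count;
};

/* Split one mixed batch from the driver into per-protocol sub-batches,
 * so each protocol handler still sees a contiguous array of its own
 * packets while the driver never learns about the v4/v6 distinction. */
static void
split_batch(struct pkt **in, int n, struct proto_batch *v4,
    struct proto_batch *v6)
{
	v4->count = 0;
	v6->count = 0;
	for (int i = 0; i < n; i++) {
		struct proto_batch *dst =
		    (in[i]->proto == PROTO_IPV4) ? v4 : v6;
		dst->pkts[dst->count++] = in[i];
	}
}
```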