Using an rwlock with multiqueue NICs for IP forwarding at high packet rates produces heavy lock contention and is inefficient. Replacing the rwlock with an rmlock gives pps results that are several times higher. We have been using a similar patch at Yandex since at least FreeBSD 9.x. AFAIK, Netflix has tested it under their workloads and observed no regressions. So I think the patch can be included in FreeBSD 12.0.
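For context, here is a minimal, illustrative sketch (not the actual diff) of the locking pattern the change moves to: rmlock(9) readers record their state in a per-CPU rm_priotracker instead of updating a shared lock word, so per-packet lookups from many forwarding threads stop bouncing one cache line, while route updates pay a somewhat higher cost in rm_wlock(). The names rt_lock, rt_lookup_example and rt_update_example are made up for illustration.

```
#include <sys/param.h>
#include <sys/lock.h>
#include <sys/rmlock.h>

/* Illustrative stand-in for the RIB lock; not the patched code itself. */
static struct rmlock rt_lock;

static void
rt_lock_init(void)
{
	rm_init(&rt_lock, "example rib lock");
}

/* Hot path: per-packet, read-only lookup in the radix trie. */
static void
rt_lookup_example(void)
{
	struct rm_priotracker tracker;

	rm_rlock(&rt_lock, &tracker);
	/* ... radix lookup ... */
	rm_runlock(&rt_lock, &tracker);
}

/* Cold path: route add/change/delete takes the exclusive lock. */
static void
rt_update_example(void)
{
	rm_wlock(&rt_lock);
	/* ... modify the radix trie ... */
	rm_wunlock(&rt_lock);
}
```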
Details
- Reviewers: olivier, melifaro
- Group Reviewers: network, transport
- Commits: rS335250: Switch RIB and RADIX_NODE_HEAD lock from rwlock(9) to rmlock(9).
Diff Detail
- Repository: rS FreeBSD src repository (subversion)
Event Timeline
On a 2-socket, 12-core Xeon E5 2650 with a Mellanox ConnectX-4:
x head r335106: inet4 packets-per-second
+ head r335106 with D15789: inet4 packets-per-second
    N         Min         Max      Median         Avg      Stddev
x   5   2531823.5     3417268   3121023.5   2974341.3   413968.07
+   5    13240135    13257591    13254631    13251573   7260.4177
Difference at 95.0% confidence
        1.02772e+07 +/- 426980
        345.53% +/- 63.9485%
        (Student's t, pooled s = 292765)
On a Xeon E5 2650 with a Chelsio T540-CR:
x head r335106: inet4 packets-per-second
+ head r335106 with D15789: inet4 packets-per-second
    N         Min         Max      Median         Avg      Stddev
x   5     5812926     5942477     5827792   5868257.5     66325.8
+   5    11038688    11436516    11282262    11243417    184820.6
Difference at 95.0% confidence
        5.37516e+06 +/- 202502
        91.5972% +/- 3.94168%
        (Student's t, pooled s = 138848)
On an 8-core Atom C2758 with a Chelsio T540-CR:
x head r335106: inet4 packets-per-second
+ head r335106 with D15789: inet4 packets-per-second
    N         Min         Max      Median         Avg      Stddev
x   5     3679811   3829723.5     3789131   3773040.1   59232.903
+   5     4275233   4502475.5   4422916.5   4384184.3   102410.34
Difference at 95.0% confidence
        611144 +/- 122006
        16.1977% +/- 3.37258%
        (Student's t, pooled s = 83655.3)
On a 4-core AMD GX-412TC with an Intel i210AT:
x head r335106: inet4 packets-per-second
+ head r335106 with D15789: inet4 packets-per-second
    N         Min         Max      Median         Avg      Stddev
x   5      674835      678752      676834    676729.2   1573.2049
+   5      784956      793338      787191    788414.8   3448.5979
Difference at 95.0% confidence
        111686 +/- 3909.03
        16.5037% +/- 0.595148%
        (Student's t, pooled s = 2680.28)
I am totally not surprised by these numbers. However (a) did you do the same test for IPv6? (b) is that a forwarding setup or an end node setup? (c) how many route updates per second did you try on a forwarding node?
It's a forwarding setup, and I'm using only 2 static routes in my benchmarks.
About the inet6 results:
On a 2-socket, 12-core Xeon E5 2650 with a Mellanox ConnectX-4:
x r335106: inet6 packets-per-second
+ r335106 with D15789: inet6 packets-per-second
    N         Min         Max      Median         Avg      Stddev
x   5   3228845.5     3435569   3275485.5   3293546.5   84303.358
+   5    12396604    12421052    12414368    12410845   10398.101
Difference at 95.0% confidence
        9.1173e+06 +/- 87598.7
        276.823% +/- 9.95235%
        (Student's t, pooled s = 60063.2)
And diff inet4 vs inet6 on this platform:
x r335106 with D15789: inet4 packets-per-second
+ r335106 with D15789: inet6 packets-per-second
    N         Min         Max      Median         Avg      Stddev
x   5    13240135    13257591    13254631    13251573   7260.4177
+   5    12396604    12421052    12414368    12410845   10398.101
Difference at 95.0% confidence
        -840728 +/- 13078.7
        -6.34437% +/- 0.0966876%
        (Student's t, pooled s = 8967.56)
On an 8-core Xeon E5 2650 with a Chelsio T540-CR:
x r335106: inet6 packets-per-second
+ r335106 with D15789: inet6 packets-per-second
    N         Min         Max      Median         Avg      Stddev
x   5     5833700     6259507   5968707.5   5995114.2   174280.36
+   5    11190415    11364997    11236384    11260193   69952.229
Difference at 95.0% confidence
        5.26508e+06 +/- 193668
        87.8228% +/- 5.75798%
        (Student's t, pooled s = 132791)
And diff inet4 vs inet6 on this platform (the bottleneck here seems to be the NIC rather than the IP stack?):
x r335106 with D15789: inet4 packets-per-second
+ r335106 with D15789: inet6 packets-per-second
    N         Min         Max      Median         Avg      Stddev
x   5    11038688    11436516    11282262    11243417    184820.6
+   5    11190415    11364997    11236384    11260193   69952.229
No difference proven at 95.0% confidence
On an 8-core Atom C2758 with a Chelsio T540-CR:
x r335106: inet6 packets-per-second
+ r335106 with D15789: inet6 packets-per-second
    N         Min         Max      Median         Avg      Stddev
x   5     3549377   3644218.5   3610977.5   3601811.4   37260.982
+   5     3962538   4047545.5     3995556   4003879.1   33482.287
Difference at 95.0% confidence
        402068 +/- 51661
        11.1629% +/- 1.52497%
        (Student's t, pooled s = 35422.1)
On a 4-core AMD GX-412TC with an Intel i210AT:
x r335106: inet6 packets-per-second
+ r335106 with D15789: inet6 packets-per-second
    N         Min         Max      Median         Avg      Stddev
x   5      614522      624910      619461    619776.4   4432.2748
+   5      721146      728208      724108    724908.6   2904.8339
Difference at 95.0% confidence
        105132 +/- 5465.09
        16.9629% +/- 0.988798%
        (Student's t, pooled s = 3747.21)
And diff inet4 vs inet6 on this platform:
x r335106 with D15789: inet4 packets-per-second
+ r335106 with D15789: inet6 packets-per-second
    N         Min         Max      Median         Avg      Stddev
x   5      784956      793338      787191    788414.8   3448.5979
+   5      721146      728208      724108    724908.6   2904.8339
Difference at 95.0% confidence
        -63506.2 +/- 4649.99
        -8.05492% +/- 0.562488%
        (Student's t, pooled s = 3188.33)
Right; I wonder if you could add about 500k routes for IPv4 and about (no idea, 50k? let's think ahead) for IPv6, and then do about 25 route updates per second randomly across the address space; that would be an amazingly interesting test case (especially if you can provide the framework for it somewhere). To seed the initial table one could get an MRT dump from, say, https://www.ripe.net/analyse/internet-measurements/routing-information-service-ris/ris-raw-data or one of the others at https://bgpstream.caida.org/data . I am not saying you have to do this, nor that it should prevent this change from going in; it's just one of the things I have thought about for years in relation to similar changes, as a "router test bed scenario".
I think the number of routes does not matter: the problem is contention, not lookup time. I take it you want to see the cost of rm_wlock() compared to rw_wlock()?
The test scenario could be the following: add several routes with static ARP/NDP entries, then change the default route between them in a loop with some delay (see the sketch below).
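A minimal userland sketch of such a loop, assuming two next hops 192.0.2.1 and 192.0.2.2 (placeholder addresses) that already have static ARP/NDP entries and an existing default route; it simply flaps the default route through route(8) while the forwarding benchmark runs:

```
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(void)
{
	/* Placeholder gateways; both need static ARP/NDP entries. */
	const char *gw[] = { "192.0.2.1", "192.0.2.2" };
	char cmd[128];
	unsigned int i;

	for (i = 0;; i++) {
		snprintf(cmd, sizeof(cmd),
		    "route -q change default %s", gw[i % 2]);
		if (system(cmd) != 0)
			fprintf(stderr, "route change failed\n");
		usleep(100000);	/* ~10 updates/s; adjust the delay as needed */
	}
	/* not reached */
	return (0);
}
```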