Page MenuHomeFreeBSD

Proof-of-concept hack: remove mp_ring from TX path
Needs ReviewPublic

Authored by shurd on Jul 30 2019, 8:22 PM.

Details

Reviewers
gallatin
olivier
marius
Group Reviewers
iflib
Summary

This is a quick hack that removes the mp_ring stuff from the
tx path for non-netmap only. Basically, it leaves the mp ring stuff
there, but bypasses it and calls the drain function directly.

This allows "easier" A/B testing.

To enable the hack, set dev.X.Y.iflib.bypass_mpring to non-zero in
loader.conf (cannot be changed after boot).

It would be nice to get other people to test this before the iflib
call.

THIS IS NEVER GOING TO BE COMMITTED IN THIS FORM!

If this pans out, pretty much all the txq stuff will be ripped out
and rewritten.

Test Plan

Test this, if it's "good", rewrite it all.

Diff Detail

Repository
rS FreeBSD src repository
Lint
Lint OK
Unit
No Unit Test Coverage
Build Status
Buildable 26406
Build 24863: arc lint + arc unit

Event Timeline

shurd created this revision.Jul 30 2019, 8:22 PM

Results on PC Engines APU2 (notice that I had to set back iflib.tx_abdicate=0, because if=1, all packets are dropped in another dimension):

x inet4 forwarding of small packet size, bypass_mpring=0 in packets-per-second
+ inet4 forwarding of small packet size, bypass_mpring=1 in packets-per-second
+--------------------------------------------------------------------------+
|+         +         + +  +                                    x    x x  xx|
|                                                                |____A___||
|     |_________A____M_____|                                               |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5        462506        464435        463851      463697.4     759.63235
+   5        451908        456286        455316        454538     1785.9727
Difference at 95.0% confidence
        -9159.4 +/- 2001.51
        -1.9753% +/- 0.430346%
        (Student's t, pooled s = 1372.36)

This mean a small reduction of performance.

On another hardware (Atom 4cores):

x inet4 forwarding of small packet size, bypass_mpring=0 in packets-per-second
+ inet4 forwarding of small packet size, bypass_mpring=1 in packets-per-second
+--------------------------------------------------------------------------+
|+                              +  ++  +                x           x x x  |
|                                                          |______A__M____||
|            |_______________A_____M________|                              |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   4      927858.5        983285        971934     963752.88      24606.11
+   5        749042        872241        859294      839484.8     51106.313
Difference at 95.0% confidence
        -124268 +/- 66405
        -12.8942% +/- 6.76603%
        (Student's t, pooled s = 41856.6)

Bigger impact her: -12%

On another hardware (Atom 4cores):

x inet4 forwarding of small packet size, bypass_mpring=0 in packets-per-second
+ inet4 forwarding of small packet size, bypass_mpring=1 in packets-per-second
+--------------------------------------------------------------------------+
|+                              +  ++  +                x           x x x  |
|                                                          |______A__M____||
|            |_______________A_____M________|                              |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   4      927858.5        983285        971934     963752.88      24606.11
+   5        749042        872241        859294      839484.8     51106.313
Difference at 95.0% confidence
        -124268 +/- 66405
        -12.8942% +/- 6.76603%
        (Student's t, pooled s = 41856.6)

Bigger impact her: -12%

This is a big enough regression that profiling would be helpful. Would it be possible to get hwpmc output (or flamegraphs) while the test is running? Perhaps also 'lockstat -s 5 -x aggsize=4m sleep 10 > output' from each test. Thanks!

Is it possible you have enabled/disabled reversed?

I'm seeing lock_delay() consuming a bit of time, which I'd expected for "disabled", but I'm seeing it in the "enabled" svg.. That is confusing to me.

Drew

Yes you've right! The correct label should be "bypass_mpring" disabled/enabled... which revert the logic.

shurd added a comment.Aug 29 2019, 6:32 PM

Wait... why is iflib_if_transmit() being called in both? It should be using iflib_if_transmit_bypass() instead.

shurd updated this revision to Diff 61947.Sep 11 2019, 7:44 PM

Use a tunable instead of a sysctl to toggle mp_ring bypass.

olivier added a comment.EditedSep 12 2019, 10:22 PM

Following the code with TUNABLE_INT_FETCH("dev.%s.%d.iflib.bypass_mpring"), I've added these line in the /boot/loader.conf.local:

dev.igb.0.iflib.bypass_mpring=1
dev.igb.1.iflib.bypass_mpring=1
dev.igb.2.iflib.bypass_mpring=1
dev.igb.3.iflib.bypass_mpring=1
dev.igb.4.iflib.bypass_mpring=1
dev.igb.5.iflib.bypass_mpring=1

New results with this new version on atom C2558 (4 cores).

x bypass_mpring=0 (default): inet4 packets-per-second forwarded
+ bypass_mpring=1: inet4 packets-per-second forwarded
+--------------------------------------------------------------------------+
|x         x               x          xx                 +   +  + +       +|
|     |________________A___M____________|                                  |
|                                                         |_____A______|   |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5      884557.5        909555        902052      899342.3      11144.15
+   5        921463        932932        926300      926564.8     4308.0002
Difference at 95.0% confidence
        27222.5 +/- 12321.5
        3.02693% +/- 1.40621%
        (Student's t, pooled s = 8448.4)

We notice a very small improvement of 3% with IPv4 and no difference with IPv6... So I'm not sure there are some difference:

x bypass_mpring=0 (default): inet6 packets-per-second forwarded
+ bypass_mpring=1:  inet6 packets-per-second forwarded
+--------------------------------------------------------------------------+
|     ++               +  x     +   x x      x  x                         +|
|                             |_______MA_______|                           |
||_____________________M_____A__________________________|                  |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5        865575        879867        872919      873596.2     5553.5991
+   5        852499        896702        863709      867059.5     18021.085
No difference proven at 95.0% confidence

Flamegraphs, with reference to iflib_if_transmit_bypass() when bypass_mpring is enabled:

shurd added a comment.Sep 16 2019, 5:53 PM

For the bypass disabled results, is tx_abdicate set?

tx_abdicate was not set (default value of 0) for these benches.
If I'm setting tx_abdicate, here are the new results:

x bypass_mpring=0 (default) and tx_abdicate=1: inet4 packets-per-second forwarded
+ bypass_mpring=1 and tx_abdicate=1: inet4 packets-per-second forwarded
+--------------------------------------------------------------------------+
|         +                                                                |
|         +                                                                |
|+        ++                                               x x xx         x|
|                                                         |____MA_____|    |
|   |___A_M__|                                                             |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5       1060792       1099642       1069370     1073776.1     15108.116
+   5      906915.5        934633        930094        926433     11078.667
Difference at 95.0% confidence
        -147343 +/- 19320.7
        -13.722% +/- 1.64298%
        (Student's t, pooled s = 13247.5)

Wit tx_abdicate is enabled there is about -13% of performance reduction once bypass_mpring is enabled.

Flamegraphs:

Flamegraphs:

The thing that is interesting from the bypass flame is that we spend a fairly large amount of time (16%) actually in iflib_txq_drain. We also spend 4.4% in the mtx_lock_sleep() call. I wonder if some of that 16% is spent spinning on a lock?

Would it be possible to capture lockstat -s 5 sleep 10 output from this run? Add -x aggsize=4m as the first argument if you see aggregation drops.

I wonder if we're contending for the lock because of N:1 traffic fighting over the ring, or if we're contending with the tx task cleaning descriptors, or something else.

shurd added a comment.Sep 17 2019, 6:01 PM

I assume the two piles in the bypass flamegraphs which don't start at fork_exit() are just because the stack sample isn't deep enough to correlate them over and that the iflib_txq_drain() samples in the wide peak are actually the ones that are directly in iflib_txq_drain?

Since iflib_txq_drain() only ever gets one item in bypass, I guess I should write an uncrappy bypass version of it now... which should clear up a bunch of that.

Uncrappy is nice. But if we've got lock contention, it would be nice to know why.

shurd updated this revision to Diff 62235.Sep 17 2019, 9:28 PM

Create a iflib_txq_drain_bypass() function called only from
iflib_if_transmit_bypass() which doesn't have the loop,
volatile/devolatile casts, conditional prefetching, txq-cast-to-mbuf,
etc overhead and only locks/unlocks once.

This should help provide a clearer picture of mp_ring costs.

So new result with new patch's version:

x bypass_mpring=0 (default) and tx_abdicate=1: inet4 packets-per-second forwarded
+ bypass_mpring=1 and tx_abdicate=1: inet4 packets-per-second forwarded
+--------------------------------------------------------------------------+
|+        ++        +  +                            x            x  x   x  |
|                                                         |_______A_M_____||
|   |______M_A________|                                                    |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5       1043042       1084017       1074558     1070992.4     16759.158
+   5      938950.5        984677        959011      963327.5     17971.342
Difference at 95.0% confidence
        -107665 +/- 25341.6
        -10.0528% +/- 2.25866%
        (Student's t, pooled s = 17375.8)

Flamegraphs:

Lockstat (had to reduce generator packet's rate to about 800Kpps to be able to use it):

shurd added a comment.Sep 20 2019, 5:42 AM

So two things jump out at me here... first, a quick look suggests _task_fn_tx() -> iflib_txq_drain() is recursing on the mutex, and second the interrupt handler is contending with tx for no good reason.

I think that as a hack, just removing lines 3875 and 3876 from iflib.c would give a close approximation of what doing it properly would look like if both issues were fixed.

After removing those 2 lines:

3875		if (txq->ift_in_use)
3876			txq->ift_br->drain(txq->ift_br, NULL, 0, 0);

The result is now different (7% improvement):

x bypass_mpring=0 (default) and tx_abdicate=1: inet4 packets-per-second forwarded
+ bypass_mpring=1 and tx_abdicate=1: inet4 packets-per-second forwarded
+--------------------------------------------------------------------------+
|x    x  x          x         x                           +    ++      +  +|
||_______M___A___________|                                                 |
|                                                          |____M_A_____|  |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5     1047806.5       1091725       1059590     1066031.6     17660.615
+   5       1132410       1156922       1141373     1144724.6     9694.1874
Difference at 95.0% confidence
        78693 +/- 20776.4
        7.38186% +/- 2.0604%
        (Student's t, pooled s = 14245.6)

Flamegraphs:

shurd added a comment.Sep 20 2019, 9:18 PM

Excellent news, and it looks like there's still some performance left on the table there. I guess the question now is if I consider the theory that getting rid of the mp ring improves performance proven and start gutting it, or if I finish separating the TX path from the reclaim path and clean this hack up more so further testing can be done.

In my tests of UDP transmit using N copies of netperf on a 14c/28t Xeon E5-2697 v3 (haswell) using a 40Gbe ixl, I see nearly a 100% performance improvement (eg, double the packet rate) as compared to using mp_ring. In fact, for a single-threaded test the packet rate is 4x what it is for mp_ring with tx-abdicate enabled.

This brings packet rate for iflib in-line with Mellanox mxl5.

num	bypass	abd	abd	mce
netperf		on	off
=====	=====	====	====	=====
1	1.6	0.42	1.11	1.6	
2	3	1.09	2.05	2.8	
4	4.8	1.5	3.58	4.71	
8	9.57	3.08	6.48	9.45	
16	16.6	5.13	7.14	16.44	
32	19.2	9.49	8.49	17.85	
64	18.29	11.18	9.45	18.33	
128	17.43	10.89	9.09	18.21

Exact netperf command line (repeated num-netperf times) was: -H192.168.2.1 -s 2 -tUDP_STREAM -l 40 -- -m 1 -n -N -R 0