
Use byte-counting rather than packet counting for TX batch size
Needs ReviewPublic

Authored by shurd on Dec 12 2018, 7:09 PM.

Details

Summary

Previously, TX_BATCH_SIZE was the max number of packets that
would be delivered in a single call to iflib_txq_drain(). This change
uses TX_BATCH_SIZE * MTU as the number of bytes to target. This is
limited to a max of isc_txrx_budget_bytes_max, a value which appears
to have been previously unused, and defaults to 2 * 1024 * 1024.

In theory, this should perform better with small packet forwarding than
tx_abdicate=1 currently provides.

TX_BATCH_SIZE should also likely be a per-interface tunable, possibly
with a global default tunable as well... but currently it's used as
a number of mp_ring entries in drain_ring_locked() and a source for a
number of bytes in iflib_txq_drain(). These two should likely be split
up, maybe with the number of bytes being the only tunable.

It may be useful to tie the default byte budget to the link speed too.

Test Plan

Get lev and olivier to run it through the wringer

Diff Detail

Repository
rS FreeBSD src repository

Event Timeline

shurd created this revision.Dec 12 2018, 7:09 PM
lev added a comment.Dec 13 2018, 7:13 PM

I can say that with this patch and *without* tx_abdicate the results are:

  1. Without IPsec, both bandwidth (kb/s) and throughput (pps) are no worse than without the patch. It is hard to say whether they are better, since they are near the limit of what my test rig can generate anyway.
  2. With IPsec, it is slightly better in both bandwidth and throughput, in both directions.

I've tested r341987.

lev added a comment.Dec 13 2018, 8:02 PM

With this patch and *with* tx_abdicate the results are mixed.

  1. Without IPsec, both bandwidth (kb/s) and throughput (pps) are no worse than without the patch, as in the previous case.
  2. With IPsec, tx_abdicate adds a lot (~15%) to output bandwidth and ~18% to output throughput (with small packets!), while the input numbers vary from test to test. Input bandwidth and throughput are sometimes the same as with this patch and without tx_abdicate, and sometimes only half of that. One iteration of the test shows 313MiB/s and the next iteration 86MiB/s. But the best results are the same as without tx_abdicate.

So, formally, the measured effect is positive for the combination of this patch + tx_abdicate, but the benchmark shows very weird behavior of the system under load when tx_abdicate is enabled.

Here is an example of these spikes (anti-spikes, rather) with this patch + IPsec + tx_abdicate:

Benchmark tool using equilibrium throughput method
- Benchmark mode: Bandwidth (bps) for VPN gateway
- UDP load = 500B, IPv4 packet size=528B, Ethernet frame size=542B
- Link rate = 1000 Mb/s
- Tolerance = 0.01
Iteration 1
  - Offering load = 500 Mb/s
  - Step = 250 Mb/s
  - Measured forwarding rate = 326 Mb/s
Iteration 2
  - Offering load = 250 Mb/s
  - Step = 250 Mb/s
  - Trend = decreasing
  - Measured forwarding rate = 250 Mb/s
Iteration 3
  - Offering load = 375 Mb/s
  - Step = 125 Mb/s
  - Trend = increasing
  - Measured forwarding rate = 84 Mb/s
Iteration 4
  - Offering load = 313 Mb/s
  - Step = 62 Mb/s
  - Trend = decreasing
  - Measured forwarding rate = 313 Mb/s
Iteration 5
  - Offering load = 344 Mb/s
  - Step = 31 Mb/s
  - Trend = increasing
  - Measured forwarding rate = 269 Mb/s
Iteration 6
  - Offering load = 329 Mb/s
  - Step = 15 Mb/s
  - Trend = decreasing
  - Measured forwarding rate = 329 Mb/s
Iteration 7
  - Offering load = 336 Mb/s
  - Step = 7 Mb/s
  - Trend = increasing
  - Measured forwarding rate = 336 Mb/s
Estimated Equilibrium Ethernet throughput= 336 Mb/s (maximum value seen: 336 Mb/s)

and

Benchmark tool using equilibrium throughput method
- Benchmark mode: Throughput (pps) for Router
- UDP load = 18B, IPv4 packet size=46B, Ethernet frame size=60B
- Link rate = 1488 Kpps
- Tolerance = 0.01
Iteration 1
  - Offering load = 744 Kpps
  - Step = 372 Kpps
  - Measured forwarding rate = 149 Kpps
  - Forwarded rate too low, forcing OLOAD=FWRATE and STEP=FWRATE/2
Iteration 2
  - Offering load = 149 Kpps
  - Step = 74 Kpps
  - Trend = decreasing
  - Measured forwarding rate = 147 Kpps
Iteration 3
  - Offering load = 186 Kpps
  - Step = 37 Kpps
  - Trend = increasing
  - Measured forwarding rate = 79 Kpps
Iteration 4
  - Offering load = 168 Kpps
  - Step = 18 Kpps
  - Trend = decreasing
  - Measured forwarding rate = 148 Kpps
Iteration 5
  - Offering load = 159 Kpps
  - Step = 9 Kpps
  - Trend = decreasing
  - Measured forwarding rate = 148 Kpps
Iteration 6
  - Offering load = 155 Kpps
  - Step = 4 Kpps
  - Trend = decreasing
  - Measured forwarding rate = 79 Kpps
Iteration 7
  - Offering load = 153 Kpps
  - Step = 2 Kpps
  - Trend = decreasing
  - Measured forwarding rate = 148 Kpps
Estimated Equilibrium Ethernet throughput= 148 Kpps (maximum value seen: 149 Kpps)

lev added a comment.Dec 13 2018, 8:04 PM

These performance drops with tx_abdicate, which are almost 2x, look like an RSS failure?..

Here are my DoS benchmark results on my two smallest machines.
The first series compares the benefit of D18532, so with tx_abdicate disabled (default behaviour):

PC Engines APU2 (AMD GX-412TC 1GHz, 4-core, Intel I210 NIC):

x r342020: inet4 packets-per-second forwarded
+ r342020 with D18532: inet4 pps forwarded
+--------------------------------------------------------------------------+
|                   +                                                      |
|++           +     +                                         x  x   xx   x|
|                                                              |____AM___| |
| |_________A_M______|                                                     |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5        491465        494353        493274      492919.8     1120.8518
+   5      477373.5        481892        480485      479856.7     2191.9926
Difference at 95.0% confidence
        -13063.1 +/- 2538.93
        -2.65015% +/- 0.512281%
        (Student's t, pooled s = 1740.85)

We notice a little regression here.

Netgate RCC-VE (Atom C2558 2.40GHz 4-core, Intel I354 NIC):

x r342020: inet4 packets-per-second forwarded
+ r342020 with D18532: inet4 pps forwarded
+--------------------------------------------------------------------------+
|                                                               +          |
|x                                                      x  x xx + ++     + |
|                    |__________________________A__________M______________||
|                                                              |__MA___|   |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5        767461        935791        926024      895896.6     72122.243
+   5      939382.5        966167        944748      947896.7     11069.556
No difference proven at 95.0% confidence

There is statistically no difference on this device, but the distribution seems better.

  • Now let's compare a head with tx_abdicate enabled against a head+D18532 without tx_abdicate.

The purpose is to validate the statement:

In theory, this should perform better with small packet forwarding than
tx_abdicate=1 currently provides.

On a PC Engines APU2:

x r342020 and tx_abdicate enabled: inet4 pps
+ r342020 with D18532 and tx_abdicate disabled (default): inet4 pps
+--------------------------------------------------------------------------+
| +                                                                        |
|++                                                                    x   |
|++                                                                    xxxx|
|                                                                      |A_||
|AM                                                                        |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5        841458      859032.5        849186      849301.7     7923.3758
+   5      477373.5        481892        480485      479856.7     2191.9926
Difference at 95.0% confidence
        -369445 +/- 8478.1
        -43.4999% +/- 0.605256%
        (Student's t, pooled s = 5813.12)

On a Netgate RCC-VE:

x r342020 and tx_abdicate enabled: inet4 pps
+ r342020 with D18532 and tx_abdicate disabled (default): inet4 pps
+--------------------------------------------------------------------------+
| +                                                                        |
| +  + +       +                                             x   x     xx x|
|                                                              |_____A_M__||
||___MA____|                                                               |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5       1064878       1092530       1087699     1081767.9     11669.501
+   5      939382.5        966167        944748      947896.7     11069.556
Difference at 95.0% confidence
        -133871 +/- 16587.6
        -12.3752% +/- 1.43662%
        (Student's t, pooled s = 11373.5)

tx_abdicate still brings a major gain over D18532.
Theory invalidated?

gallatin added inline comments.Dec 14 2018, 1:26 AM
sys/net/iflib.c
3448

I'd rather not add a function call and a possible cache miss to get the MTU on every drain. I think part of the reason that abdicate is beneficial is that the costs of the drain are amortized, so we don't want to make it more expensive.

Maybe we could pre-calculate this value at every mtu update, and avoid the min() and potential cache miss to peek into the txrx_budget?

(these comments on misses are speculative, i have not analyzed it carefully)

shurd marked an inline comment as done.Dec 14 2018, 6:01 AM

tx_abdicate still brings a major gain over D18532.
Theory invalidated?

It looks like either the batch size isn't the major factor at play, or the batch size is still too small. I'm curious what happens if we try to fill the tx queue on every drain, rather than putting a limit on it.

sys/net/iflib.c
3448

Yeah, it was something I was planning on doing if this helped on the weaker hardware, but it seems that just increasing the number of packets in a drain may not be enough.

shurd updated this revision to Diff 51986.Dec 14 2018, 6:03 AM

Remove the drain limit completely and instead try to fill the TX queue on each drain.

I like this much better; hopefully the performance tests will show that it is an improvement.

Comparing DoS benchmark results with the previous Diff version applied against this one didn't show a difference.

PC Engines APU2 (just a little improvement):

x r342020 with D18532(old: diff51924) and tx_abdicate disabled (default): inet4 pps
+ r342020 with D18532(new: diff51986) and tx_abdicate disabled (default): inet4 pps
+--------------------------------------------------------------------------+
|                            x                                       +     |
|x x                x        x                                      ++  + +|
|  |_____________A__M_________|                                            |
|                                                                   |MA__| |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5      477373.5        481892        480485      479856.7     2191.9926
+   5        488056        489057        488332      488480.5     393.60164
Difference at 95.0% confidence
        8623.8 +/- 2296.7
        1.79716% +/- 0.486957%
        (Student's t, pooled s = 1574.76)

On the netgate (nothing here):

x r342020 with D18532(old: diff51924) and tx_abdicate disabled (default): inet4 pps
+ r342020 with D18532(new: diff51986) and tx_abdicate disabled (default): inet4 pps
+--------------------------------------------------------------------------+
|   +   *x+           x           x+                 +                    x|
| |___________________M______A__________________________|                  |
||________M___________A____________________|                               |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5      939382.5        966167        944748      947896.7     11069.556
+   5      937494.5        957404        940105      944903.7     8576.8867
No difference proven at 95.0% confidence

Other platform (supermicro 5018A-FTN4, Atom C2758 8-cores at 2.4Ghz with intel 82599ES 10-Gigabit):

x r342020 with D18532(old: diff51924) and tx_abdicate disabled (default): inet4 pps
+ r342020 with D18532(new: diff51986) and tx_abdicate disabled (default): inet4 pps
+--------------------------------------------------------------------------+
| x             xx                      +    +          *               + +|
||______________M____A____________________|                                |
|                                         |_____________MA_______________| |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5       3179260       3260787       3200772     3208284.5     30758.531
+   5       3235848       3287655       3260728     3262531.6     23384.011
Difference at 95.0% confidence
        54247.1 +/- 39846.4
        1.69084% +/- 1.25533%
        (Student's t, pooled s = 27321.2)
shurd updated this revision to Diff 52205.Dec 20 2018, 7:59 PM
shurd marked an inline comment as done.

In addition to trying to keep the TXQ full, use an mp_ring size that's half the
number of descriptors. Previously, the mp_ring was a fixed size which happened
to be twice the default size of the txq for my em devices.

Hi,
I've only got igb NICs, not em, so I can't use my lab to bench this new review version.

shurd added a comment.Mon, Jan 7, 8:42 PM

Hi,
I've only got igb NICs, not em, so I can't use my lab to bench this new review version.

It should have the same effect (or lack thereof) for igb as well, they both seem to use the same defaults.

Ok, so let's try again with this latest version (I'm calling this one D18532v3):

PC Engines APU, comparing a generic head tuned with tx_abdicate enabled against this revision without tuning (tx_abdicate disabled):

x r342020 and tx_abdicate enabled: inet4 pps
+ r342020 with D18532v3 and tx_abdicate disabled (default): inet4 pps
+--------------------------------------------------------------------------+
|+                                                                         |
|+                                                                         |
|+                                                                         |
|++                                                                   xxxxx|
|                                                                     |_A_||
|A|                                                                        |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5        841458      859032.5        849186      849301.7     7923.3758
+   5        493849        497916      494500.5      495086.6     1675.0342
Difference at 95.0% confidence
        -354215 +/- 8351.77
        -41.7066% +/- 0.596585%
        (Student's t, pooled s = 5726.5)

> Enabling tx_abdicate is still a very big win for this use case.

PC Engines APU, comparing against head with tx_abdicate disabled on both, to compare the same config set:

x r342020 and tx_abdicate disabled (default): inet4 pps
+ r342020 with D18532v3 and tx_abdicate disabled (default): inet4 pps
+--------------------------------------------------------------------------+
|x       x           xx     ++    x+        +                             +|
|    |___________A___M________|                                            |
|                      |___________M______A__________________|             |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5        491465        494353        493274      492919.8     1120.8518
+   5        493849        497916      494500.5      495086.6     1675.0342
Difference at 95.0% confidence
        2166.8 +/- 2078.48
        0.439585% +/- 0.422242%
        (Student's t, pooled s = 1425.14)

> We notice a little improvement

Now on the Netgate RCC:

x r342020 and tx_abdicate enabled: inet4 pps
+ r342020 with D18532v3 and tx_abdicate disabled (default): inet4 pps
+--------------------------------------------------------------------------+
|+   +  +    + +                                             x   x     xx x|
|                                                               |____A_M__||
|  |____MA____|                                                            |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5       1064878       1092530       1087699     1081767.9     11669.501
+   5        931142        962720        947798      947969.4     12699.818
Difference at 95.0% confidence
        -133798 +/- 17786.5
        -12.3685% +/- 1.55441%
        (Student's t, pooled s = 12195.5)

> Using tx_abdicate for this use case is still a big win.

Comparing same config set now:

x r342020 and tx_abdicate disabled (default): inet4 pps
+ r342020 with D18532v3 and tx_abdicate disabled (default): inet4 pps
+--------------------------------------------------------------------------+
|x                                                      x  x *x +  +  + +  |
|                    |__________________________A__________M______________||
|                                                             |____A___|   |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5        767461        935791        926024      895896.6     72122.243
+   5        931142        962720        947798      947969.4     12699.818
No difference proven at 95.0% confidence

> No improvement noticed here

lev added a comment.Mon, Jan 14, 10:44 AM

Ok, so let's try again with this latest version (I'm calling this one D18532v3):

What is your benchmark? I'm using your equilibrium script and I see a very different effect of tx_abdicate depending on the «direction» of the test: when I emulate «small network sends to big Internet» the result is different from «big Internet sends to small network». Unfortunately, there is no easy way to emulate real traffic, as equilibrium is strictly unidirectional.

It's the standard "DoS" method: I'm unidirectionally sending line-rate traffic of the smallest packet size.

lev added a comment.Mon, Jan 14, 1:18 PM

It's the standard "DoS" method: I'm unidirectionally sending line-rate traffic of the smallest packet size.

The question is: how many source/destination IPs and ports are used? That is what determines the usability of tx_abdicate for me: is it some-to-many («from LAN to WAN») or many-to-one («WAN to DMZ box in LAN»)?

I'm generating about 2000 flows and I'm also seeing a big improvement from enabling tx_abdicate with iflib.
But my understanding is that this review is not about tx_abdicate but about TX_BATCH_SIZE, and that I had to check whether this TX_BATCH_SIZE patch brings better performance in forwarding mode.
This is why I've made 2 DoS forwarding benches:

  • One comparing the TX_BATCH_SIZE patch against 2 heads configured identically
  • One comparing the performance impact brought by this TX_BATCH_SIZE patch against the performance brought by tx_abdicate
lev added a comment.Fri, Jan 18, 3:41 PM

I'm generating about 2000 flows and I'm also seeing a big improvement from enabling tx_abdicate with iflib.

I'm testing TWO scenarios:

  • First is «LAN to WAN» and flows are 10.1.0.2:2000-10.1.0.5:2004 → 10.10.10.2:2000-10.10.10.128:2006; it should be 4×5×127×7 = 17780 flows.
  • Second is «WAN to DMZ» and flows are 10.10.10.2:2000-10.10.10.254 → 10.1.0.2:2000; it should be only 253 flows.

The other trick is that I test not only «raw» routing, but also throw in IPsec (and gre and gif and ipfw, with and without NAT, so my configuration space contains 87 configurations, but here I'm speaking only about the simplest cases), which always runs from 10.1.0.1/24 to 10.10.10.0/24 between the DUT and the traffic mirror (which is much more powerful). So the first and second cases become even more asymmetrical.
The first case becomes «receive, encrypt, send through tunnel» and the second becomes «receive from tunnel, decrypt, send in clear», which should affect RSS and flow distribution, as far as I understand.

Now I've re-tested a very fresh current with this patch, with and without abdication, on an AES-NI-enabled Intel J3160 with an I211-AT. The results look like this:

Without tx_abdicate

  • Plain LAN2WAN 959Mbit/s, 1134Kp/s
  • Plain WAN2LAN 958Mbit/s, 1176Kp/s
  • IPsec AES-128-GCM LAN2WAN 720Mbit/s, 204Kp/s
  • IPsec AES-128-GCM WAN2LAN 320Mbit/s, 147Kp/s

With tx_abdicate

  • Plain LAN2WAN 958Mbit/s, 1120Kp/s
  • Plain WAN2LAN 958Mbit/s, 1186Kp/s
  • IPsec AES-128-GCM LAN2WAN 829Mbit/s, 242Kp/s
  • IPsec AES-128-GCM WAN2LAN 326Mbit/s, 84Kp/s

I cannot say that tx_abdicate makes things better or worse in the case without IPsec. I'm limited by the traffic generator, not by the router, in both cases (this can be confirmed by CPU usage numbers on the router/DUT).
But the IPsec case is another story: LAN2WAN is much better, but WAN2LAN is much worse in terms of Kpps. That is what I don't like about tx_abdicate: it is not clear whether it is a win or a loss.

Now I'm collecting data with same latest sources but without this patch.

lev added a comment.Fri, Jan 18, 5:01 PM

And here are the results without this patch:

Without tx_abdicate

  • Plain LAN2WAN 959Mbit/s, 1146Kp/s
  • Plain WAN2LAN 958Mbit/s, 1176Kp/s
  • IPsec AES-128-GCM LAN2WAN 728Mbit/s, 206Kp/s
  • IPsec AES-128-GCM WAN2LAN 325Mbit/s, 151Kp/s

With tx_abdicate

  • Plain LAN2WAN 959Mbit/s, 1133Kp/s
  • Plain WAN2LAN 958Mbit/s, 1186Kp/s
  • IPsec AES-128-GCM LAN2WAN 841Mbit/s, 247Kp/s
  • IPsec AES-128-GCM WAN2LAN 322Mbit/s, 83Kp/s

It looks like this patch does nothing in my case, and tx_abdicate is still questionable.