
Use busdma unconditionally in iflib
Closed, Public

Authored by gallatin on Nov 8 2018, 12:06 AM.

Details

Summary

Iflib has a complex mechanism to choose between using busdma and raw pmap_kextract() at runtime. This added complexity makes the code harder to maintain, and arguably hides bugs.
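
For readers unfamiliar with the two paths, the sketch below illustrates the kind of dual mapping logic being removed. It is a minimal illustration, not the actual iflib code: buf_to_busaddr(), load_cb(), and struct load_arg are hypothetical names, and the real driver loads mbuf chains rather than single buffers.

#include <sys/param.h>
#include <machine/bus.h>
#include <vm/vm.h>
#include <vm/pmap.h>

struct load_arg {
	bus_addr_t	la_addr;
	int		la_error;
};

static void
load_cb(void *arg, bus_dma_segment_t *segs, int nseg, int error)
{
	struct load_arg *la = arg;

	la->la_error = error;
	la->la_addr = (error == 0 && nseg == 1) ? segs[0].ds_addr : 0;
}

static int
buf_to_busaddr(bus_dma_tag_t tag, bus_dmamap_t map, void *buf, int len,
    bool use_busdma, bus_addr_t *addr)
{
	struct load_arg la;
	int err;

	if (!use_busdma) {
		/* Old fast path: assumes a direct 1:1 virtual/bus mapping. */
		*addr = pmap_kextract((vm_offset_t)buf);
		return (0);
	}
	/*
	 * busdma path: with BUS_DMA_NOWAIT the callback runs synchronously,
	 * so la is filled in (or an error returned) before the load returns.
	 */
	err = bus_dmamap_load(tag, map, buf, len, load_cb, &la, BUS_DMA_NOWAIT);
	if (err != 0 || la.la_error != 0)
		return (err != 0 ? err : la.la_error);
	*addr = la.la_addr;
	return (0);
}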

The stated purpose of having the raw pmap_kextract() path alongside busdma was to improve performance. However, on my setup (dual ixl 40GbE interfaces on a Haswell-based E5-2697 v3), I'm unable to measure any meaningful difference in either packet forwarding rate or packet drop rate with this patch versus the stock tree. We run a less extensive version of this patch at Netflix and have noticed no performance issues from using busdma in our CDN workload.

While working on this patch, I uncovered several pre-existing issues, mostly centered around failing to call bus_dmamap_unload(), and unneeded bus_dmamap_load() / pmap_kextract() calls on clusters that have not been reallocated in _iflib_fl_refill(). Note that these are not fixed here; I plan to tackle them in a separate review.

Note that you may want to hold off on reviewing until Olivier can verify that it does no harm on his forwarding setup and until I can test it on a Netflix workload. I think there may be several revisions of this patch.

Test Plan
  • Prove there are no performance regressions for small-packet / high-packet-rate workloads using the same test setup.
  • Run a Netflix CDN type workload over it to ensure that there are no correctness regressions.

Diff Detail

Repository
rS FreeBSD src repository - subversion

Event Timeline

Here is the first result showing the forwarding-performance impact on low-end hardware (AMD GX-412TC, 4 cores, with an Intel i210AT NIC):

x fbsd head r340244: Inet 4 packets-per-second
+ fbsd head r340244 with D17901: inet 4 packets-per-second
+--------------------------------------------------------------------------+
|+++  ++                                                          x  x  x x|
|                                                                  |__AM__||
||_MA_|                                                                    |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   4        294377        298868      297140.5      296881.5     1970.9345
+   5      259300.5        262676      260389.5      260808.8     1396.9031
Difference at 95.0% confidence
        -36072.7 +/- 2645.15
        -12.1505% +/- 0.827845%
        (Student's t, pooled s = 1667.29)

There is a 12% performance degradation. But on a router receiving a high packet rate, since the switch to iflib we need to enable dev.igb.X.iflib.tx_abdicate=1.
So here is the impact with this sysctl enabled:

x fbsd head r340244 (and tx_abdicate=1): Inet 4 packets-per-second
+ fbsd head r340244 with D17901 (and tx_abdicate=1): inet 4 packets-per-second
+--------------------------------------------------------------------------+
|x            x+                +                +           +  x     +  x |
| |___________________________________AM__________________________________||
|                       |_____________________A__M__________________|      |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   4        326726        339838     333622.25     333452.12     6551.6273
+   5        329354        339363        335510      334854.2     4014.1314
No difference proven at 95.0% confidence

No performance degradation with this sysctl (which should be enabled on a router/firewall).

I forgot to disable IP redirects in my previous benchmark.
Here are the other results, on 3 different small platforms (about -5% on all).
All of these use common tuning settings for routing/firewalling:

  • net.inet.ip.redirect=0 and net.inet6.ip6.redirect=0, which re-enables the fast-forwarding path
  • dev.igb|ix.X.iflib.tx_abdicate=1, because they are receiving the maximum link packet rate

PC Engines APU2 (AMD GX-412TC, 4 cores, with an Intel i210AT NIC):

x fbsd head r340244: Inet 4 packets-per-second
+ fbsd head r340244 with D17901: inet 4 packets-per-second
+--------------------------------------------------------------------------+
|        +                                                                 |
| ++     +            +                               x x       x         x|
|                                                    |______M_A________|   |
||_______A_______|                                                         |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   4        826095        844780      831274.5        833356     8592.6654
+   5        776161      795315.5      782767.5      782884.2     7603.0743
Difference at 95.0% confidence
        -50471.8 +/- 12758.7
        -6.05645% +/- 1.48636%
        (Student's t, pooled s = 8042.11)

Netgate RCC-VE 4860 (quad-core Intel Atom C2558 with an Intel i350 NIC):

x fbsd head r340244: Inet 4 packets-per-second
+ fbsd head r340244 with D17901: inet 4 packets-per-second
+--------------------------------------------------------------------------+
|++      + +                                                   x x   x    x|
|                                                               |___AM__|  |
||____A____|                                                               |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5       1056868       1066561       1061918     1061127.6     3819.1714
+   4       1000961       1009651     1004882.2     1005094.1     4471.8518
Difference at 95.0% confidence
        -56033.5 +/- 6523.01
        -5.28056% +/- 0.59894%
        (Student's t, pooled s = 4111.6)

SuperMicro 5018A-FTN4 (8-core Atom C2758, with a 10G Intel 82599):

x fbsd head r340244: Inet 4 packets-per-second
+ fbsd head r340244 with D17901: inet 4 packets-per-second
+--------------------------------------------------------------------------+
|+   +      +   +           +                       x x x        x        x|
|                                                  |____M___A________|     |
| |_________MA_________|                                                   |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5     3628309.5       3722660       3645922     3664106.9     39321.273
+   5       3415369       3531073       3461932       3463892     45037.097
Difference at 95.0% confidence
        -200215 +/- 61657
        -5.46422% +/- 1.64359%
        (Student's t, pooled s = 42275.9)

Thanks Olivier. 5% is about what I had expected. I think I see a way to improve that, though. We seem to be doing repeated virt-to-phys translation on clusters from which we have copied out a small mbuf on the rx side in iflib_rxd_pkt_get(). We call rxd_frag_to_sd() with a FALSE arg to prevent unmapping, yet in _iflib_fl_refill() we seem to always do the mapping, even when we do not re-allocate clusters. This is one bit that's going to be more expensive on low-end boxes, since it will now result in a busdma callback function being called. I think this is an actual bug that would impact systems with IOMMUs.

Let me see if I can fix this correctly & have you re-test.

Thanks!

  • Fixed a pre-existing bug where, when receiving small packets, rx clusters were not unmapped, yet they were re-mapped when refilling the ring. This was fixed by keeping track of the bus address in ifsd_ba and simply reusing it rather than redoing the virt-to-bus mapping.
  • Noticed that ifsd_flags were essentially unused, except to track the allocation state of a ring entry. It seemed cheaper and easier to use a NULL / non-NULL ifsd_cl[] entry as the indication of whether or not a slot is allocated. Removing the flags pays for the new ifsd_ba tracking on the receive side and potentially saves some cache/memory bandwidth on the tx side. (A rough sketch of this refill logic follows below.)
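
To make the two points above concrete, here is a rough sketch of the refill idea. The ifsd_cl and ifsd_ba names come from this review; refill_slot(), refill_cb(), struct refill_cb_arg, and the parameter layout are hypothetical, and error handling is elided, so this illustrates the idea rather than the committed _iflib_fl_refill().

#include <sys/param.h>
#include <sys/systm.h>
#include <sys/errno.h>
#include <sys/malloc.h>
#include <sys/mbuf.h>
#include <machine/bus.h>

struct refill_cb_arg {
	bus_dma_segment_t	seg;
	int			error;
};

static void
refill_cb(void *arg, bus_dma_segment_t *segs, int nseg, int error)
{
	struct refill_cb_arg *cb = arg;

	cb->error = error;
	if (error == 0 && nseg == 1)
		cb->seg = segs[0];
}

static int
refill_slot(bus_dma_tag_t tag, bus_dmamap_t map, caddr_t *ifsd_cl,
    bus_addr_t *ifsd_ba, int size, bus_addr_t *busaddr)
{
	struct refill_cb_arg cb = { .error = 0 };
	caddr_t cl;
	int err;

	if (*ifsd_cl == NULL) {
		/* Empty slot: allocate a cluster and map it exactly once. */
		cl = m_cljget(NULL, M_NOWAIT, size);
		if (cl == NULL)
			return (ENOMEM);
		err = bus_dmamap_load(tag, map, cl, size, refill_cb, &cb,
		    BUS_DMA_NOWAIT);
		if (err != 0 || cb.error != 0)
			return (err != 0 ? err : cb.error);
		*ifsd_cl = cl;
		*ifsd_ba = cb.seg.ds_addr;	/* remember the bus address */
	}
	/*
	 * A non-NULL ifsd_cl entry marks the slot as allocated (replacing the
	 * old ifsd_flags check), and ifsd_ba already holds its bus address,
	 * so a cluster kept after a small-packet copy is not reloaded here.
	 */
	*busaddr = *ifsd_ba;
	return (0);
}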

With these changes, my forwarding perf on my Haswell Xeon / ixl setup has improved from 13.9Mpps (stock and original patch) to 16.4Mpps.

Yes, I confirm a small improvement on the small devices (same tuning: iflib tx_abdicate=1 and IP redirects disabled).

PC Engines APU2 (AMD GX-412TC, 4 cores, with an Intel i210AT NIC):

x fbsd head r340244: inet4 packets-per-second
+ fbsd head r340244 with D17901: inet4 packets-per-second
* fbsd head r340244 with D17901-Diff50252: inet4 packets-per-second
+--------------------------------------------------------------------------+
|                                             *                            |
|                                             *                            |
|*                              ++  ++      + **               x    x     x|
|                                                             |___MA____|  |
|                               |___MA___|                                 |
|                |___________________A________M__________|                 |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   4        826095        844780      831274.5        833356     8592.6654
+   5        776161      795315.5      782767.5      782884.2     7603.0743
Difference at 95.0% confidence
        -50471.8 +/- 12758.7
        -6.05645% +/- 1.48636%
        (Student's t, pooled s = 8042.11)
*   5        724137        799801      798118.5      783743.6     33330.242
Difference at 95.0% confidence
        -49612.4 +/- 40956.2
        -5.95333% +/- 4.90112%
        (Student's t, pooled s = 25815.6)

Netgate RCC-VE 4860 (quad-core Intel Atom C2558 with an Intel i350 NIC):

x fbsd head r340244: inet4 packets-per-second
+ fbsd head r340244 with D17901: inet4 packets-per-second
* fbsd head r340244 with D17901-Diff50252: inet4 packets-per-second
+--------------------------------------------------------------------------+
|                        *                                                 |
|++      + +  *          *     *                               x x   x    x|
|                                                               |___AM__|  |
||____A____|                                                               |
|                |______AM_____|                                           |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5       1056868       1066561       1061918     1061127.6     3819.1714
+   4       1000961       1009651     1004882.2     1005094.1     4471.8518
Difference at 95.0% confidence
        -56033.5 +/- 6523.01
        -5.28056% +/- 0.59894%
        (Student's t, pooled s = 4111.6)
*   4       1012208       1027898     1022372.2     1021212.6      6548.778
Difference at 95.0% confidence
        -39915 +/- 8199.99
        -3.76156% +/- 0.763812%
        (Student's t, pooled s = 5168.64)

SuperMicro 5018A-FTN4 (8-core Atom C2758, with a 10G Intel 82599):

x fbsd head r340244: inet4 packets-per-second
+ fbsd head r340244 with D17901: inet4 packets-per-second
* fbsd head r340244 with D17901-Diff50252: inet4 packets-per-second
+--------------------------------------------------------------------------+
|                       *                                                  |
|+   +      +  *+       **  *                       x x x        x        x|
|                                                  |____M___A________|     |
| |_________MA_________|                                                   |
|                  |___AM___|                                              |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5     3628309.5       3722660       3645922     3664106.9     39321.273
+   5       3415369       3531073       3461932       3463892     45037.097
Difference at 95.0% confidence
        -200215 +/- 61657
        -5.46422% +/- 1.64359%
        (Student's t, pooled s = 42275.9)
*   5       3475303       3528248       3513999     3509041.7     19869.047
Difference at 95.0% confidence
        -155065 +/- 45434
        -4.23201% +/- 1.19836%
        (Student's t, pooled s = 31152.4)
This revision is now accepted and ready to land. Nov 13 2018, 9:25 PM
This revision was automatically updated to reflect the committed changes.