cxgbe(4): changes in the Tx path to help increase tx coalescing.
ClosedPublic

Authored by np on Jun 25 2020, 5:08 PM.

Details

Summary
  • Ask the firmware for the number of frames that can be stuffed in one work request.
  • Modify mp_ring to increase the likelihood of tx coalescing when there are just one or two threads that are doing most of the tx. Add teeth to the abdication mechanism by pushing the consumer lock into mp_ring. This reduces the likelihood that a consumer will get stuck with all the work even though it is above its budget.
  • Add support for coalesced tx WR to the VF driver. This, with the changes above, results in a 7x improvement in the tx pps of the VF driver for some common cases. The firmware vets the L2 headers submitted by the VF driver and it's a big win if the checks are performed for a batch of packets and not each one individually.
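
The interaction between coalescing and the consumer's budget described in the summary can be illustrated with a small sketch. It is not the driver's code: `drain_sketch`, `struct drain_result`, and the parameters `budget` and `max_per_wr` are hypothetical names, and the only idea taken from the summary is that a consumer packs up to a firmware-provided number of frames into each work request (WR) and stops once it has reached its budget, leaving the rest for whoever takes over.

```c
#include <assert.h>

/* Hypothetical summary of one drain pass; not a structure from cxgbe(4). */
struct drain_result {
	unsigned wrs;	/* work requests issued */
	unsigned sent;	/* frames handed to the hardware */
	unsigned left;	/* frames still pending when the consumer stops */
};

/*
 * Drain up to 'budget' frames out of 'pending', packing at most
 * 'max_per_wr' frames into each work request.  'max_per_wr' stands in
 * for the per-WR limit that the driver would obtain from the firmware.
 */
static struct drain_result
drain_sketch(unsigned pending, unsigned budget, unsigned max_per_wr)
{
	struct drain_result r = { 0, 0, pending };

	assert(max_per_wr > 0);
	while (r.left > 0 && r.sent < budget) {
		unsigned n = r.left < max_per_wr ? r.left : max_per_wr;

		/* Never exceed the budget with the last work request. */
		if (n > budget - r.sent)
			n = budget - r.sent;
		r.wrs++;
		r.sent += n;
		r.left -= n;
	}
	return (r);
}
```

With 20 pending frames and a budget of 16, a limit of 7 frames per WR needs 3 work requests, while a limit of 1 (no coalescing) needs 16. In both cases 4 frames remain for the next consumer.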

Diff Detail

Repository
rS FreeBSD src repository
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

np requested review of this revision. Jun 25 2020, 5:08 PM

First result on a 'small' server (8-core Xeon E5 2650) with a 10G Chelsio T540-CR, using one port for RX and the other for TX:

x r362778: inet4 packets-per-second forwarded
+ r362778 with D25454: inet4 packets-per-second forwarded
+--------------------------------------------------------------------------+
|                                                                         +|
|                                                                         +|
|                                                                         +|
|     x                                                                   +|
|x   xx  x                                                                +|
|  |__A_|                                                                  |
|                                                                         A|
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5      11767747      11782957      11777032      11776267     5465.3431
+   5      11905318      11905348      11905340      11905335     11.436783
Difference at 95.0% confidence
        129068 +/- 5636.28
        1.096% +/- 0.0483859%
        (Student's t, pooled s = 3864.59)

So it's roughly a 1.1% improvement on this hardware and use case.
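
For readers who want to check the ministat figures, the 95% interval can be re-derived from the summary rows alone. The sketch below assumes equal sample sizes and uses the tabulated two-sided 95% Student's t value for 8 degrees of freedom (2.306). The function names are made up for this example, and `sqrt_newton` is a stand-in for `sqrt(3)` so the snippet does not depend on libm.

```c
#include <assert.h>

/* Newton's method square root; a stand-in for sqrt(3) to avoid libm. */
static double
sqrt_newton(double v)
{
	double x;
	int i;

	assert(v >= 0.0);
	if (v == 0.0)
		return (0.0);
	x = v > 1.0 ? v : 1.0;
	for (i = 0; i < 100; i++)
		x = 0.5 * (x + v / x);
	return (x);
}

/*
 * Half-width of the confidence interval for the difference of two means,
 * given n samples per data set, the two standard deviations, and the
 * critical t value for 2n - 2 degrees of freedom.
 */
static double
ci_half_width(unsigned n, double sd1, double sd2, double t_crit)
{
	double pooled_var, se;

	assert(n >= 2);
	pooled_var = (sd1 * sd1 + sd2 * sd2) / 2.0;	/* equal n */
	se = sqrt_newton(pooled_var * 2.0 / n);
	return (t_crit * se);
}

/* Difference of the averages as a percentage of the baseline. */
static double
relative_gain_pct(double base_avg, double new_avg)
{
	return (100.0 * (new_avg - base_avg) / base_avg);
}
```

Calling `ci_half_width(5, 5465.3431, 11.436783, 2.306)` gives about 5636.28, and `relative_gain_pct(11776267, 11905335)` gives about 1.096, matching the output above.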

As far as I understand it, this looks ok to me.

sys/dev/cxgbe/t4_mp_ring.c
144 ↗(On Diff #73646)

Maybe use atomic_load_64 here. There is no need for _acq, and the call is effectively a NOP, but it serves as documentation.

171 ↗(On Diff #73646)

I don't think you need the 'acq' barrier/fence here, only in the atomic_fcmpset() below.
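
The pattern suggested in these two inline comments can be sketched in userspace with C11 atomics. This is only an analogue: it uses `<stdatomic.h>` rather than the kernel's atomic(9) primitives, and `BUSY_FLAG` and `try_become_consumer` are hypothetical names, not identifiers from t4_mp_ring.c. The state is read with a plain (relaxed) atomic load, and acquire ordering is applied only to the compare-and-set that actually claims ownership. Like `atomic_fcmpset`, the C11 compare-exchange writes the observed value back into the expected variable on failure, so the loop does not need to reload.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical "a consumer is active" bit in the 64-bit state word. */
#define	BUSY_FLAG	(UINT64_C(1) << 63)

/*
 * Try to become the consumer.  Returns true if this caller set BUSY_FLAG,
 * false if another consumer already holds it.
 */
static bool
try_become_consumer(_Atomic uint64_t *state)
{
	/* Plain atomic read: no ordering needed just to inspect the state. */
	uint64_t os = atomic_load_explicit(state, memory_order_relaxed);

	for (;;) {
		uint64_t ns;

		if (os & BUSY_FLAG)
			return (false);
		ns = os | BUSY_FLAG;
		/* Acquire on success only; a failed attempt is just a read. */
		if (atomic_compare_exchange_weak_explicit(state, &os, ns,
		    memory_order_acquire, memory_order_relaxed))
			return (true);
		/* 'os' now holds the value seen by the failed exchange. */
	}
}
```

The acquire on the successful exchange is what orders the new consumer's later reads of the ring after it takes ownership. The initial load only decides whether an attempt is worth making.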

This revision is now accepted and ready to land. Jun 30 2020, 9:39 PM

Incorporate feedback. atomic_load_64 is now used everywhere to read state.

This revision now requires review to proceed. Jul 3 2020, 12:17 AM
This revision was not accepted when it landed; it landed in state Needs Review. Jul 3 2020, 4:44 AM
This revision was automatically updated to reflect the committed changes.