cxgbe(4): changes in the Tx path to help increase tx coalescing.
Closed, Public

Authored by np on Jun 25 2020, 5:08 PM.

Details

Reviewers

jhb

Commits

rS362905: cxgbe(4): changes in the Tx path to help increase tx coalescing.

Summary

Ask the firmware for the number of frames that can be stuffed in one work request.

Modify mp_ring to increase the likelihood of tx coalescing when there are just one or two threads that are doing most of the tx. Add teeth to the abdication mechanism by pushing the consumer lock into mp_ring. This reduces the likelihood that a consumer will get stuck with all the work even though it is above its budget.
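The abdication idea can be illustrated with a minimal sketch in C11. This is not the driver's mp_ring code: `toy_ring`, `toy_ring_init`, and `toy_enqueue_and_drain` are hypothetical names, and the ring is reduced to a counter of pending items. It only shows a consumer lock that lives inside the ring and a consumer that stops once it has drained its budget, leaving the remainder for whoever enqueues next.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical toy ring; the real struct mp_ring is more involved. */
struct toy_ring {
	atomic_size_t pending;   /* items waiting to be transmitted */
	atomic_flag   cons_lock; /* consumer lock owned by the ring itself */
};

static void
toy_ring_init(struct toy_ring *r)
{
	atomic_init(&r->pending, 0);
	atomic_flag_clear(&r->cons_lock);
}

/*
 * Enqueue nitems, then try to become the consumer.  The consumer drains at
 * most `budget` items and then abdicates by dropping the lock, even if work
 * remains.  Returns the number of items this caller drained.
 *
 * Simplification: a producer that loses the race for the lock just leaves
 * its items behind; a real ring must make sure someone picks them up.
 */
static size_t
toy_enqueue_and_drain(struct toy_ring *r, size_t nitems, size_t budget)
{
	size_t drained = 0;

	assert(budget > 0);
	atomic_fetch_add(&r->pending, nitems);

	/* Somebody else is already consuming: leave the work to them. */
	if (atomic_flag_test_and_set_explicit(&r->cons_lock,
	    memory_order_acquire))
		return (0);

	while (drained < budget) {
		size_t cur = atomic_load(&r->pending);
		size_t room = budget - drained;
		size_t take = cur < room ? cur : room;

		if (take == 0)
			break;
		/* Only the lock holder subtracts, so pending >= take. */
		atomic_fetch_sub(&r->pending, take);
		drained += take;
	}

	/* Abdicate: release the lock whether or not work is left. */
	atomic_flag_clear_explicit(&r->cons_lock, memory_order_release);
	return (drained);
}
```

With a budget of 4 and 10 items enqueued, the first caller drains 4 and leaves 6 for the next caller instead of being stuck with all of them.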

Add support for coalesced tx WR to the VF driver. This, with the changes above, results in a 7x improvement in the tx pps of the VF driver for some common cases. The firmware vets the L2 headers submitted by the VF driver and it's a big win if the checks are performed for a batch of packets and not each one individually.
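To see why batching helps, assume the firmware performs its L2 header checks once per work request. The number of work requests, and so the number of check rounds, then shrinks with the per-WR frame limit. The helper below is a hypothetical illustration of that arithmetic; `toy_wr_count` and its parameters are not names from the driver, and the actual limit is whatever the firmware reports.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: number of work requests needed to send nframes
 * when one work request can carry at most max_per_wr frames.
 */
static size_t
toy_wr_count(size_t nframes, size_t max_per_wr)
{
	assert(max_per_wr > 0);
	/* Ceiling division: a partially filled WR still costs one WR. */
	return ((nframes + max_per_wr - 1) / max_per_wr);
}
```

For example, 64 frames need 64 work requests when sent one per WR, but only 16 if each WR can carry 4 frames.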

Diff Detail

Repository

rS FreeBSD src repository - subversion

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

np created this revision. Jun 25 2020, 5:08 PM

Herald added subscribers: ae, imp. Jun 25 2020, 5:08 PM

Harbormaster completed remote builds in B31958: Diff 73646. Jun 25 2020, 5:08 PM

np requested review of this revision. Jun 25 2020, 5:08 PM

gallatin added a subscriber: olivier. Jun 26 2020, 2:59 PM

First result on a 'small' server (8-core Xeon E5-2650) with a 10G Chelsio T540-CR, using one port for RX and the other for TX:

x r362778: inet4 packets-per-second forwarded
+ r362778 with D25454: inet4 packets-per-second forwarded
+--------------------------------------------------------------------------+
|                                                                         +|
|                                                                         +|
|                                                                         +|
|     x                                                                   +|
|x   xx  x                                                                +|
|  |__A_|                                                                  |
|                                                                         A|
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5      11767747      11782957      11777032      11776267     5465.3431
+   5      11905318      11905348      11905340      11905335     11.436783
Difference at 95.0% confidence
        129068 +/- 5636.28
        1.096% +/- 0.0483859%
        (Student's t, pooled s = 3864.59)

So it's about a 1.1% improvement on this hardware and use case.

As far as I understand it, this looks ok to me.

sys/dev/cxgbe/t4_mp_ring.c
144 ↗	(On Diff #73646)	Maybe use `atomic_load_64` here. There is no need for `_acq`, and the call is effectively a NOP over a plain load, but it documents the intent.
171 ↗	(On Diff #73646)	I don't think you need the 'acq' barrier/fence here, only in the atomic_fcmpset() below.
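The pattern suggested in these two comments can be sketched with C11 atomics standing in for FreeBSD's atomic(9) calls. This is an assumption-laden analogue, not the code under review: `toy_claim` is a hypothetical function, the relaxed load plays the role of `atomic_load_64`, and the acquire compare-exchange plays the role of an acquire `atomic_fcmpset`. The initial load only seeds the loop, so it needs no ordering; the acquire semantics sit on the compare-exchange that actually publishes the update.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical example: atomically add `add` to *state and return the
 * previous value.
 */
static uint64_t
toy_claim(_Atomic uint64_t *state, uint64_t add)
{
	uint64_t old;

	assert(state != NULL);

	/* Plain (relaxed) load: just a starting guess for the loop. */
	old = atomic_load_explicit(state, memory_order_relaxed);

	/*
	 * On failure the compare-exchange rewrites `old` with the current
	 * value, so the loop retries with fresh data.  Acquire ordering is
	 * requested only for the successful exchange.
	 */
	while (!atomic_compare_exchange_weak_explicit(state, &old, old + add,
	    memory_order_acquire, memory_order_relaxed))
		;

	return (old);
}
```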