
Fix mlx4en(4) to properly call m_defrag.
Closed, Public

Authored by mjoras on Jul 20 2017, 10:51 PM.
Details

Summary

When mlx4en(4) was converted to use BUSDMA(9), the call to m_defrag was
moved after the part of the tx routine that strips the header from the
mbuf chain. Before it called m_defrag, it first trimmed off the now-empty
mbufs from the start of the chain. This has the side effect of also
removing the head of the chain, which has M_PKTHDR set. Since m_defrag
will not defrag a chain that lacks M_PKTHDR, it was effectively never
defragging the mbuf chains.

As it turns out, trimming the mbufs in this fashion is unnecessary since
the call to bus_dmamap_load_mbuf_sg doesn't map empty mbufs anyway, so
remove it.

Additionally this introduces a counter for defrag_attempts, fixes an
outstanding issue with zeroing the oversized_packets counter, and makes
the tso_packets counter per-ring to avoid excessive cache misses in the
tx path.

Test Plan

This has been tested internally at Isilon where we end up creating a lot
of long mbuf chains, so the lack of m_defrag working was apparent.

iperf / sysctl results:

satella-1# iperf -c 2.0.0.20
------------------------------------------------------------
Client connecting to 2.0.0.20, TCP port 5001
TCP window size:   131 KByte (default)
------------------------------------------------------------
[  4] local 2.0.0.19 port 10013 connected with 2.0.0.20 port 5001
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-10.0 sec  22.6 GBytes  19.4 Gbits/sec
satella-1# iperf -c 2.0.0.20 -P 5
------------------------------------------------------------
Client connecting to 2.0.0.20, TCP port 5001
TCP window size:   131 KByte (default)
------------------------------------------------------------
[  7] local 2.0.0.19 port 10017 connected with 2.0.0.20 port 5001
[  8] local 2.0.0.19 port 10018 connected with 2.0.0.20 port 5001
[  6] local 2.0.0.19 port 10016 connected with 2.0.0.20 port 5001
[  4] local 2.0.0.19 port 10014 connected with 2.0.0.20 port 5001
[  5] local 2.0.0.19 port 10015 connected with 2.0.0.20 port 5001
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-10.0 sec  9.36 GBytes  8.01 Gbits/sec
[ ID] Interval       Transfer     Bandwidth
[  5]  0.0-10.0 sec  8.64 GBytes  7.40 Gbits/sec
[ ID] Interval       Transfer     Bandwidth
[  7]  0.0-10.0 sec  11.0 GBytes  9.40 Gbits/sec
[ ID] Interval       Transfer     Bandwidth
[  8]  0.0-10.0 sec  8.74 GBytes  7.48 Gbits/sec
[ ID] Interval       Transfer     Bandwidth
[  6]  0.0-10.0 sec  8.49 GBytes  7.27 Gbits/sec
[SUM]  0.0-10.0 sec  46.2 GBytes  39.6 Gbits/sec

satella-1# sysctl hw.mlxen1.stat | grep tso_packets
hw.mlxen1.stat.tx_ring23.tso_packets: 381
hw.mlxen1.stat.tx_ring22.tso_packets: 0
hw.mlxen1.stat.tx_ring21.tso_packets: 0
hw.mlxen1.stat.tx_ring20.tso_packets: 307702
hw.mlxen1.stat.tx_ring19.tso_packets: 0
hw.mlxen1.stat.tx_ring18.tso_packets: 0
hw.mlxen1.stat.tx_ring17.tso_packets: 0
hw.mlxen1.stat.tx_ring16.tso_packets: 0
hw.mlxen1.stat.tx_ring15.tso_packets: 1
hw.mlxen1.stat.tx_ring14.tso_packets: 1
hw.mlxen1.stat.tx_ring13.tso_packets: 0
hw.mlxen1.stat.tx_ring12.tso_packets: 0
hw.mlxen1.stat.tx_ring11.tso_packets: 0
hw.mlxen1.stat.tx_ring10.tso_packets: 1269299
hw.mlxen1.stat.tx_ring9.tso_packets: 307674
hw.mlxen1.stat.tx_ring8.tso_packets: 1
hw.mlxen1.stat.tx_ring7.tso_packets: 0
hw.mlxen1.stat.tx_ring6.tso_packets: 1
hw.mlxen1.stat.tx_ring5.tso_packets: 0
hw.mlxen1.stat.tx_ring4.tso_packets: 308278
hw.mlxen1.stat.tx_ring3.tso_packets: 0
hw.mlxen1.stat.tx_ring2.tso_packets: 307541
hw.mlxen1.stat.tx_ring1.tso_packets: 305629
hw.mlxen1.stat.tx_ring0.tso_packets: 1
hw.mlxen1.stat.tso_packets: 2806509

satella-1# iperf -s
------------------------------------------------------------
Server listening on TCP port 5001
TCP window size:   128 KByte (default)
------------------------------------------------------------
[  5] local 2.0.0.19 port 5001 connected with 2.0.0.20 port 29913
[  6] local 2.0.0.19 port 5001 connected with 2.0.0.20 port 48204
[  8] local 2.0.0.19 port 5001 connected with 2.0.0.20 port 53588
[  7] local 2.0.0.19 port 5001 connected with 2.0.0.20 port 12372
[  9] local 2.0.0.19 port 5001 connected with 2.0.0.20 port 52427
[ ID] Interval       Transfer     Bandwidth
[  5]  0.0-10.0 sec  9.08 GBytes  7.77 Gbits/sec
[ ID] Interval       Transfer     Bandwidth
[  6]  0.0-10.0 sec  9.61 GBytes  8.23 Gbits/sec
[ ID] Interval       Transfer     Bandwidth
[  8]  0.0-10.0 sec  9.23 GBytes  7.90 Gbits/sec
[ ID] Interval       Transfer     Bandwidth
[  7]  0.0-10.0 sec  9.21 GBytes  7.88 Gbits/sec
[ ID] Interval       Transfer     Bandwidth
[  9]  0.0-10.0 sec  9.04 GBytes  7.74 Gbits/sec
[SUM]  0.0-10.0 sec  46.2 GBytes  39.5 Gbits/sec

Diff Detail

Repository
rS FreeBSD src repository - subversion

Event Timeline

Hi, your patch basically looks good. Please give Mellanox some time to review; I'm currently on vacation.

sys/dev/mlx4/mlx4_en/mlx4_en_tx.c
842 ↗(On Diff #31390)

else {
	/* if all data has been inlined we don't need the mbuf any more */
	bus_dmamap_unload(ring->dma_tag, tx_info->dma_map);
	m_freem(mb);
	mb = NULL;
}

sys/dev/mlx4/stats.h
129 ↗(On Diff #31390)

I see you've added some new port statistics, but there is no corresponding SYSCTL node added?

BTW: I think the same issue exists for mlx5en.

  • Forgot to add the sysctl for defrag_attempts.
mjoras marked an inline comment as done.
  • unload / free the mbuf if everything was inlined
mjoras added inline comments.
sys/dev/mlx4/stats.h
129 ↗(On Diff #31390)

Nice catch. I missed that file in git, my mistake.

sys/dev/mlx4/mlx4_en/mlx4_en_netdev.c
2686 ↗(On Diff #31431)

Did you miss adding a node for tso_packets?

sys/dev/mlx4/mlx4_en/mlx4_en_netdev.c
2686 ↗(On Diff #31431)

tso_packets already has a node at line 2665. I just made the port stat aggregated instead of directly incremented.

BTW: I think the same issue exists for mlx5en.

Would you like me to make a separate revision for mlx5en or combine it into this one?

Make a separate patch for mlx5en. It should look very much the same as this one with regard to the TX output code.

sys/dev/mlx4/mlx4_en/mlx4_en_netdev.c
2686 ↗(On Diff #31431)

Right, but there is also a node per TX ring.

sys/dev/mlx4/mlx4_en/mlx4_en_netdev.c
2686 ↗(On Diff #31431)

Oh I understand. We didn't add that locally but it makes sense to have it so I'll add it in.

  • add per-ring sysctl node for tso_packets

Have you tested this patch?

I've tested the original, without the per-ring sysctl node for tso_packets. The m_defrag relevant bits have been in the Isilon codebase for a couple weeks now so it's been run fairly continuously since then. The tso_packets part of the diff has actually been in the codebase for over a year.

Would you like me to test the per-ring sysctl additions on hardware as well?

Can you run this patch through some basic tests, like iperf and sysctl -a?

Here's an older test output that shows defrag_attempts and oversized_packets:

durinf004-1: hw.mlxen1.stat.tx_oversized_packets: 0
durinf004-1: hw.mlxen1.stat.defrag_attempts: 117
durinf004-2: hw.mlxen1.stat.tx_oversized_packets: 0
durinf004-2: hw.mlxen1.stat.defrag_attempts: 1
Fri Jul 21 18:46:25 PDT 2017
durinf004-1: hw.mlxen1.stat.tx_oversized_packets: 0
durinf004-1: hw.mlxen1.stat.defrag_attempts: 121
durinf004-2: hw.mlxen1.stat.tx_oversized_packets: 0
durinf004-2: hw.mlxen1.stat.defrag_attempts: 1
durinf004-1# tail -n 10 logfile
Sun Jul 23 23:00:25 PDT 2017
durinf004-1: hw.mlxen1.stat.tx_oversized_packets: 0
durinf004-1: hw.mlxen1.stat.defrag_attempts: 9306
durinf004-2: hw.mlxen1.stat.tx_oversized_packets: 0
durinf004-2: hw.mlxen1.stat.defrag_attempts: 47
Sun Jul 23 23:00:55 PDT 2017
durinf004-1: hw.mlxen1.stat.tx_oversized_packets: 0
durinf004-1: hw.mlxen1.stat.defrag_attempts: 9306
durinf004-2: hw.mlxen1.stat.tx_oversized_packets: 0
durinf004-2: hw.mlxen1.stat.defrag_attempts: 47
sys/dev/mlx4/mlx4_en/mlx4_en_netdev.c
2780 ↗(On Diff #31438)

Please add a node for defrag_attempts as well!

  • add per-ring sysctl for defrag_attempts

Hi,

We are currently testing this patch internally. Testing will be done by Monday. Do you mind if I push it, so I can have it in my MFC queue?

--HPS

This revision is now accepted and ready to land. Aug 3 2017, 9:03 AM
sys/dev/mlx4/mlx4_en/mlx4_en_tx.c
838 ↗(On Diff #31474)

Just looking at the diff, and not the full context, but can you just skip calling busdma entirely if you've inlined everything? It seems like that would improve small packet performance.

Hi,

We are currently testing this patch internally. Testing will be done by Monday. Do you mind if I push it, so I can have it in my MFC queue?

--HPS

Sure, you can commit it and MFC as appropriate.

sys/dev/mlx4/mlx4_en/mlx4_en_tx.c
838 ↗(On Diff #31474)

By skip entirely do you mean not calling bus_dmamap_load_mbuf_sg? That's essentially what the original version was doing with the skip_dma label. I could have written this change to do that and just not free the first mbufs in the chain, or otherwise preserved M_PKTHDR, but removing the loop and letting busdma iterate over the chain seemed simpler. I figured the initial busdma load would be cheap, since for empty mbufs it is essentially a no-op loop.

I don't have a great way to really benchmark small packet performance but I'd be amenable to changing it if a regression in small packet perf could be shown relative to the original.

gallatin added inline comments.
sys/dev/mlx4/mlx4_en/mlx4_en_tx.c
838 ↗(On Diff #31474)

Well, I guess it should be cheap, since they are always in cache.

A very simple way to test small packet performance is to do something like netperf -H $SOME_HOST -t UDP_STREAM -- -m 1. This seldom hits limits in NIC drivers, though, since FreeBSD is generally so bad for pps performance, and you wind up having to use something like netgraph.

sys/dev/mlx4/mlx4_en/mlx4_en_tx.c
838 ↗(On Diff #31474)

Ah, yeah I've seen some pps performance benchmarking tools made in ng. rstone has just such a tool and I recall even kernel-kernel pps performance was never anywhere close to what was theoretically achievable... I've been told netmap can do better but I have no experience with it myself.

This revision was automatically updated to reflect the committed changes.