Page MenuHomeFreeBSD

iflib - try m_defrag when m_collapse fails
ClosedPublic

Authored by krzysztof.galazka_intel.com on Jul 4 2017, 2:09 PM.
Tags
None
Referenced Files
F108364402: D11476.id31284.diff
Fri, Jan 24, 4:35 AM
Unknown Object (File)
Sat, Jan 18, 5:16 PM
Unknown Object (File)
Sat, Jan 18, 12:02 AM
Unknown Object (File)
Mon, Dec 30, 3:18 AM
Unknown Object (File)
Thu, Dec 26, 7:23 PM
Unknown Object (File)
Nov 21 2024, 10:59 PM
Unknown Object (File)
Nov 17 2024, 3:03 AM
Unknown Object (File)
Oct 29 2024, 2:12 AM

Details

Summary
  • For certain range of send sizes (usually between 256 and 512 bytes)

TSO stack tends to pass to iflib_encap mbuf chains which cannot be mapped
to segments with iflib_busdma_load_mbuf_sg. In that case m_collapse is
called but it also fails. Currnet implementation drops such mbuf chain
causing drastic loss of TX performance. There are no rejected request
for mbufs and clusters in netstat output when it happens. Tests show
that calling m_defrag when m_collapse failed works in almost 100% cases
and solves the problem. Lowering by 1 number of max segments set in ifp
struct usually eliminates need to call m_collapse at all and gives even
better results for streams with such sends.

Example results of testing with netperf:

  1. Before the patch:
[root@u2002 ~]# netperf -P0 -H u2020 -t TCP_STREAM -l 10 -- -H u2020-2 -m 512 -M 512
87380 1048576    512    10.17     391.77   

[root@u2002 ~]# netstat –m
16130/25375/41505 mbufs in use (current/cache/total)
16072/12204/28276/4194304 mbuf clusters in use (current/cache/total/max)
8213/12027 mbuf+clusters out of packet secondary zone in use (current/cache)
0/3/3/2097152 4k (page size) jumbo clusters in use (current/cache/total/max)
0/0/0/1058683 9k jumbo clusters in use (current/cache/total/max)
0/0/0/595509 16k jumbo clusters in use (current/cache/total/max)
36176K/30763K/66940K bytes allocated to network (current/cache/total)
0/0/0 requests for mbufs denied (mbufs/clusters/mbuf+clusters)
0/0/0 requests for mbufs delayed (mbufs/clusters/mbuf+clusters)
0/0/0 requests for jumbo clusters delayed (4k/9k/16k)
0/0/0 requests for jumbo clusters denied (4k/9k/16k)
0 sendfile syscalls
0 sendfile syscalls completed without I/O request
0 requests for I/O initiated by sendfile
0 pages read by sendfile as part of a request
0 pages were valid at time of a sendfile request
0 pages were valid and substituted to bogus page
0 pages were requested for read ahead by applications
0 pages were read ahead by sendfile
0 times sendfile encountered an already busy page
0 requests for sfbufs denied
0 requests for sfbufs delayed

[root@u2002 ~]# sysctl dev.ix.0 | grep mbuf
dev.ix.0.iflib.txq21.mbuf_defrag_failed: 0
dev.ix.0.iflib.txq21.mbuf_defrag: 0
dev.ix.0.iflib.txq20.mbuf_defrag_failed: 0
dev.ix.0.iflib.txq20.mbuf_defrag: 0
dev.ix.0.iflib.txq19.mbuf_defrag_failed: 0
dev.ix.0.iflib.txq19.mbuf_defrag: 0
dev.ix.0.iflib.txq18.mbuf_defrag_failed: 0
dev.ix.0.iflib.txq18.mbuf_defrag: 0
dev.ix.0.iflib.txq17.mbuf_defrag_failed: 0
dev.ix.0.iflib.txq17.mbuf_defrag: 0
dev.ix.0.iflib.txq16.mbuf_defrag_failed: 0
dev.ix.0.iflib.txq16.mbuf_defrag: 0
dev.ix.0.iflib.txq15.mbuf_defrag_failed: 0
dev.ix.0.iflib.txq15.mbuf_defrag: 0
dev.ix.0.iflib.txq14.mbuf_defrag_failed: 0
dev.ix.0.iflib.txq14.mbuf_defrag: 0
dev.ix.0.iflib.txq13.mbuf_defrag_failed: 0
dev.ix.0.iflib.txq13.mbuf_defrag: 0
dev.ix.0.iflib.txq12.mbuf_defrag_failed: 0
dev.ix.0.iflib.txq12.mbuf_defrag: 0
dev.ix.0.iflib.txq11.mbuf_defrag_failed: 0
dev.ix.0.iflib.txq11.mbuf_defrag: 0
dev.ix.0.iflib.txq10.mbuf_defrag_failed: 0
dev.ix.0.iflib.txq10.mbuf_defrag: 0
dev.ix.0.iflib.txq09.mbuf_defrag_failed: 0
dev.ix.0.iflib.txq09.mbuf_defrag: 0
dev.ix.0.iflib.txq08.mbuf_defrag_failed: 0
dev.ix.0.iflib.txq08.mbuf_defrag: 0
dev.ix.0.iflib.txq07.mbuf_defrag_failed: 0
dev.ix.0.iflib.txq07.mbuf_defrag: 0
dev.ix.0.iflib.txq06.mbuf_defrag_failed: 0
dev.ix.0.iflib.txq06.mbuf_defrag: 0
dev.ix.0.iflib.txq05.mbuf_defrag_failed: 0
dev.ix.0.iflib.txq05.mbuf_defrag: 0
dev.ix.0.iflib.txq04.mbuf_defrag_failed: 0
dev.ix.0.iflib.txq04.mbuf_defrag: 0
dev.ix.0.iflib.txq03.mbuf_defrag_failed: 0
dev.ix.0.iflib.txq03.mbuf_defrag: 0
dev.ix.0.iflib.txq02.mbuf_defrag_failed: 36
dev.ix.0.iflib.txq02.mbuf_defrag: 3
dev.ix.0.iflib.txq01.mbuf_defrag_failed: 0
dev.ix.0.iflib.txq01.mbuf_defrag: 0
dev.ix.0.iflib.txq00.mbuf_defrag_failed: 0
dev.ix.0.iflib.txq00.mbuf_defrag: 0

Average TX performance in 20 runs of netperf:

  • msg size 512: 388 Mbps
  • msg size 256K: 4372 Mbps
  1. Patch without "-1":

Average TX performance:

  • msg size 512: 3119 Mbps
  • msg size 256K: 4367 Mbps
sysctl dev.ix.0 | grep mbuf:
dev.ix.0.iflib.txq21.mbuf_defrag_failed: 1
dev.ix.0.iflib.txq21.mbuf_collapse_failed: 11904
dev.ix.0.iflib.txq21.mbuf_defrag: 12607
dev.ix.0.iflib.txq20.mbuf_defrag_failed: 0
dev.ix.0.iflib.txq20.mbuf_collapse_failed: 0
dev.ix.0.iflib.txq20.mbuf_defrag: 0
dev.ix.0.iflib.txq19.mbuf_defrag_failed: 0
dev.ix.0.iflib.txq19.mbuf_collapse_failed: 0
dev.ix.0.iflib.txq19.mbuf_defrag: 0
dev.ix.0.iflib.txq18.mbuf_defrag_failed: 1
dev.ix.0.iflib.txq18.mbuf_collapse_failed: 10609
dev.ix.0.iflib.txq18.mbuf_defrag: 11225
dev.ix.0.iflib.txq17.mbuf_defrag_failed: 0
dev.ix.0.iflib.txq17.mbuf_collapse_failed: 10858
dev.ix.0.iflib.txq17.mbuf_defrag: 11492
dev.ix.0.iflib.txq16.mbuf_defrag_failed: 0
dev.ix.0.iflib.txq16.mbuf_collapse_failed: 15449
dev.ix.0.iflib.txq16.mbuf_defrag: 16250
dev.ix.0.iflib.txq15.mbuf_defrag_failed: 0
dev.ix.0.iflib.txq15.mbuf_collapse_failed: 0
dev.ix.0.iflib.txq15.mbuf_defrag: 0
dev.ix.0.iflib.txq14.mbuf_defrag_failed: 0
dev.ix.0.iflib.txq14.mbuf_collapse_failed: 10457
dev.ix.0.iflib.txq14.mbuf_defrag: 11064
dev.ix.0.iflib.txq13.mbuf_defrag_failed: 0
dev.ix.0.iflib.txq13.mbuf_collapse_failed: 11654
dev.ix.0.iflib.txq13.mbuf_defrag: 12331
dev.ix.0.iflib.txq12.mbuf_defrag_failed: 0
dev.ix.0.iflib.txq12.mbuf_collapse_failed: 0
dev.ix.0.iflib.txq12.mbuf_defrag: 0
dev.ix.0.iflib.txq11.mbuf_defrag_failed: 0
dev.ix.0.iflib.txq11.mbuf_collapse_failed: 0
dev.ix.0.iflib.txq11.mbuf_defrag: 0
dev.ix.0.iflib.txq10.mbuf_defrag_failed: 3
dev.ix.0.iflib.txq10.mbuf_collapse_failed: 23030
dev.ix.0.iflib.txq10.mbuf_defrag: 24298
dev.ix.0.iflib.txq09.mbuf_defrag_failed: 1
dev.ix.0.iflib.txq09.mbuf_collapse_failed: 11100
dev.ix.0.iflib.txq09.mbuf_defrag: 11719
dev.ix.0.iflib.txq08.mbuf_defrag_failed: 0
dev.ix.0.iflib.txq08.mbuf_collapse_failed: 0
dev.ix.0.iflib.txq08.mbuf_defrag: 0
dev.ix.0.iflib.txq07.mbuf_defrag_failed: 0
dev.ix.0.iflib.txq07.mbuf_collapse_failed: 11064
dev.ix.0.iflib.txq07.mbuf_defrag: 11703
dev.ix.0.iflib.txq06.mbuf_defrag_failed: 1
dev.ix.0.iflib.txq06.mbuf_collapse_failed: 7404
dev.ix.0.iflib.txq06.mbuf_defrag: 7802
dev.ix.0.iflib.txq05.mbuf_defrag_failed: 0
dev.ix.0.iflib.txq05.mbuf_collapse_failed: 35825
dev.ix.0.iflib.txq05.mbuf_defrag: 37899
dev.ix.0.iflib.txq04.mbuf_defrag_failed: 1
dev.ix.0.iflib.txq04.mbuf_collapse_failed: 7323
dev.ix.0.iflib.txq04.mbuf_defrag: 7728
dev.ix.0.iflib.txq03.mbuf_defrag_failed: 0
dev.ix.0.iflib.txq03.mbuf_collapse_failed: 0
dev.ix.0.iflib.txq03.mbuf_defrag: 0
dev.ix.0.iflib.txq02.mbuf_defrag_failed: 0
dev.ix.0.iflib.txq02.mbuf_collapse_failed: 17819
dev.ix.0.iflib.txq02.mbuf_defrag: 18846
dev.ix.0.iflib.txq01.mbuf_defrag_failed: 1
dev.ix.0.iflib.txq01.mbuf_collapse_failed: 29036
dev.ix.0.iflib.txq01.mbuf_defrag: 30658
dev.ix.0.iflib.txq00.mbuf_defrag_failed: 0
dev.ix.0.iflib.txq00.mbuf_collapse_failed: 0
dev.ix.0.iflib.txq00.mbuf_defrag: 0
  1. Patch with "-1":

Average TX performance:

  • msg size 512: 3440 Mbps
  • msg size 256K: 4361 Mbps
Test Plan

Basic performance testing with netperf done.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

On the first platform: PC Engines APU2C4 (quad core AMD GX-412T Processor 1 GHz), 3 Intel i210AT Gigabit Ethernet ports

  • LRO/TSO disabled
  • harvest.mask=351
  • 2000 flows of smallest UDP packets
  • Traffic load at 1.448Mpps (Gigabit line-rate)

graph.png (1×1 px, 103 KB)

inet 4 forwarding performance is improved (+2.9%):

x r320657 inet4 packets-per-second
+ r320657 with D11476, inet4 packets-per-second
+--------------------------------------------------------------------------+
|        x                                                          +      |
|xx      x x                                                       ++   + +|
| |___A__M_|                                                               |
|                                                                  |M_A__| |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5        462894        465011        464490        464002      951.0602
+   5        476909        478369        477130      477482.2     648.55663
Difference at 95.0% confidence
        13480.2 +/- 1187.15
        2.9052% +/- 0.260947%
        (Student's t, pooled s = 813.984)

inet6 forwarding performance not impacted:

x r320657, inet6 packets-per-second
+ r320657 with D11476, inet6 packets-per-second
+--------------------------------------------------------------------------+
|+       +      +       x   x           +   x   +                         x|
|                        |__________________M____A_______________________| |
| |_____________M______A___________________|                               |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5        433045      438900.5      435451.5        435969     2818.8255
+   5      430381.5        435821        432123      432897.2     2349.4506
No difference proven at 95.0% confidence

On the second platform: Netgate RCC-VE 4860 (4 cores Intel Atom C2558E) with Quad port Intel i350.

  • LRO/TSO disabled
  • harvest.mask=351
  • 2000 flows of smallest UDP packets
  • Traffic load at 1.448Mpps (Gigabit line-rate)

graph.png (1×1 px, 98 KB)

inet4 forwarding: -1.41%

x r320657, inet4 packets-per-second
+ r320657 with D11476, inet4 packets-per-second
+--------------------------------------------------------------------------+
|+       +            +    +   +    x                        xx x         x|
|                                            |_____________A__M__________| |
|    |____________A___M________|                                           |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5      819142.5      829855.5        826483        825722     3955.9076
+   5        809218        817676        815034        814025     3565.3156
Difference at 95.0% confidence
        -11697 +/- 5492.03
        -1.41658% +/- 0.659936%
        (Student's t, pooled s = 3765.68)

No impact on inet6 forwarding:

x r320657, inet6 packets-per-second
+ r320657 with D11476, inet6 packets-per-second

+--------------------------------------------------------------------------+
|                           +                                              |
|x            x    x     +  +     x           +  x                        +|
|    |_____________M___A__________________|                                |
|                  |________M___________A____________________|             |
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5        741769        759642        748553        750108      6917.188
+   5        750663        768919        751712        756294     7721.3557
No difference proven at 95.0% confidence

@olivier - Could you tell me what tools do you use? I'd like to check if forwarding performance on other HW is also affected.

Thanks for your review.

sys/net/iflib.c
3091 ↗(On Diff #30398)

Most of this shuffling here is gratuitous. Please just put collapse failed stat increment under the remap 0 condition.

4154 ↗(On Diff #30398)

This is the wrong place to fix this. The particular piece of hardware you're working on isn't the only device that iflib supports or can support. isc_tx_tso_segments_max is set in the driver. Please update the value there.

sbruno requested changes to this revision.Jul 5 2017, 7:16 PM
This revision now requires changes to proceed.Jul 5 2017, 7:16 PM

Thanks for your review.

Thanks for your comments.

sys/net/iflib.c
3091 ↗(On Diff #30398)

Could you elaborate? The point is to try m_defrag immediately after m_collapse returns NULL and not jump to defrag_failed where whole mbuf chain is dropped. It would be better to call m_defrag only if m_collapse failure wasn't caused by m_getcl but I don't see a way to do it without changing implementation of m_collapse to return reason why it failed to shorten the mbuf chain.

4154 ↗(On Diff #30398)

isc_tx_tso_segments_max is used in two places. Here to set if_hw_tsomaxsegcount and in iflib_encap to set max_segs for iflib_busdma_load_muf_sg. The problem is not with device supporting different value but with iflib_busdma_load_mbuf_sg failing to map mbuf chain to segments. My guess is that first mbuf in such chain has buffer not aligned to page boundary so it have to be mapped to more then one segment which causes both iflib_busdma_load_mbuf_sg and m_collapse to fail. My patch saves one segment for such situation and eliminates need to go through m_collapse and m_defrag. I see around 9% better TX performance for small send size and only around 0.1% less for large send size.

Although I don't have access to platforms other then x86 so I think it might be good idea add an ifdef to limit the impact.

@olivier - Could you tell me what tools do you use? I'd like to check if forwarding performance on other HW is also affected.

I'm using a patched version of netmap pkt-gen.
Here is the generic documentation and a specific example with the PC Engines APU2.
I'm generating smallest size packet at line-rate:

pkt-gen -U -i igb2 -f tx -n 80000000 -l 60 -d 198.19.10.1:2000-198.19.10.20 -D 00:0d:b9:41:ca:3d -s 198.18.10.1:2000-198.18.10.100 -w 4

With some specific options:

  • -U, intel drivers disable hardware acceleration features in netmap mode, then I need to ask pkt-gen to do software checksum
  • -l 60: size of the Ethernet frame (excluding CRC), I'm interressted by the worse scenario (smallest packet)
  • -d range, and -s range: I'm generating about 2000 differents UDP flows

The forwarding test is not rely on the described feature. It uses small packets that are each fits into one mbuf and there is no need to collapse or defrag them.

From a quick look, the iflib code does not bind irq to CPU cores. The old em/igb drivers did that and I guess, if you add bus_bind_intr() again, this will increase the performance.

In D11476#238004, @ae wrote:

From a quick look, the iflib code does not bind irq to CPU cores. The old em/igb drivers did that and I guess, if you add bus_bind_intr() again, this will increase the performance.

Good catch, then I've adapted my old ix_affinity RC script to igb, but I didn't see lot's difference (+0.34%):

x r320657 with D11476, inet4 packets-per-second
+ r320657 with D11476, igb_affinity enabled, inet4 packets-per-second
+--------------------------------------------------------------------------+
|                                                                       +  |
|  x x x      +          x     + x                  +                   +  |
||_____M______A_____________|                                              |
|                      |________________________A___M_____________________||
+--------------------------------------------------------------------------+
    N           Min           Max        Median           Avg        Stddev
x   5        476909        478369        477130      477482.2     648.55663
+   5        477469        480270        479316      479118.2     1235.9772
Difference at 95.0% confidence
        1636 +/- 1439.46
        0.342631% +/- 0.301691%
        (Student's t, pooled s = 986.982)
In D11476#238004, @ae wrote:

From a quick look, the iflib code does not bind irq to CPU cores. The old em/igb drivers did that and I guess, if you add bus_bind_intr() again, this will increase the performance.

It does. The problem is that FreeBSD goes through the entire forwarding path for every single packet, not supporting any sort of batching. There is a doorbell coalesce optimization for content serving that is a pessimization for forwarding on FreeBSD.

sys/net/iflib.c
3091 ↗(On Diff #30398)

I was confused by the fact that you're calling remap twice. There's no reason to support a second pass if you're just going to call m_defrag straight away.

Also - I think what you're saying is we need to change the m_collapse interface to one that is more useful. Pass in a **mbuf for the result and have it return an error. The APIs are not immutable. We can just keep around a local copy in iflib for compat on older branches.

4154 ↗(On Diff #30398)

You're making a change for all drivers based on the value being, in effect, set wrong for your driver.

In D11476#238004, @ae wrote:

From a quick look, the iflib code does not bind irq to CPU cores. The old em/igb drivers did that and I guess, if you add bus_bind_intr() again, this will increase the performance.

It does. The problem is that FreeBSD goes through the entire forwarding path for every single packet, not supporting any sort of batching. There is a doorbell coalesce optimization for content serving that is a pessimization for forwarding on FreeBSD.

Look at subr_gtaskqueue

sys/net/iflib.c
4154 ↗(On Diff #30398)

This may make sense if you can establish that the same difference holds true for all drivers with segments / alignment. Otherwise it's best to admit that you can't reliably use that extra segment - at least on FreeBSD

cramerj_intel.com added inline comments.
sys/net/iflib.c
2981 ↗(On Diff #30398)

iflib_busdma_load_mbuf_sg() doesn't fail (EFBIG) because of a hardware limitation. It fails because the chain requires >= segments than the limit we specified (i >= max_segs). Should this conditional/goto (i >= max_segs) be moved to the beginning of this while-loop? If you imagine a case of setting the driver to only allow one segment, then this function will fail every time. But putting this conditional at the top will allow the segs struct to be filled out correctly for the one allowed segment.

sys/net/iflib.c
2981 ↗(On Diff #30398)

Indeed. We should retest with this change instead.

krzysztof.galazka_intel.com added inline comments.
sys/net/iflib.c
2981 ↗(On Diff #30398)

Fix proposed by Jeb solves the issue with m_collapse. I'm updating the review.

krzysztof.galazka_intel.com edited edge metadata.

Proper fix for segments accounting in iflib_busdma_load_mbuf_sg.

This revision is now accepted and ready to land.Jul 27 2017, 4:27 PM
This revision was automatically updated to reflect the committed changes.