
TSO4 Stability Fixes, expand EM_MAXSCATTER
ClosedPublic

Authored by sbruno on Jul 24 2015, 8:21 PM.

Details

Summary

For reasons that are not clear to me, NFS can hand these adapters mbuf chains of up to 34 segments. Expand EM_MAX_SCATTER to 64 to accommodate this, since deadlocks and watchdog alerts fire when the card stalls out trying to DMA outside of its buffers.
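For readers not familiar with the driver, a minimal sketch of why the constant matters, assuming (as in the stock if_em.h/if_em.c) that EM_MAX_SCATTER feeds the nsegments limit of the transmit bus_dma tag; the parameter values shown are illustrative only and are not the committed diff:

#define EM_MAX_SCATTER	64	/* was 32; NFS has been seen handing down 34-segment chains */

/* Transmit buffer setup (illustrative): the tag will never return more
 * than EM_MAX_SCATTER segments for one packet, so this limit has to
 * cover the longest mbuf chain the stack will actually hand us. */
error = bus_dma_tag_create(bus_get_dma_tag(dev),	/* parent */
	1, 0,				/* alignment, boundary */
	BUS_SPACE_MAXADDR,		/* lowaddr */
	BUS_SPACE_MAXADDR,		/* highaddr */
	NULL, NULL,			/* filter, filterarg */
	EM_TSO_SIZE,			/* maxsize */
	EM_MAX_SCATTER,			/* nsegments */
	PAGE_SIZE,			/* maxsegsize (illustrative) */
	0,				/* flags */
	NULL, NULL,			/* lockfunc, lockarg */
	&txr->txtag);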

Clean up a bit of the TSO handling via patches proposed on the mailing lists for other NFS issues: https://lists.freebsd.org/pipermail/freebsd-net/2014-July/039306.html

This is what I propose to give to users for another round of testing.

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=200221
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=199174
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=167500
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=195078
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=159294
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=140326
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=173137

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Passed
Unit
No Test Coverage

Event Timeline

sbruno retitled this revision to TSO debugging.
sbruno updated this object.
sbruno edited the test plan for this revision. (Show Details)
sbruno edited subscribers, added: erj; removed: imp. Jul 24 2015, 9:49 PM

Eric:

This isn't a diff meant to be applied as a patch, just a roadmap of the things I've tried in order to resolve the TSO/NFS nonsense. None of the diffs in this review have had any positive effect on my failure case. Ideas for things to test next would be appreciated.

Hrm ... puzzling.

Doubling EM_MAX_SCATTER seems to make the lockup with TSO enabled go away.

This seems to indicate a mismatch between how we are setting up DMA and what the card is trying to do. Comments?

Then it's possible the hw max isn't 32. I've read several e1000 device datasheets and haven't seen a mention of the max # of buffers the device can handle, so I don't really know what it could be. Maybe it's similar to ixgbe, and the actual hw can handle 40?

I would be curious to see the result of a netperf TCP_STREAM with -m 32 or 64 and whether the card gets a not-terrible throughput. I remember on Fortville I saw mbuf chains that were up to 66 mbufs in length being passed to _xmit() when I did that.

In D3192#65058, @erj wrote:

Then it's possible the hw max isn't 32. I've read several e1000 device datasheets and haven't seen a mention of the max # of buffers the device can handle, so I don't really know what it could be. Maybe it's similar to ixgbe, and the actual hw can handle 40?

I would be curious to see the result of a netperf TCP_STREAM with -m 32 or 64 and whether the card gets a not-terrible throughput. I remember on Fortville I saw mbuf chains that were up to 66 mbufs in length being passed to _xmit() when I did that.

hrm ... "-m" ?

root@:~ # netperf -H 192.168.100.1 -n2 -W64 -p10000 -l600
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.100.1 () port 0 AF_INET : histogram : interval : dirty data : demo
Recv   Send    Send                          
Socket Socket  Message  Elapsed              
Size   Size    Size     Time     Throughput  
bytes  bytes   bytes    secs.    10^6bits/sec  

65536  32768  32768    600.02    431.90   
root@:~ # netperf -H 192.168.100.1 -n2 -W32 -p10000 -l600
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.100.1 () port 0 AF_INET : histogram : interval : dirty data : demo
Recv   Send    Send                          
Socket Socket  Message  Elapsed              
Size   Size    Size     Time     Throughput  
bytes  bytes   bytes    secs.    10^6bits/sec  

65536  32768  32768    600.01    406.11

In D3192#65249, @sbruno wrote:

hrm ... "-m" ?

Yeah, I meant the test-specific "-m"/message size option for TCP_STREAM. It appeared to me that the stack would constantly append small mbufs into long-chained TSOs that it would send to the driver if you gave it really small message sizes.

In D3192#65269, @erj wrote:

In D3192#65249, @sbruno wrote:

hrm ... "-m" ?

Yeah, I meant the test-specific "-m"/message size option for TCP_STREAM. It appeared to me that the stack would constantly append small mbufs into long-chained TSOs that it would send to the driver if you gave it really small message sizes.

oh ... derp.

root@:~ # netperf -H 192.168.100.1 -n2 -W32 -p10000 -l20 -- -m 64
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.100.1 () port 0 AF_INET : histogram : interval : dirty data : demo
Recv   Send    Send                          
Socket Socket  Message  Elapsed              
Size   Size    Size     Time     Throughput  
bytes  bytes   bytes    secs.    10^6bits/sec  

65536  32768     64    20.10      31.58   
root@:~ # netperf -H 192.168.100.1 -n2 -W32 -p10000 -l20 -- -m 32
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.100.1 () port 0 AF_INET : histogram : interval : dirty data : demo
Recv   Send    Send                          
Socket Socket  Message  Elapsed              
Size   Size    Size     Time     Throughput  
bytes  bytes   bytes    secs.    10^6bits/sec  

65536  32768     32    20.00      17.95

Those numbers don't look abysmal, I think. Try it without TSO, and if the numbers don't noticeably get better, then maybe your new limit of 64 is okay. I would also try to find out the largest nsegs value after DMA mapping in xmit() (printing it out or using DTrace) so we could try to find the longest chain it tries to send.

Quick and dirty test: iterate over message sizes of 32, 64, 128, 256 (and the default) with and without TSO enabled:

# ./test_size.sh
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.100.1 () port 0 AF_INET : histogram : interval : dirty data : demo
Recv   Send    Send                          
Socket Socket  Message  Elapsed              
Size   Size    Size     Time     Throughput  
bytes  bytes   bytes    secs.    10^6bits/sec  

 65536  32768     32    20.07      17.34   
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.100.1 () port 0 AF_INET : histogram : interval : dirty data : demo
Recv   Send    Send                          
Socket Socket  Message  Elapsed              
Size   Size    Size     Time     Throughput  
bytes  bytes   bytes    secs.    10^6bits/sec  

 65536  32768     64    20.00      31.39   
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.100.1 () port 0 AF_INET : histogram : interval : dirty data : demo
Recv   Send    Send                          
Socket Socket  Message  Elapsed              
Size   Size    Size     Time     Throughput  
bytes  bytes   bytes    secs.    10^6bits/sec  

 65536  32768    128    20.10      50.80   
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.100.1 () port 0 AF_INET : histogram : interval : dirty data : demo
Recv   Send    Send                          
Socket Socket  Message  Elapsed              
Size   Size    Size     Time     Throughput  
bytes  bytes   bytes    secs.    10^6bits/sec  

 65536  32768    256    20.00      86.87   
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.100.1 () port 0 AF_INET : histogram : interval : dirty data : demo
Recv   Send    Send                          
Socket Socket  Message  Elapsed              
Size   Size    Size     Time     Throughput  
bytes  bytes   bytes    secs.    10^6bits/sec  

 65536  32768  32768    20.01     432.64   
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.100.1 () port 0 AF_INET : histogram : interval : dirty data : demo
Recv   Send    Send                          
Socket Socket  Message  Elapsed              
Size   Size    Size     Time     Throughput  
bytes  bytes   bytes    secs.    10^6bits/sec  

 65536  32768     32    20.00      17.06   
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.100.1 () port 0 AF_INET : histogram : interval : dirty data : demo
Recv   Send    Send                          
Socket Socket  Message  Elapsed              
Size   Size    Size     Time     Throughput  
bytes  bytes   bytes    secs.    10^6bits/sec  

 65536  32768     64    20.09      30.96   
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.100.1 () port 0 AF_INET : histogram : interval : dirty data : demo
Recv   Send    Send                          
Socket Socket  Message  Elapsed              
Size   Size    Size     Time     Throughput  
bytes  bytes   bytes    secs.    10^6bits/sec  

 65536  32768    128    20.00      50.92   
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.100.1 () port 0 AF_INET : histogram : interval : dirty data : demo
Recv   Send    Send                          
Socket Socket  Message  Elapsed              
Size   Size    Size     Time     Throughput  
bytes  bytes   bytes    secs.    10^6bits/sec  

 65536  32768    256    20.00      88.69   
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.100.1 () port 0 AF_INET : histogram : interval : dirty data : demo
Recv   Send    Send                          
Socket Socket  Message  Elapsed              
Size   Size    Size     Time     Throughput  
bytes  bytes   bytes    secs.    10^6bits/sec  

 65536  32768  32768    20.01     266.59

Solid patch now. This ran overnight with my tests and does not lock up and watchdog anymore.

However ... I still don't know if the TSO code is correct for these adapters.

I see quite clearly with netperf that turning off TSO seems to be *faster*
in the cases of:

netperf -H 192.168.100.1 -p10000 -l20 -- -s64k
netperf -H 192.168.100.1 -p10000 -l20 -- -s32k

In D3192#65302, @erj wrote:

Those numbers don't look abysmal, I think. Try it without TSO, and if the numbers don't noticeably get better, then maybe your new limit of 64 is okay. I would also try to find out the largest nsegs value after DMA mapping in xmit() (printing it out or using DTrace) so we could try to find the longest chain it tries to send.

Tested for 12 hours today. The highest value of nsegs I got from NFS was 34. :-)

Change a couple of variable types to BOOL.

Remove txd_used and just decrement tx_avail in the workaround loop.

Define TSO_WORKAROUND as 4 so we know this is a hack; it is still unclear where this value originates from.
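For anyone following along without the diff open, a hypothetical sketch of how the reservation and the tx_avail accounting fit together; the helper name and field placement are assumptions for illustration, not the committed code.

/* Sketch only; see the attached diff for the real change. */
#define TSO_WORKAROUND	4	/* extra descriptors reserved per TSO packet; value is a hack of unclear origin */

static int
em_xmit_sketch(struct tx_ring *txr, int nsegs, bool do_tso)
{
	/* Refuse the packet if the ring cannot hold the segments plus the
	 * TSO reservation; overrunning this is what trips the watchdog. */
	if (txr->tx_avail < nsegs + (do_tso ? TSO_WORKAROUND : 0))
		return (ENOBUFS);

	/* Account for the reservation by decrementing tx_avail directly,
	 * instead of carrying a separate txd_used counter through the
	 * workaround loop. */
	txr->tx_avail -= nsegs;
	if (do_tso)
		txr->tx_avail -= TSO_WORKAROUND;

	return (0);
}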

sbruno retitled this revision from TSO debugging to TSO4 Stability Fixes, expand EM_MAXSCATTER. Jul 31 2015, 4:53 PM
sbruno updated this object.
sbruno added reviewers: erj, jhb.
sbruno edited subscribers, added: hiren; removed: erj.
sbruno added subscribers: kbowling, j-nitrology.com.
hiren added a reviewer: hiren.

I am not well versed in all the details, but this seems okay at first glance. @sbruno did a good set of tests with @erj and they look positive.

The only comment/concern I have is about wider testing: whether these changes break other em(4) chips (which are not tested). I'll leave that up to @sbruno and @erj.

P.S.: Until now I've seen many people disable TSO while using em(4), so if this change fixes that for anyone, I consider that a *win*.

This revision is now accepted and ready to land. Jul 31 2015, 6:57 PM

@sbruno: To be clear, did you get any chains larger than 34 using netperf or did you only run dtrace to track chain lengths with NFS?

In D3192#66107, @jhb wrote:

@sbruno: To be clear, did you get any chains larger than 34 using netperf or did you only run dtrace to track chain lengths with NFS?

I only tested NFS. Never actually looked at pushing more with netperf/iperf.

In D3192#66127, @sbruno wrote:
In D3192#66107, @jhb wrote:

@sbruno: To be clear, did you get any chains larger than 34 using netperf or did you only run dtrace to track chain lengths with NFS?

I only tested NFS. Never actually looked at pushing more with netperf/iperf.

Can you re-run some of your netperf tests from earlier with the dtrace instrumentation in place to verify that you are actually transmitting chains larger than 34? For example, erj@ earlier suggested that the limit might be 40 rather than 64 (it's not clear if you got 64 from a chip manual or if it was a SWAG). If you can see a chain above 40 then that would lend credence to your value of 64. If the number is wrong I'd worry that this would just be a device hang someone will trip over in the future for some chain that is less than 64 but beyond what the device can actually handle.

In D3192#67127, @jhb wrote:
In D3192#66127, @sbruno wrote:
In D3192#66107, @jhb wrote:

@sbruno: To be clear, did you get any chains larger than 34 using netperf or did you only run dtrace to track chain lengths with NFS?

I only tested NFS. Never actually looked at pushing more with netperf/iperf.

Can you re-run some of your netperf tests from earlier with the dtrace instrumentation in place to verify that you are actually transmitting chains larger than 34? For example, erj@ earlier suggested that the limit might be 40 rather than 64 (it's not clear if you got 64 from a chip manual or if it was a SWAG). If you can see a chain above 40 then that would lend credence to your value of 64. If the number is wrong I'd worry that this would just be a device hang someone will trip over in the future for some chain that is less than 64 but beyond what the device can actually handle.

I will rerun with netperf as erj@ suggested. However, I *never* saw a lockup outside of the NFS client use case.

The limit information is not documented anywhere; the value was a SWAG that someone added to lem(4) at some point, so I guessed it might work as a hack for em(4) as well.

In D3192#67127, @jhb wrote:
In D3192#66127, @sbruno wrote:
In D3192#66107, @jhb wrote:

@sbruno: To be clear, did you get any chains larger than 34 using netperf or did you only run dtrace to track chain lengths with NFS?

I only tested NFS. Never actually looked at pushing more with netperf/iperf.

Can you re-run some of your netperf tests from earlier with the dtrace instrumentation in place to verify that you are actually transmitting chains larger than 34? For example, erj@ earlier suggested that the limit might be 40 rather than 64 (it's not clear if you got 64 from a chip manual or if it was a SWAG). If you can see a chain above 40 then that would lend credence to your value of 64. If the number is wrong I'd worry that this would just be a device hang someone will trip over in the future for some chain that is less than 64 but beyond what the device can actually handle.

NFS writes seem to be the most reliable way of getting a chain of > 32. I never see it go above 5 in tests with iperf/netperf, for reasons that are unclear to me.

In D3192#67990, @sbruno wrote:
In D3192#67127, @jhb wrote:
In D3192#66127, @sbruno wrote:
In D3192#66107, @jhb wrote:

@sbruno: To be clear, did you get any chains larger than 34 using netperf or did you only run dtrace to track chain lengths with NFS?

I only tested NFS. Never actually looked at pushing more with netperf/iperf.

Can you re-run some of your netperf tests from earlier with the dtrace instrumentation in place to verify that you are actually transmitting chains larger than 34? For example, erj@ earlier suggested that the limit might be 40 rather than 64 (it's not clear if you got 64 from a chip manual or if it was a SWAG). If you can see a chain above 40 then that would lend credence to your value of 64. If the number is wrong I'd worry that this would just be a device hang someone will trip over in the future for some chain that is less than 64 but beyond what the device can actually handle.

NFS writes seem to be the most reliable way of getting a chain of > 32. I never see it go above 5 in tests with iperf/netperf, for reasons that are unclear to me.

This is what I'm doing, btw:

@@ -2074,6 +2086,11 @@
 	i = txr->next_avail_desc;
 
+	if (nsegs > adapter->max_nsegs) {
+		adapter->max_nsegs = nsegs;
+		device_printf(adapter->dev, "max_nsegs is now %d\n", adapter->max_nsegs);
+	}
+
 	/* Set up our transmit descriptors */
 	for (j = 0; j < nsegs; j++) {
 		bus_size_t seg_len;

When doing this, I only see nsegs hit 34, and it rarely hits 34. I can *only* seem to get this to happen with NFS writes. netperf/iperf do not manifest this.

In D3192#68019, @sbruno wrote:
In D3192#67990, @sbruno wrote:

NFS writes seem to be the most reliable way of getting a chain of > 32. I never see it go above 5 in tests with iperf/netperf, for reasons that are unclear to me.

Hmm, I bet plain writes from userland get collapsed via sbcompress() or the like.

I'm not sure then of a way to provoke longer chains easily. If Eric is happy with 64 (vs., say, 40), then I'll defer to his judgement.

sbruno edited edge metadata.
sbruno updated this object.

Refresh review based on today's head

This revision now requires review to proceed. Aug 16 2015, 7:25 PM
This revision was automatically updated to reflect the committed changes.