Page MenuHomeFreeBSD

Reduce thread count in driver to reduce out of order transmit processing
AbandonedPublic

Authored by sbruno on Feb 25 2016, 1:03 AM.

Details

Summary

em(4) suffers from threads jumping in and out of context.

em_local_timer() is not rearmed enough and will jump in during normal processing of queues and transmit out of order data.

Reduce thread count by consolidating Multi-queue processing to a combined tx/rx model.

Remove restrictions on 82574 for EM_MULTIQUEUE. With needing only 3 MSI-X interrupts, this should allow other cards to be tested. I have not done this.

Do not defer to em_handle_que() taskqueue during transmit processing, ever.

Test Plan

Need further testing, this is initial review.

My test setup was to have a single host with an 82574L connected to a switch on a seperate network connected to my test host.

[netboot/nfs test machine](83574L) <-> [gig-e switch] <-> [lab server](realtek gig-e or intel gig-e, 82574L).

From the lab server I run a netperf server listening on port 10001 (/usr/local/bin/netserver -D -p 10001)

From the test machine, running netperf and checking the value of netstat -s | grep "duplicate ack" will show result in ever increasing values. In this scenario, this should not happen. Even with the patches contained in this ticket or running without EM_MULTIQUEUE enabled, duplicate acks increase.

Example test output:

# netperf -H 192.168.100.1 -n4 -p10001 -l600 -- -s64K ; netstat -s | grep "duplicate ack"
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.100.1 () port 0 AF_INET : histogram : interval : dirty data : demo
Recv   Send    Send                          
Socket Socket  Message  Elapsed              
Size   Size    Size     Time     Throughput  
bytes  bytes   bytes    secs.    10^6bits/sec  

 65536  65536  65536    600.08    835.92   
                10202 duplicate acks

Diff Detail

Repository
rS FreeBSD src repository
Lint
Lint OK
Unit
No Unit Test Coverage
Build Status
Buildable 5878
Build 6165: arc lint + arc unit

Event Timeline

sbruno updated this revision to Diff 13705.Feb 25 2016, 1:03 AM
sbruno retitled this revision from to Reduce thread count in driver to reduce out of order transmit processing.
sbruno updated this object.
sbruno edited the test plan for this revision. (Show Details)
sbruno added reviewers: gnn, hiren.
sbruno updated this object.Feb 25 2016, 1:07 AM
sbruno edited edge metadata.
sbruno edited the test plan for this revision. (Show Details)Feb 25 2016, 1:19 PM
sbruno edited the test plan for this revision. (Show Details)
sbruno edited the test plan for this revision. (Show Details)
sbruno edited the test plan for this revision. (Show Details)
sbruno updated this revision to Diff 13731.Feb 25 2016, 4:33 PM
sbruno updated this object.

Fix copy/paste error in em_setup_msix() where it was checking the same condition twice.

Move variable in struct adapter up a line so it doesn't generate a diff.

sbruno updated this revision to Diff 13732.Feb 25 2016, 4:34 PM

Remove whitespace and put the tab back.

jeffrey.e.pieper_intel.com requested changes to this revision.Feb 25 2016, 8:56 PM

This causes a kernel panic on i219, simply by assigning an IP and ifup.

This revision now requires changes to proceed.Feb 25 2016, 8:56 PM
sbruno added inline comments.Feb 25 2016, 11:34 PM
sys/dev/e1000/if_em.c
2594–2595

Derp ... I didn't change this to point to the que->que_task nor did I change the taskqueue creation below. I should do that. :-)

sbruno updated this revision to Diff 13745.Feb 26 2016, 1:14 AM
sbruno edited edge metadata.

Fixup legacy IRQ handling by assigning taskqueues correctly ... geesh.

sbruno updated this revision to Diff 13987.Mar 2 2016, 3:28 PM
sbruno edited edge metadata.

Further consolidate em_handle_que/em_msix_que now that they should always do the
same thing. em_msix_que() updates the irq count for the que and then invokes
em_handle_que, simplifying and reducing code.

sbruno marked an inline comment as done.Mar 3 2016, 2:24 PM
sbruno updated this revision to Diff 14072.Mar 4 2016, 1:40 PM

Remove a couple of debug printf's that made their way into this review.

gnn accepted this revision.Mar 4 2016, 1:47 PM
gnn edited edge metadata.
sbruno updated this revision to Diff 14108.Mar 5 2016, 9:11 PM
sbruno edited edge metadata.

Remove __noinline change to em_xmit() that was inplace for dtrace testing.

erj added a subscriber: erj.Mar 6 2016, 1:14 AM
erj added inline comments.
sys/dev/e1000/if_em.c
101

Do these get used ever?

sys/dev/e1000/if_em.h
3

Copyright is older in the new version!

sbruno updated this revision to Diff 14122.Mar 6 2016, 6:07 PM
sbruno edited edge metadata.

Address review comments.

sbruno marked 2 inline comments as done.Mar 6 2016, 6:08 PM

Updated for comments.

sbruno updated this revision to Diff 14298.Mar 13 2016, 5:38 PM

Iterim update.

  • Fix non EM_MULTIQUEUE behavior on non-MSIX devices (tested on i217 laptop)
  • Add sysctl monitor entries for hash counters.
  • Add sysctl monitor for per queue tx and rx counters

I think Sean should get an email about this.

Sorry to everyone else who also gets an email about this.

This comment was removed by sbruno.
This comment was removed by sbruno.
sbruno updated this revision to Diff 14747.Mar 30 2016, 7:20 PM

More preview release quality code here.

I've mimic'd what jhb keeps trying to do for igb(4) and always defer tx processing
to the taskqueue, this keeps the transmit queue from sending out of order due to this
driver's design.

Changed a check in em_xmit() to avoid an underflow condition that I caught with dtrace.

Add RSS code now while I'm here. This seems to slow down the performance of my little ATOM
box too much, but might be good on a machine that has higher horsepower.

Adrian:

When I added the RSS bits to em(4) performance dropped well below the non-multiqueue case (20-30%). However, my test machine *was* a small Atom server with 4G ram. How much impact should the speed of the CPU have in this type of case?

sbruno added a reviewer: jhb.Mar 30 2016, 7:25 PM

John:

I've stolen the code that you keep trying to get to "stick" to igb(4) to *always* defer transmits to the txq_queue taskqueue. This seems to lock down my out of order bits and duplicate ack issues.

Is there some reason that its not quite the same in igb(4) any more? Does it need to be updated in igb(4) as well?

adrian edited edge metadata.Mar 30 2016, 7:46 PM

Sean,

How many RSS bits are you dropping it to? What's taking more CPU time?

Sean,
How many RSS bits are you dropping it to? What's taking more CPU time?

Hrm ... this question seems to indicate I don't understand something.

What do you mean "RSS bits" ?

sbruno updated this revision to Diff 14932.Apr 6 2016, 2:46 PM
sbruno edited edge metadata.

Fix build on i386

sbruno updated this revision to Diff 18287.Jul 11 2016, 3:28 PM

Update for 12-CURRENT

gnn accepted this revision.Jul 11 2016, 8:23 PM
gnn edited edge metadata.
sbruno updated this revision to Diff 19156.Aug 9 2016, 2:23 PM
sbruno edited edge metadata.

Fixup compile of non-multiqueue case.

adrian added a comment.Aug 9 2016, 4:09 PM

This looks like a couple of patches all put into one patch.

Shall we break out the RSS bits first into a separate patch?

This looks like a couple of patches all put into one patch.
Shall we break out the RSS bits first into a separate patch?

I don't disagree with your request, but I don't know that its super relevant without the multiple queue fixes here.

I really wish that I could "put this review on hold" or something so I can update it and work on it without spamming updates, but I'm kind of using this review as source control for the time being while waiting for my lab to get unpacked/rebuilt.

sbruno updated this revision to Diff 19212.Aug 11 2016, 11:21 PM
sbruno edited edge metadata.

em_allocate_msix():

  • wrap EM_MULTIQUEUE specific taskqueue initialization in the proper #ifdef
sbruno updated this revision to Diff 20767.Sep 28 2016, 1:19 PM

Trying MTX shenanigans with this review.

Assume that, in the tx path, if you are unable to acquire the TX_LOCK, its ok to just
exit (inspired by comments and code from npn).

Assume that, in the interrupt path (em_handle_que), that if you are unable to acquire the
TX_LOCK or RX_LOCK, its ok to exit.

Assume that, if the TX taskqueue has a pending thread enqueued *after* we enqueue our xmit
data, its ok to exit and allow it to be sent by the already pending thread so as to not
stack up taskqueue threads that don't really need to be instantiated.

Allow the default interrupt rate to be 1/2 of its normal value in the multiqueue case. This
seems to lower the amount of noise I see in my data and makes the machine operate a bit smoother.

Thoughts on this nonsense are welcome.

sbruno updated this revision to Diff 21778.Oct 28 2016, 5:45 PM

Cleanup some RX_LOCK handling.

sbruno updated this revision to Diff 21943.Nov 2 2016, 2:41 PM

When configuring the MRQC, don't blindly overwrite defaults values.

sbruno updated this revision to Diff 21944.Nov 2 2016, 2:43 PM

Commit correct version of MRQC configuration commands.

sbruno updated this revision to Diff 22042.Nov 7 2016, 3:52 AM

Rename em_local_timer to em_watchdog.

Add a seperate stats collecting thread now that the watchdog thread isn't firing
constantly.

Update the stats reported to the network stack.

Set CTRL_EXT LS and LS_FLOW as per the 82574 technical document (10.2.25).

sbruno updated this revision to Diff 22064.Nov 7 2016, 9:54 PM

Iterate over queues in netmap's unblock tasks function.

sepherosa_gmail.com added inline comments.
sys/dev/e1000/if_em.c
1066–1069

I am not quite sure about the comment; since the packet order is actually maintained by the drbr. The code here is ok; and the usage of ta_pending is interesting ;). BTW, do we need compiler barrier here at least, mainly to prevent ta_pending got loaded before any instructions in drbr_enqueue().

4921

This should be M_HASHTYPE_OPAQUE_HASH on the current.

The try lock scheme you have used will cause two types of issues:

  • Packets would sit on drbr forever, if the following situation happened:
    1. Deferred start's trylock succeeds, but no more TX descs available, it just does txeof, assume lock is not released yet.
    2. When taskqueue is doing 1), interrupt for TXEOF happens, but its trylock fails, so neither txeof nor start will be called.
    3. Lock is released by the deferred start. It actually no longer matters...
    4. If no more packets are to be sent, then all pending packet on drbr will be there forever.

This probably could be fixed by moving txeof to the beginning of the deferred start.

  • Again the packets would sit on drbr forever, if the following situation happened:
    1. Interrupt is coming, the txeof and start are completed, but lock is not released yet.
    2. A set of packets are enqueued to drbr. Then schedule the deferred start task.
    3. The task runs, but failed to get the lock and it is done.
    4. The lock is released by the interrupt handling. Well, it actually no longer matters...
    5. If no more packets are to be sent, then all pending packet on drbr will be there forever.

This is the issue of the trylock scheme itself. One way to fix it probably could be: using interrupt filter for msi-x handler, and scheduling the real msi-x handling to the same taskqueue as the deferred start; according to my experience, it may reduce performance for latency sensitive applications, if you have >2(# of queues) CPUs for 82574 here.

sbruno abandoned this revision.Jan 5 2017, 11:21 PM

This is being superceded by the iflib implementation of e1000.