Page MenuHomeFreeBSD

Reduce thread count in driver to reduce out of order transmit processing
AbandonedPublic

Authored by sbruno on Feb 25 2016, 1:03 AM.

Details

Summary

em(4) suffers from threads jumping in and out of context.

em_local_timer() is not rearmed enough and will jump in during normal processing of queues and transmit out of order data.

Reduce thread count by consolidating Multi-queue processing to a combined tx/rx model.

Remove restrictions on 82574 for EM_MULTIQUEUE. With needing only 3 MSI-X interrupts, this should allow other cards to be tested. I have not done this.

Do not defer to em_handle_que() taskqueue during transmit processing, ever.

Test Plan

Need further testing, this is initial review.

My test setup was to have a single host with an 82574L connected to a switch on a seperate network connected to my test host.

[netboot/nfs test machine](83574L) <-> [gig-e switch] <-> [lab server](realtek gig-e or intel gig-e, 82574L).

From the lab server I run a netperf server listening on port 10001 (/usr/local/bin/netserver -D -p 10001)

From the test machine, running netperf and checking the value of netstat -s | grep "duplicate ack" will show result in ever increasing values. In this scenario, this should not happen. Even with the patches contained in this ticket or running without EM_MULTIQUEUE enabled, duplicate acks increase.

Example test output:

# netperf -H 192.168.100.1 -n4 -p10001 -l600 -- -s64K ; netstat -s | grep "duplicate ack"
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.100.1 () port 0 AF_INET : histogram : interval : dirty data : demo
Recv   Send    Send                          
Socket Socket  Message  Elapsed              
Size   Size    Size     Time     Throughput  
bytes  bytes   bytes    secs.    10^6bits/sec  

 65536  65536  65536    600.08    835.92   
                10202 duplicate acks

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint OK
Unit
No Unit Test Coverage
Build Status
Buildable 5878
Build 6165: arc lint + arc unit

Event Timeline

sbruno retitled this revision from to Reduce thread count in driver to reduce out of order transmit processing.
sbruno updated this object.
sbruno edited the test plan for this revision. (Show Details)
sbruno added reviewers: gnn, hiren.
sbruno edited the test plan for this revision. (Show Details)
sbruno edited the test plan for this revision. (Show Details)
sbruno edited the test plan for this revision. (Show Details)
sbruno updated this object.

Fix copy/paste error in em_setup_msix() where it was checking the same condition twice.

Move variable in struct adapter up a line so it doesn't generate a diff.

Remove whitespace and put the tab back.

This causes a kernel panic on i219, simply by assigning an IP and ifup.

This revision now requires changes to proceed.Feb 25 2016, 8:56 PM
sys/dev/e1000/if_em.c
2594–2595

Derp ... I didn't change this to point to the que->que_task nor did I change the taskqueue creation below. I should do that. :-)

sbruno edited edge metadata.

Fixup legacy IRQ handling by assigning taskqueues correctly ... geesh.

sbruno edited edge metadata.

Further consolidate em_handle_que/em_msix_que now that they should always do the
same thing. em_msix_que() updates the irq count for the que and then invokes
em_handle_que, simplifying and reducing code.

Remove a couple of debug printf's that made their way into this review.

gnn edited edge metadata.
sbruno edited edge metadata.

Remove __noinline change to em_xmit() that was inplace for dtrace testing.

erj added inline comments.
sys/dev/e1000/if_em.c
100

Do these get used ever?

sys/dev/e1000/if_em.h
3

Copyright is older in the new version!

sbruno edited edge metadata.

Address review comments.

Updated for comments.

Iterim update.

  • Fix non EM_MULTIQUEUE behavior on non-MSIX devices (tested on i217 laptop)
  • Add sysctl monitor entries for hash counters.
  • Add sysctl monitor for per queue tx and rx counters

I think Sean should get an email about this.

Sorry to everyone else who also gets an email about this.

This comment was removed by sbruno.
This comment was removed by sbruno.

More preview release quality code here.

I've mimic'd what jhb keeps trying to do for igb(4) and always defer tx processing
to the taskqueue, this keeps the transmit queue from sending out of order due to this
driver's design.

Changed a check in em_xmit() to avoid an underflow condition that I caught with dtrace.

Add RSS code now while I'm here. This seems to slow down the performance of my little ATOM
box too much, but might be good on a machine that has higher horsepower.

Adrian:

When I added the RSS bits to em(4) performance dropped well below the non-multiqueue case (20-30%). However, my test machine *was* a small Atom server with 4G ram. How much impact should the speed of the CPU have in this type of case?

John:

I've stolen the code that you keep trying to get to "stick" to igb(4) to *always* defer transmits to the txq_queue taskqueue. This seems to lock down my out of order bits and duplicate ack issues.

Is there some reason that its not quite the same in igb(4) any more? Does it need to be updated in igb(4) as well?

Sean,

How many RSS bits are you dropping it to? What's taking more CPU time?

Sean,

How many RSS bits are you dropping it to? What's taking more CPU time?

Hrm ... this question seems to indicate I don't understand something.

What do you mean "RSS bits" ?

sbruno edited edge metadata.

Fix build on i386

gnn edited edge metadata.
sbruno edited edge metadata.

Fixup compile of non-multiqueue case.

This looks like a couple of patches all put into one patch.

Shall we break out the RSS bits first into a separate patch?

This looks like a couple of patches all put into one patch.

Shall we break out the RSS bits first into a separate patch?

I don't disagree with your request, but I don't know that its super relevant without the multiple queue fixes here.

I really wish that I could "put this review on hold" or something so I can update it and work on it without spamming updates, but I'm kind of using this review as source control for the time being while waiting for my lab to get unpacked/rebuilt.

sbruno edited edge metadata.

em_allocate_msix():

  • wrap EM_MULTIQUEUE specific taskqueue initialization in the proper #ifdef

Trying MTX shenanigans with this review.

Assume that, in the tx path, if you are unable to acquire the TX_LOCK, its ok to just
exit (inspired by comments and code from npn).

Assume that, in the interrupt path (em_handle_que), that if you are unable to acquire the
TX_LOCK or RX_LOCK, its ok to exit.

Assume that, if the TX taskqueue has a pending thread enqueued *after* we enqueue our xmit
data, its ok to exit and allow it to be sent by the already pending thread so as to not
stack up taskqueue threads that don't really need to be instantiated.

Allow the default interrupt rate to be 1/2 of its normal value in the multiqueue case. This
seems to lower the amount of noise I see in my data and makes the machine operate a bit smoother.

Thoughts on this nonsense are welcome.

Cleanup some RX_LOCK handling.

When configuring the MRQC, don't blindly overwrite defaults values.

Commit correct version of MRQC configuration commands.

Rename em_local_timer to em_watchdog.

Add a seperate stats collecting thread now that the watchdog thread isn't firing
constantly.

Update the stats reported to the network stack.

Set CTRL_EXT LS and LS_FLOW as per the 82574 technical document (10.2.25).

Iterate over queues in netmap's unblock tasks function.

sepherosa_gmail.com added inline comments.
sys/dev/e1000/if_em.c
1066–1069

I am not quite sure about the comment; since the packet order is actually maintained by the drbr. The code here is ok; and the usage of ta_pending is interesting ;). BTW, do we need compiler barrier here at least, mainly to prevent ta_pending got loaded before any instructions in drbr_enqueue().

4921

This should be M_HASHTYPE_OPAQUE_HASH on the current.

The try lock scheme you have used will cause two types of issues:

  • Packets would sit on drbr forever, if the following situation happened:
    1. Deferred start's trylock succeeds, but no more TX descs available, it just does txeof, assume lock is not released yet.
    2. When taskqueue is doing 1), interrupt for TXEOF happens, but its trylock fails, so neither txeof nor start will be called.
    3. Lock is released by the deferred start. It actually no longer matters...
    4. If no more packets are to be sent, then all pending packet on drbr will be there forever.

This probably could be fixed by moving txeof to the beginning of the deferred start.

  • Again the packets would sit on drbr forever, if the following situation happened:
    1. Interrupt is coming, the txeof and start are completed, but lock is not released yet.
    2. A set of packets are enqueued to drbr. Then schedule the deferred start task.
    3. The task runs, but failed to get the lock and it is done.
    4. The lock is released by the interrupt handling. Well, it actually no longer matters...
    5. If no more packets are to be sent, then all pending packet on drbr will be there forever.

This is the issue of the trylock scheme itself. One way to fix it probably could be: using interrupt filter for msi-x handler, and scheduling the real msi-x handling to the same taskqueue as the deferred start; according to my experience, it may reduce performance for latency sensitive applications, if you have >2(# of queues) CPUs for 82574 here.

This is being superceded by the iflib implementation of e1000.