Background:
A user has an ongoing problem in 13.1 wherein packets occasionally
appear to get "stuck" between two epair interfaces for hundreds of
milliseconds. This trips responsiveness checks in a system which
requires low latency, and tcpdump was used to verify that an epair is
the culprit.
thj@ and dch@ were able to reproduce the problem with a loop that uses
nc(1) to connect to nginx running in a jail, execute a short GET
request, then terminate the connection. It occasionally takes several
hundred milliseconds for the TCP connection to establish. This is on an
otherwise idle 32-core Epyc system; we're nowhere close to saturating
any hardware resources.
A dtrace script which measures the time elapsed between sched:::on-cpu
and sched:::off-cpu probes for the epair task thread shows the following
distribution (the magnitude of the tail is worse on 13.1 than on main):
value ------------- Distribution ------------- count
256 | 0
512 | 8586
1024 |@ 22289
2048 |@@ 74280
4096 |@@@@@@@@@@@ 427404
8192 |@@@@@@@@@@@@ 454310
16384 |@@@@@ 182130
32768 | 10542
65536 | 16049
131072 |@ 29988
262144 |@@ 57646
524288 |@@@@@@ 226848
1048576 | 43
2097152 | 0
4194304 | 0
8388608 | 1
16777216 | 0
33554432 | 60 <-- waiting for work for over 33ms
67108864 | 1
134217728 | 0

Description:
epair_transmit() does little other than to hand an mbuf chain to a
taskqueue thread. Commit 3dd5760aa5f8 ("if_epair: rework") changed the
handoff to use a pair of lockless buf_ring queues. I believe the idea
there is to try to improve scalability by having producers insert into
one ring while the consumer works on the other. Commit
66acf7685bcd ("if_epair: fix race condition on multi-core systems")
fixed a bug in this scheme which led to lost wakeups, by adding an extra
pair of flags.
I believe that transmitters can fail to wake up the taskqueue thread
even with the bug fix. In particular, it's possible for the queue ridx
to flip twice after a transmitter has set BIT_MBUF_QUEUED and loaded the
current ridx, which I believe can lead to stalled transmission since
epair_tx_start_deferred() only checks one queue at a time. It is also
hard to see whether this scheme is correct on platforms with weakly-ordered
memory operations, i.e., on arm64.
The transmit path also seems rather expensive: each thread has to
execute at least three atomic instructions per packet.
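To make the suspected window concrete, here is a rough model of the
transmit-side handoff as I read it.  This is a sketch reconstructed from
the commits described above, not a copy of if_epair.c: the structure
layout, the ring array, the BIT_QUEUE_TASK name and the flag values are
stand-ins.

#include <sys/param.h>
#include <sys/buf_ring.h>
#include <sys/mbuf.h>
#include <sys/taskqueue.h>
#include <machine/atomic.h>

#define	BIT_MBUF_QUEUED	0	/* bit numbers are placeholders */
#define	BIT_QUEUE_TASK	1

struct epair_queue_model {
	volatile u_int	  ridx;		/* ring that producers should fill */
	struct buf_ring	 *rings[2];	/* pair of lockless rings */
	volatile u_long	  state;	/* wakeup flags */
	struct taskqueue *tq;
	struct task	  tx_task;
};

static int
model_transmit(struct epair_queue_model *q, struct mbuf *m)
{
	u_int ridx;

	/* Announce that an mbuf is (about to be) queued. */
	atomic_set_long(&q->state, 1UL << BIT_MBUF_QUEUED);
	ridx = atomic_load_int(&q->ridx);
	/*
	 * Window: the taskqueue thread may swap q->ridx twice before the
	 * enqueue below lands.  The mbuf then sits in a ring that the next
	 * run of epair_tx_start_deferred() does not drain, since it only
	 * looks at one ring per pass.
	 */
	if (buf_ring_enqueue(q->rings[ridx], m) != 0) {
		m_freem(m);
		return (ENOBUFS);
	}
	/* Schedule the consumer unless someone already did. */
	if (atomic_testandset_long(&q->state, BIT_QUEUE_TASK) == 0)
		taskqueue_enqueue(q->tq, &q->tx_task);
	return (0);
}

If the ridx load above goes stale, nothing is left behind to reschedule
the task for the ring that actually holds the mbuf, which would match the
stalls described above.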
Rather than complicating the transmit path further, deal with this by
using an mbufq and a mutex. The taskqueue thread can dequeue all pending
packets in an O(1) operation, and a simple state machine lets
transmitters avoid waking up the taskqueue thread more often than
necessary.
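As a sketch of what that could look like (the epq_* names, the three
state values and the epq_deliver() helper are placeholders of my own,
not the actual patch; initialization is omitted):

#include <sys/param.h>
#include <sys/lock.h>
#include <sys/mutex.h>
#include <sys/mbuf.h>
#include <sys/taskqueue.h>

#define	EPQ_MAXLEN	512	/* arbitrary queue cap for the sketch */

enum epq_state {
	EPQ_IDLE,	/* no task scheduled; queue presumed empty */
	EPQ_WAKING,	/* a transmitter has scheduled the task */
	EPQ_RUNNING,	/* the task is draining the queue */
};

struct epq {
	struct mtx	  epq_lock;
	struct mbufq	  epq_q;
	enum epq_state	  epq_state;
	struct taskqueue *epq_tq;
	struct task	  epq_task;
};

/* Hypothetical helper: hand a packet to the peer interface (not shown). */
static void	epq_deliver(struct epq *q, struct mbuf *m);

/*
 * Transmit path: one lock round-trip per packet, and the taskqueue is
 * poked only on the IDLE -> WAKING transition.
 */
static int
epq_transmit(struct epq *q, struct mbuf *m)
{
	bool wakeup;
	int error;

	mtx_lock(&q->epq_lock);
	error = mbufq_enqueue(&q->epq_q, m);
	wakeup = (error == 0 && q->epq_state == EPQ_IDLE);
	if (wakeup)
		q->epq_state = EPQ_WAKING;
	mtx_unlock(&q->epq_lock);

	if (error != 0)
		m_freem(m);			/* queue full, drop */
	else if (wakeup)
		taskqueue_enqueue(q->epq_tq, &q->epq_task);
	return (error);
}

/*
 * Taskqueue handler: splice everything queued so far onto a local list,
 * process it without holding the lock, and only go back to IDLE after
 * seeing the queue empty under the lock, so no enqueue is missed.
 */
static void
epq_tx_task(void *arg, int pending __unused)
{
	struct epq *q = arg;
	struct mbufq scratch;
	struct mbuf *m;

	mbufq_init(&scratch, EPQ_MAXLEN);
	for (;;) {
		mtx_lock(&q->epq_lock);
		if (mbufq_len(&q->epq_q) == 0) {
			q->epq_state = EPQ_IDLE;
			mtx_unlock(&q->epq_lock);
			return;
		}
		q->epq_state = EPQ_RUNNING;
		mbufq_concat(&scratch, &q->epq_q);	/* constant-time splice */
		mtx_unlock(&q->epq_lock);

		while ((m = mbufq_dequeue(&scratch)) != NULL)
			epq_deliver(q, m);
	}
}

mbufq_concat() splices the whole pending list onto the local one in
constant time, which is the O(1) dequeue referred to above, and the
transmitter takes the lock once and issues at most one
taskqueue_enqueue() per idle-to-busy transition.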
This yields a much nicer latency distribution:
value ------------- Distribution ------------- count
256 | 0
512 | 4484
1024 | 50911
2048 |@@ 204025
4096 |@@@@@@@@@@ 1351324
8192 |@@@@@@@@@@@@@ 1711604
16384 |@@@@@@ 744126
32768 | 40140
65536 | 51524
131072 |@ 82192
262144 |@ 153183
524288 |@@@@@@@ 896605
1048576 | 5
2097152 | 0

I did some sanity testing with single-stream iperf3 and don't see any
throughput degradation.