tcp: fix erroneous transmission selection after RTO w/ SACK incoming

Authored by rscheff on Jan 7 2024, 4:24 PM.

A subtle bug was present in the base TCP stack, virtually since the
inception of SACK loss recovery. However, for a number of reasons
it doesn't show up easily.

Under normal circumstances, when SACK loss recovery is initiated,
snd_nxt and snd_max track each other. Only during non-SACK loss
recovery does snd_nxt deviate from snd_max.

On RTO, snd_nxt is reset to snd_una. Before the RFC6675 trigger for
SACK loss recovery existed, SACK loss recovery would not have been
initiated right away by the first incoming ACK carrying SACK blocks.

In addition, cwnd used to be 1 MSS right after RTO, increasing
to 2 MSS more recently.

Furthermore, with TSO/LRO one ACK typically covers two or more
segments, which masks the issue. Without that masking, tcp_output
alternates between retransmitting from snd_nxt (the non-SACK path,
which does not drag hole->rxmit to the right) and retransmitting
from the SACK hole. The result is that the same data is sent twice,
until a full ACK without SACK (and an empty scoreboard) is reached.
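
The alternation described above can be illustrated with a toy model. This
is only a sketch and not the kernel code: struct toy_sender and
toy_count_duplicates() are hypothetical names, and the model assumes a
single hole, fixed-size segments and a strict alternation between the two
transmit paths. It merely shows why two cursors that are not dragged along
with each other end up sending the same bytes twice.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model (assumption, not kernel code): two cursors into one hole. */
struct toy_sender {
	uint32_t snd_nxt;	/* next sequence used by the non-SACK path */
	uint32_t hole_rxmit;	/* next sequence used by the SACK-hole path */
};

/*
 * Alternate between the two transmit paths for `rounds` transmissions of
 * `mss` bytes each, and return how many of them carried data that had
 * already been sent since the RTO.
 *
 * sync == false: each path advances only its own cursor.
 * sync == true:  after every transmission both cursors are dragged to the
 *                highest sequence sent so far.
 */
unsigned
toy_count_duplicates(uint32_t snd_una, uint32_t mss, unsigned rounds, bool sync)
{
	struct toy_sender s = { snd_una, snd_una };
	uint32_t highest_sent = snd_una;	/* right edge sent since RTO */
	unsigned dups = 0;

	for (unsigned i = 0; i < rounds; i++) {
		/* Even rounds use snd_nxt, odd rounds use the SACK hole. */
		uint32_t *cur = (i % 2 == 0) ? &s.snd_nxt : &s.hole_rxmit;
		uint32_t start = *cur;

		/* Wraparound-safe "start < highest_sent" comparison. */
		if ((int32_t)(start - highest_sent) < 0)
			dups++;

		*cur = start + mss;
		if ((int32_t)(highest_sent - *cur) < 0)
			highest_sent = *cur;

		if (sync) {
			s.snd_nxt = highest_sent;
			s.hole_rxmit = highest_sent;
		}
	}
	return (dups);
}
```

In this model, toy_count_duplicates(1000, 100, 4, false) returns 2, since
every second transmission repeats the previous one, while the same call
with sync set to true returns 0.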

Address this by setting up snd_recover just in cc_cong_signal.

MFC after: 1 week

Test Plan

Disable TSO & LRO on sender and receiver:

ifconfig <if> -lro -tso

Disable features such as LRD (Lost Retransmission Detection) and
PRR (Proportional Rate Reduction), to demonstrate the issue as it
exists in older FreeBSD versions:

sysctl net.inet.tcp.do_prr=0
sysctl net.inet.tcp.do_lrd=0 (or net.inet.tcp.sack.lrd=0)

Start an iperf3 server in the background on the receiver:

iperf3 -s

Start a script that uses ipfw to induce a short loss period followed by a longer one:

while [ 1 ]; do
  # 1. induce a little loss for ~10-20 ms, to induce SACK loss recovery
  ipfw add 10 deny tcp from any to any 5201; ipfw delete 10;
  # 2. induce a longer loss period ~100ms, to trigger RTO
  ipfw add 10 deny tcp from any to any 5201; sleep 0.1; ipfw delete 10;
  # 3. allow all packets for 4000 ms to build up the congestion window
  sleep 4;
done

Finally start a tcpdump on the sender:

tcpdump -i <if> -w <file> -s 128 tcp and port 5201 &

and start traffic:

iperf3 -c <receiver> -t 45

Finally, stop the trace and inspect it for occurrences of SACK blocks,
and check that every ACK during the loss episode covers at most one
MSS (typically ~1500 bytes). After the RTO with SACK, it is normal
that cwnd remains at 2 MSS, but the presence of DSACK blocks, or of
segments covering the same data twice, demonstrates this issue.

Diff Detail

rG FreeBSD src repository
Lint Passed
No Test Coverage
Build Status
Buildable 55284
Build 52173: arc lint + arc unit

Event Timeline

  • set snd_recover to snd_fack on RTO
  • pull snd_nxt right during SACK after RTO
  • init recover to fack on RTO
  • pull snd_nxt right during SACK after RTO
  • only enter FastRecovery per RFC6675 when on the right edge
  • prevent retransmitting of covered SACK holes when FR activates as snd_nxt closes with snd_max
  • prevent sack-resending at nxt==max. Will still enter FR (and CC reaction)
  • Revert previous commit, as it can lead to SACK accounting panics
  • pull snd_nxt forward when transmitting from SACK hole
  • only enter FR by 6675 when at the right edge, and when after RTO, only when new SACK holes show up
  • update sack_adjust comments

With this patch, both the classic (pre-SACK) loss-recovery "indication" of snd_nxt < snd_max and the IN_FASTRECOVERY() flags will follow the SACK transmission selection path, if appropriate. The difference is that during IN_FASTRECOVERY the per-ACK CC reaction (cwnd adjustment) is disabled, except for PRR, while in the case of (snd_nxt < snd_max) the CC controls the congestion window (e.g. it uses slow start after the RTO until reaching ssthresh, then congestion avoidance).
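
The difference between the two cases can be sketched as a per-ACK cwnd update. This is a minimal illustration using the textbook RFC 5681 increments, not the code of any FreeBSD CC module; toy_cwnd_on_ack() and its parameters are hypothetical, and PRR is deliberately left out.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical per-ACK cwnd update (RFC 5681 style increments).
 * Returns the new cwnd in bytes.
 */
uint32_t
toy_cwnd_on_ack(uint32_t cwnd, uint32_t ssthresh, uint32_t mss,
    uint32_t acked, bool in_fast_recovery)
{
	uint32_t incr;

	/* During fast recovery the per-ACK reaction is disabled (no PRR here). */
	if (in_fast_recovery)
		return (cwnd);

	/* Guard against a division by zero further down. */
	if (cwnd == 0)
		return (mss);

	/* Slow start: grow by the newly acked bytes, at most one MSS per ACK. */
	if (cwnd < ssthresh)
		return (cwnd + (acked < mss ? acked : mss));

	/* Congestion avoidance: roughly one MSS per round trip. */
	incr = (uint32_t)(((uint64_t)mss * mss) / cwnd);
	return (cwnd + (incr > 0 ? incr : 1));
}
```

With an MSS of 1460 bytes and ssthresh at 14600 bytes, a cwnd of 2 MSS after the RTO grows by a full MSS per ACK in slow start, whereas a cwnd of 14600 bytes grows by only 146 bytes per ACK in congestion avoidance.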

tcp_sack_adjust will also inflate cwnd, which normally (while IN_FASTRECOVERY) happens when calculating the pipe.

In addition, the RFC6675 trigger to enter Fast Recovery (an ACK with a SACK option covering at least 2*MSS+1 bytes) is only active while snd_nxt is at the right edge, and requires additional indications of new losses right after a Retransmission Timeout.

In order for normal SACK processing to work correctly, after successfully transmitting a packet, snd_nxt has to be dragged along - eventually becoming equal to snd_max again, thus the additional checks in the RFC6675 FR trigger.
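
The trigger conditions described above can be summarized as a small predicate. This is a sketch under stated assumptions, not the code from the patch: struct fr_inputs, its fields and may_enter_fast_recovery() are hypothetical names, and the "additional indications of new losses" after an RTO are modeled as a single flag saying that the ACK revealed a new SACK hole.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of the state this sketch looks at. */
struct fr_inputs {
	uint32_t snd_nxt;	/* next sequence to send */
	uint32_t snd_max;	/* highest sequence sent so far */
	uint32_t mss;		/* maximum segment size in bytes */
	uint32_t sacked_bytes;	/* bytes covered by SACK blocks */
	bool after_rto;		/* still recovering from an RTO */
	bool new_sack_hole;	/* this ACK revealed a new SACK hole */
};

/* Return true if the RFC6675 style trigger may start Fast Recovery. */
bool
may_enter_fast_recovery(const struct fr_inputs *in)
{
	/* The SACK option has to cover at least 2*MSS+1 bytes. */
	if (in->sacked_bytes < 2 * in->mss + 1)
		return (false);

	/* Only active while snd_nxt is at the right edge. */
	if (in->snd_nxt != in->snd_max)
		return (false);

	/* Right after an RTO, require an indication of new losses. */
	if (in->after_rto && !in->new_sack_hole)
		return (false);

	return (true);
}
```

The snd_nxt == snd_max check is the reason snd_nxt has to be dragged along after each successful transmission: until it catches up with snd_max, this predicate stays false.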

  • when retransmitting after RTO, don't inflate cwnd to nxt-una+mss, but only cwnd+mss
  • when retransmitting from (nxt<max), use cwnd only
  • simplify 6675 FR trigger condition
This revision is now accepted and ready to land. Apr 4 2024, 12:45 PM