Page MenuHomeFreeBSD

tcp: fix erroneous transmission selection after RTO w/ SACK incoming
Needs ReviewPublic

Authored by rscheff on Jan 7 2024, 4:24 PM.
Tags
None
Referenced Files
Unknown Object (File)
Sat, Oct 5, 3:55 AM
Unknown Object (File)
Wed, Oct 2, 6:38 PM
Unknown Object (File)
Wed, Oct 2, 5:43 PM
Unknown Object (File)
Tue, Sep 24, 1:07 AM
Unknown Object (File)
Sat, Sep 21, 3:22 AM
Unknown Object (File)
Thu, Sep 19, 11:28 AM
Unknown Object (File)
Thu, Sep 19, 11:20 AM
Unknown Object (File)
Wed, Sep 18, 11:32 PM
Subscribers

Details

Summary

A subtle bug was present in the base TCP stack, virtually since the
inception of SACK loss recovery. However, for a number of reasons
it doesn't show up easily.

Under normal circumstances, when SACK loss recovery is initiated,
snd_nxt and snd_max track each other. Only during non-SACK loss
recovery, snd_nxt deviates from snd_max.

On RTO, snd_nxt is reset to snd_una. Prior to RFC6675 SACK loss
recovery initialization, SACK loss recovery wouldn't have been
initiated right away with the first incoming SACK ACK.

In addition, cwnd used to be 1 MSS right after RTO, increasing
to 2 MSS more recently.

Furthermore, TSO/LRO typically deliver one ACK covering two or
more segments, thus masking the issue of tcp_output alternating
between retransmitting from snd_nxt (without SACK, thus not
dragging hole->rxmit right), followed by retransmitting
from the SACK hole - resulting in the same data sent twice
until a full ACK without SACK (and an empty scoreboard) is reached.

Address this by setting up snd_recover just in cc_cong_signal.

MFC after: 1 week

Test Plan

Disable TSO & LRO on sender and receiver:

ifconfig <if> -lro -tso

Disable behavior like LRD (Lost Retransmission Detection),
PRR (Proportional Rate Reduction) to demonstrate the issue similar to
what exists in older FreeBSD versions:

sysctl net.inet.tcp.do_prr=0
sysctl net.inet.tcp.do_lrd=0 (or net.inet.tcp.sack.lrd=0)

Start an iperf3 server in the background on the receiver:

iperf3 -s

Start a script to use ipfw to induce sudden, and longer loss periods:

while [ 1 ]; do

  1. induce a little loss for ~10-20 ms, to induce SACK loss recovery ipfw add 10 deny tcp from any to any 5201; ipfw delete 10;
  2. induce a longer loss period ~100ms, to trigger RTO ipfw add 10 deny tcp from any to any 5201; sleep 0.1; ipfw delete 10;
  3. allow all packets for 4000 ms to build up the congestion window sleep 4; done

Finally start a tcpdump on the sender:

tcpdump -i <if> -w <file> -s 128 tcp and port 5201 &

and start traffic:

iperf3 -c <client> -t 45

Finally, stop the trace and inspect it for occurances of SACK blocks
and that every ACK during the loss episode is covering at most one
MSS (typically ~1500 bytes). After the RTO with SACK, it is normal
that cwnd remains at 2 MSS, but the presence of DSACK blocks, or
segments covering the same data twice demonstrates this issue.

packetdrill demonstration of packets getting sent twice:

#cat RTO-while-SACK-LR.pkt
//
// A simple test to verify behavior when
// during SACK LR, an RTO happens - and packets
// are sent twice - once from the sackhole, once
// from the pulled back snd_nxt - D43355
//
--tolerance_usecs=500000

0.0 `
ifconfig tun0 tso
sysctl net.inet.tcp.sack.lrd=0
sysctl net.inet.tcp.do_prr=0
sysctl net.inet.tcp.sack.revised=1
sysctl net.inet.tcp.sack.enable=1
sysctl net.inet.tcp.cc.algorithm=cubic
sysctl net.inet.tcp.hostcache.purgenow=1
ifconfig
ipfw add 200 allow ip from any to any xmit tun0
#tcpdump -i tun0 -w SACK-RTO.trc -p &
sleep 0.05
`
// Create a listening TCP socket.

+0.001 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0.001 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0.001 setsockopt(3, SOL_SOCKET, SO_SNDBUF, [1048576], 4) = 0
+0.001 setsockopt(3, SOL_SOCKET, SO_DEBUG, [1], 4) = 0
+0.005 bind(3, ..., ...) = 0
+0.005 listen(3, 1) = 0

// Establish a TCP connection.
+0.035 < S  0:0(0) win 65535 <mss 1000, sackOK, wscale 9, nop, nop, nop >
+0.000 > S. 0:0(0) ack 1 win 65535 <...>
+0.000 <  . 1:1(0) ack 1 win 65535
+0.000 accept(3, ..., ...) = 4


+0.1 setsockopt(4, SOL_SOCKET, SO_DEBUG, [1], 4) = 0
0.200  write(4, ..., 100000) = 100000

+0.0     >  .     1:1001(1000) ack 1
+0.0     >  .  1001:2001(1000) ack 1
+0.0     >  .  2001:3001(1000) ack 1
+0.0     >  .  3001:4001(1000) ack 1
+0.0     >  .  4001:5001(1000) ack 1
+0.0     >  .  5001:6001(1000) ack 1
+0.0     >  .  6001:7001(1000) ack 1
+0.0     >  .  7001:8001(1000) ack 1
+0.0     >  .  8001:9001(1000) ack 1
+0.0     >  .  9001:10001(1000) ack 1
+0.001     <  .     1:1(0) ack 1001 win 65535  // ACKing every packet
+0.001     <  .     1:1(0) ack 2001 win 65535  // to built up cwnd fast
+0.001     <  .     1:1(0) ack 3001 win 65535
+0.001     <  .     1:1(0) ack 4001 win 65535
+0.001     <  .     1:1(0) ack 5001 win 65535
+0.001     <  .     1:1(0) ack 6001 win 65535
+0.001     <  .     1:1(0) ack 7001 win 65535
+0.001     <  .     1:1(0) ack 8001 win 65535
+0.001     <  .     1:1(0) ack 9001 win 65535
+0.001     <  .     1:1(0) ack 10001 win 65535
+0      >  . 10001:11001(1000) ack 1
+0      >  . 11001:12001(1000) ack 1
+0      >  . 12001:13001(1000) ack 1
+0      >  . 13001:14001(1000) ack 1
+0      >  . 14001:15001(1000) ack 1
+0      >  . 15001:16001(1000) ack 1
+0      >  . 16001:17001(1000) ack 1
+0      >  . 17001:18001(1000) ack 1
+0      >  . 18001:19001(1000) ack 1
+0      >  . 19001:20001(1000) ack 1
+0.001       <  .  1:1(0) ack 11001 win 65535
+0.001       <  .  1:1(0) ack 12001 win 65535
+0.001       <  .  1:1(0) ack 13001 win 65535
+0.001       <  .  1:1(0) ack 14001 win 65535
+0.001       <  .  1:1(0) ack 15001 win 65535
+0.001       <  .  1:1(0) ack 16001 win 65535
+0.001       <  .  1:1(0) ack 17001 win 65535
+0.001       <  .  1:1(0) ack 18001 win 65535
+0.001       <  .  1:1(0) ack 19001 win 65535
+0.001       <  .  1:1(0) ack 20001 win 65535
+0     >  . 20001:21001(1000) ack 1
+0     >  . 21001:22001(1000) ack 1
+0     >  . 22001:23001(1000) ack 1
+0     >  . 23001:24001(1000) ack 1
+0     >  . 24001:25001(1000) ack 1
+0     >  . 25001:26001(1000) ack 1
+0     >  . 26001:27001(1000) ack 1
+0     >  . 27001:28001(1000) ack 1
+0     >  . 28001:29001(1000) ack 1
+0     >  . 29001:30001(1000) ack 1
+0.001       <  .  1:1(0) ack 21001 win 65535
+0.001       <  .  1:1(0) ack 22001 win 65535
+0.001       <  .  1:1(0) ack 23001 win 65535
+0.001       <  .  1:1(0) ack 24001 win 65535
+0.001       <  .  1:1(0) ack 25001 win 65535
+0.001       <  .  1:1(0) ack 26001 win 65535
+0.001       <  .  1:1(0) ack 27001 win 65535
+0.001       <  .  1:1(0) ack 28001 win 65535
+0.001       <  .  1:1(0) ack 29001 win 65535
+0.001       <  .  1:1(0) ack 30001 win 65535
+0     >  . 30001:31001(1000) ack 1  // we "drop" the next 11 packets
+0     >  . 31001:32001(1000) ack 1
+0     >  . 32001:33001(1000) ack 1
+0     >  . 33001:34001(1000) ack 1
+0     >  . 34001:35001(1000) ack 1
+0     >  . 35001:36001(1000) ack 1
+0     >  . 36001:37001(1000) ack 1
+0     >  . 37001:38001(1000) ack 1
+0     >  . 38001:39001(1000) ack 1
+0     >  . 39001:40001(1000) ack 1

+0     >  . 40001:41001(1000) ack 1
+0     >  . 41001:42001(1000) ack 1 // and the next 9
+0     >  . 42001:43001(1000) ack 1
+0     >  . 43001:44001(1000) ack 1
+0     >  . 44001:45001(1000) ack 1
+0     >  . 45001:46001(1000) ack 1
+0     >  . 46001:47001(1000) ack 1
+0     >  . 47001:48001(1000) ack 1
+0     >  . 48001:49001(1000) ack 1
+0     >  . 49001:50001(1000) ack 1

+0.001   <   . 1:1(0) ack 30001 win 65535 <sack 40001:41001, nop, nop>

+0     >  . 50001:51001(1000) ack 1
+0     >  . 51001:52001(1000) ack 1 // and the next 9
+0     >  . 52001:53001(1000) ack 1
+0     >  . 53001:54001(1000) ack 1
+0     >  . 54001:55001(1000) ack 1
+0     >  . 55001:56001(1000) ack 1
+0     >  . 56001:57001(1000) ack 1
+0     >  . 57001:58001(1000) ack 1
+0     >  . 58001:59001(1000) ack 1
+0     >  . 59001:60001(1000) ack 1

+0.001   <   . 1:1(0) ack 30001 win 65535 <sack 50001:51001 40001:41001, nop, nop>

+0     >  . 60001:61001(1000) ack 1
+0     >  . 61001:62001(1000) ack 1 // and the next 9
+0     >  . 62001:63001(1000) ack 1
+0     >  . 63001:64001(1000) ack 1
+0     >  . 64001:65001(1000) ack 1
+0     >  . 65001:66001(1000) ack 1
+0     >  . 66001:67001(1000) ack 1
+0     >  . 67001:68001(1000) ack 1
+0     >  . 68001:69001(1000) ack 1
+0     >  . 69001:70001(1000) ack 1

+0     >  . 70001:71001(1000) ack 1
+0     >  . 71001:72001(1000) ack 1

+0.001   <   . 1:1(0) ack 30001 win 65535 <sack 60001:61001, nop, nop>

+0     >  . 30001:31001(1000) ack 1   // we expect the retransmission around here

+0.001   <   . 1:1(0) ack 30001 win 65535 <sack 60001:62001, nop, nop>
+0.001   <   . 1:1(0) ack 30001 win 65535 <sack 60001:63001, nop, nop>
+0     >  . 31001:32001(1000) ack 1

+0.23~+0.25  >  . 30001:31001(1000) ack 1
+0.46~+0.50  >  . 30001:31001(1000) ack 1

+0.001   <   . 1:1(0) ack 30001 win 65535 <sack 60001:64001, nop, nop>
+0     >  . 30001:31001(1000) ack 1

+0.001   <   . 1:1(0) ack 31001 win 65535 <sack 60001:65001, nop, nop>

+0     >  . 31001:32001(1000) ack 1

+0.001   <   . 1:1(0) ack 32001 win 65535 <sack 60001:66001, nop, nop>
+0     >  . 32001:33001(1000) ack 1
+0     >  . 32001:33001(1000) ack 1  // unexpected double transmission

+0.001   <   . 1:1(0) ack 33001 win 65535 <sack 60001:67001, nop, nop>
+0     >  . 33001:34001(1000) ack 1
+0     >  . 33001:34001(1000) ack 1 // unexpected double transmission

+0.001   <   . 1:1(0) ack 34001 win 65535 <sack 60001:67001, nop, nop>
+0     >  . 34001:35001(1000) ack 1
+0     >  . 34001:35001(1000) ack 1 // unexpected double transmission

+0.001   <   . 1:1(0) ack 35001 win 65535 <sack 60001:67001, nop, nop>
+0     >  . 35001:36001(1000) ack 1
+0     >  . 35001:36001(1000) ack 1 // unexpected double transmission

// Reproducer stops here, remainder is copied over stuff

+0     >  . 31001:32001(1000) ack 1
+0     >  . 32001:33001(1000) ack 1

+0.001   <   . 1:1(0) ack 32001 win 65535 <sack 60001:64001 50001:51001, nop, nop>
+0     >  . 33001:34001(1000) ack 1
+0     >  . 34001:35001(1000) ack 1

+0.001   <   . 1:1(0) ack 33001 win 65535 <sack 60001:64001 50001:51001, nop, nop>
+0     >  . 35001:36001(1000) ack 1
+0     >  . 36001:37001(1000) ack 1

+0.001   <   . 1:1(0) ack 34001 win 65535 <sack 60001:64001 50001:51001, nop, nop>
+0     >  . 37001:38001(1000) ack 1
+0     >  . 38001:39001(1000) ack 1

+0.001   <   . 1:1(0) ack 35001 win 65535 <sack 60001:64001 50001:51001, nop, nop>
+0     >  . 39001:40001(1000) ack 1
+0     > .  41001:42001(1000) ack 1

+0.001   <   . 1:1(0) ack 36001 win 65535 <sack 60001:64001 50001:51001, nop, nop>
+0     > .  42001:43001(1000) ack 1
+0     > .  43001:44001(1000) ack 1

+0.001   <   . 1:1(0) ack 37001 win 65535 <sack 60001:64001 50001:51001, nop, nop>
+0     >  . 44001:45001(1000) ack 1
+0     >  . 45001:46001(1000) ack 1

+0.001   <   . 1:1(0) ack 38001 win 65535 <sack 60001:64001 50001:51001, nop, nop>
+0     >  . 46001:47001(1000) ack 1
+0     >  . 47001:48001(1000) ack 1

+0.001   <   . 1:1(0) ack 39001 win 65535 <sack 60001:64001 50001:51001, nop, nop>
+0     >  . 48001:49001(1000) ack 1
+0     >  . 49001:50001(1000) ack 1

+0.001   <   . 1:1(0) ack 41001 win 65535 <sack 60001:64001 50001:51001, nop, nop>
+0     >  . 51001:52001(1000) ack 1
+0     >  . 52001:53001(1000) ack 1

+0.001   <   . 1:1(0) ack 43001 win 65535 <sack 60001:64001 50001:51001, nop, nop>
+0     >  . 53001:54001(1000) ack 1
+0     >  . 54001:55001(1000) ack 1
+0     >  . 55001:56001(1000) ack 1

+0.001   <   . 1:1(0) ack 44001 win 65535 <sack 60001:64001 50001:51001, nop, nop>
+0     >  . 56001:57001(1000) ack 1
+0     >  . 57001:58001(1000) ack 1

+0.001   <   . 1:1(0) ack 45001 win 65535 <sack 60001:64001 50001:51001, nop, nop>
+0     >  . 58001:59001(1000) ack 1
+0     >  . 59001:60001(1000) ack 1

+0.001   <   . 1:1(0) ack 46001 win 65535 <sack 60001:64001 50001:51001, nop, nop>
+0     >  . 64001:65001(1000) ack 1
+0     >  . 65001:66001(1000) ack 1

+0.001   <   . 1:1(0) ack 47001 win 65535 <sack 60001:64001 50001:51001, nop, nop>
+0     >  . 66001:67001(1000) ack 1
+0     >  . 67001:68001(1000) ack 1

+0.001   <   . 1:1(0) ack 48001 win 65535 <sack 60001:64001 50001:51001, nop, nop>
+0     >  . 68001:69001(1000) ack 1
+0     >  . 69001:70001(1000) ack 1

+0.001   <   . 1:1(0) ack 49001 win 65535 <sack 60001:64001 50001:51001, nop, nop>
+0     >  . 70001:71001(1000) ack 1
+0     >  . 71001:72001(1000) ack 1

+0.001   <   . 1:1(0) ack 51001 win 65535 <sack 60001:64001, nop, nop>
+0.001   <   . 1:1(0) ack 52001 win 65535 <sack 60001:64001, nop, nop>
+0     >  . 72001:73001(1000) ack 1

+0.001   <   . 1:1(0) ack 53001 win 65535 <sack 60001:64001, nop, nop>
+0     >  . 73001:74001(1000) ack 1
+0     >  . 74001:75001(1000) ack 1

+0.001   <   . 1:1(0) ack 54001 win 65535 <sack 60001:64001, nop, nop>
+0     >  . 75001:76001(1000) ack 1
+0     >  . 76001:77001(1000) ack 1

+0.001   <   . 1:1(0) ack 55001 win 65535 <sack 60001:64001, nop, nop>
+0     >  . 77001:78001(1000) ack 1
+0     >  . 78001:79001(1000) ack 1

+0.001   <   . 1:1(0) ack 56001 win 65535 <sack 60001:64001, nop, nop>
+0     >  . 79001:80001(1000) ack 1
+0     >  . 80001:81001(1000) ack 1

+0.001   <   . 1:1(0) ack 57001 win 65535 <sack 60001:64001, nop, nop>
+0     >  . 81001:82001(1000) ack 1
+0     >  . 82001:83001(1000) ack 1

+0.001   <   . 1:1(0) ack 58001 win 65535 <sack 60001:64001, nop, nop>
+0     >  . 83001:84001(1000) ack 1
+0     >  . 84001:85001(1000) ack 1

+0.001   <   . 1:1(0) ack 59001 win 65535 <sack 60001:64001, nop, nop>
+0     >  . 85001:86001(1000) ack 1
+0     >  . 86001:87001(1000) ack 1

+0.001   <   . 1:1(0) ack 64001 win 65535
+0     >  . 87001:88001(1000) ack 1
+0     >  . 88001:89001(1000) ack 1
+0     >  . 89001:90001(1000) ack 1
+0     >  . 90001:91001(1000) ack 1
+0     >  . 91001:92001(1000) ack 1
+0     >  . 92001:93001(1000) ack 1
+0     >  . 93001:94001(1000) ack 1

+0.001   <   . 1:1(0) ack 66001 win 65535
+0     >  . 94001:95001(1000) ack 1
+0     >  . 95001:96001(1000) ack 1
+0     >  . 96001:97001(1000) ack 1
+0     >  . 97001:98001(1000) ack 1

+0.001   <   . 1:1(0) ack 68001 win 65535
+0     >  . 98001:99001(1000) ack 1
+0     > P. 99001:100001(1000) ack 1

+0.001   <   . 1:1(0) ack 70001 win 65535

+0.001   <   . 1:1(0) ack 72001 win 65535

+0.001   <   . 1:1(0) ack 74001 win 65535

+0.001   <   . 1:1(0) ack 76001 win 65535

+0.001   <   . 1:1(0) ack 78001 win 65535

+0.001   <   . 1:1(0) ack 80001 win 65535

+0.001   <   . 1:1(0) ack 82001 win 65535

+0.001   <   . 1:1(0) ack 84001 win 65535

+0.001   <   . 1:1(0) ack 86001 win 65535

+0.001   <   . 1:1(0) ack 88001 win 65535

+0.001   <   . 1:1(0) ack 90001 win 65535

+0.001   <   . 1:1(0) ack 92001 win 65535

+0.001   <   . 1:1(0) ack 94001 win 65535

+0.001   <   . 1:1(0) ack 96001 win 65535

+0.001   <   . 1:1(0) ack 98001 win 65535

+0.001   <   . 1:1(0) ack 100001 win 65535
+0.010 close(4) = 0
+0 > F. 100001:100001(0) ack 1
+0.1 < F. 1:1(0) ack 100002 win 65535

`
sleep 1
killall -SIGINT tcpdump
`

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 59520
Build 56407: arc lint + arc unit

Event Timeline

  • set snd_recover to snd_fack on RTO
  • pull snd_nxt right during SACK after RTO
  • init recover to fack on RTO
  • pull snd_nxt right during SACK after RTO
  • only enter FastRecovery per RFC6675 when on the right edge
  • prevent retransmitting of covered SACK holes when FR activates as snd_nxt closes with snd_max
  • prevent sack-resending at nxt==max. Will still enter FR (and CC reaction)
  • Revert previous commit, as it can lead to SACK accounting panics
  • pull snd_nxt forward when transmitting from SACK hole
  • only enter FR by 6675 when at the right edge, and when after RTO, only when new SACK holes show up
  • update sack_adjust comments

With this patch, the classic (pre-SACK) loss-recovery "indication" of snd_nxt < snd_max as well as the IN_FASTRECOVERY() flags will follow the SACK transmission selection path, if appropriate. The difference being that during IN_FASTRECOVERY, the per-ACK CC reaction (cwnd adjustment) is disabled - except for PRR - while in the case of (snd_nxt < snd_max), the CC controls the congestion window (e.g. uses slow start after the RTO until reaching ssthresh, then using congestion avoidance).

tcp_sack_adjust will also inflate cwnd (what normally - IN_FASTRECOVERY - happens when calculating the pipe).

In addition, the RFC6675 trigger to enter Fast Recovery (an ACK with a SACK option covering at least 2*MSS+1 bytes) is only active while snd_nxt is at the right edge, and requires additional indications of new losses right after a Retransmission Timeout.

In order for normal SACK processing to work correctly, after successfully transmitting a packet, snd_nxt has to be dragged along - eventually becoming equal to snd_max again, thus the additional checks in the RFC6675 FR trigger.

  • when retransmitting after RTO, don't inflate cnwd to nxt-una+mss, but only cwnd+mss
  • when retransmitting from (nxt<max), use cwnd only
  • simplify 6675 FR trigger condition
This revision is now accepted and ready to land.Apr 4 2024, 12:45 PM

Your summary section claims "In addition, cwnd used to be 1 MSS right after RTO, increasing to 2 MSS more recently." But I could not find the code change where cwnd is changed to 2MSS after RTO. Please elaborate if the summary needs to be revised or I am missing the point.

This revision now requires review to proceed.Wed, Sep 18, 8:08 AM