Page MenuHomeFreeBSD

improvements to support code for RFC6675
ClosedPublic

Authored by rscheff on Dec 20 2018, 7:44 PM.
Tags
None
Referenced Files
Unknown Object (File)
Sat, Jan 18, 8:10 AM
Unknown Object (File)
Fri, Jan 17, 11:42 AM
Unknown Object (File)
Thu, Jan 16, 2:22 PM
Unknown Object (File)
Thu, Jan 16, 12:50 AM
Unknown Object (File)
Sat, Jan 11, 9:01 PM
Unknown Object (File)
Thu, Jan 9, 6:33 PM
Unknown Object (File)
Thu, Jan 9, 6:31 PM
Unknown Object (File)
Thu, Jan 9, 5:48 PM
Subscribers

Details

Summary

Added proper accounting of sacked_bytes and (per-ACK) delivered data.

Previous code was susceptible to improperly calculate these values, when

a) more than 3 (or 4, without TS) holes were to be reported
b) simplistic clients only sending a SACK block covering the last received data segment (simplistic IoT devices)
c) malicious receivers sending duplicate SACK blocks in the ACK to deflate pipe (possibly trigger burst transmissions by the sender).

As net.inet.tcp.rfc6675 controls this particular feature, and is disabled by default, the above issues are not critical.

This patch will allow more aspects of RFC6675 to be implemented. One is when Loss Recovery is invoked, the other is the Rescue Retransmission when TCP is application limited.

Test Plan

base:

10 ## 12 13 ## 15 16 ## 18 19 ## 21 22 ## 24 25 ## 27 28 ## 30
               11                   14    17 20    23 26    29

old 6675-pipe:

10 ## 12 13 ## 15 16 ## 18 19 ## 21 22 ## 24 25 ## 27 28 ## 30
               11
         14,17    20,23    26,29

D18624 6675-pipe:

10 ## 12 13 ## 15 16 ## 18 19 ## 21 22 ## 24 25 ## 27 28 ## 30
               11                                     14    17
         20,23    26,29

First, an example of the non-RFC6675-pipe SACK loss recovery packet sequence:

// Testing base-stack SACK loss recovery
// with only a single SACK option block
// returned from the receiver

--tolerance_usecs=100000

0.0 `sync` // in case of crash
+0.01 `sysctl net.inet.tcp.cc.algorithm=newreno`
+0.01 `sysctl net.inet.tcp.initcwnd_segments=10`
+0.01 `sysctl net.inet.tcp.ecn.enable=0`
+0.01 `sysctl net.inet.tcp.rfc6675_pipe=0`
+0.01 `sysctl net.inet.tcp.hostcache.purgenow=1`

// Create a listening TCP socket.
+0.50 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0.01 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0.01 setsockopt(3, SOL_SOCKET, SO_SNDBUF, [1048576], 4) = 0
+0.01 setsockopt(3, SOL_SOCKET, SO_DEBUG, [1], 4) = 0
+0.01 bind(3, ..., ...) = 0
+0.01 listen(3, 1) = 0

// Establish a TCP connection.
+0.04 < S  0:0(0) win 65535 <mss 1000, sackOK, wscale 10, eol, nop, nop >
+0.00 > S. 0:0(0) ack 1 win 65535 <mss 1460,nop,wscale 6,sackOK,eol,eol>
+0.00 <  . 1:1(0) ack 1 win 65535
+0.00 accept(3, ..., ...) = 4
+0.01 setsockopt(4, SOL_SOCKET, SO_DEBUG, [1], 4) = 0

// ACK initial window, to grow cwnd to sufficient size
1.00 write(4, ..., 30000) = 30000
+0 > .    1:1001(1000) ack 1
+0 < .    1:1(0) ack 1001 win 65535
+0 > . 1001:2001(1000) ack 1
+0 < .    1:1(0) ack 2001 win 65535
+0 > . 2001:3001(1000) ack 1
+0 < .    1:1(0) ack 3001 win 65535
+0 > . 3001:4001(1000) ack 1
+0 < .    1:1(0) ack 4001 win 65535
+0 > . 4001:5001(1000) ack 1
+0 < .    1:1(0) ack 5001 win 65535
+0 > . 5001:6001(1000) ack 1
+0 < .    1:1(0) ack 6001 win 65535
+0 > . 6001:7001(1000) ack 1
+0 < .    1:1(0) ack 7001 win 65535
+0 > . 7001:8001(1000) ack 1
+0 < .    1:1(0) ack 8001 win 65535
+0 > . 8001:9001(1000) ack 1
+0 < .    1:1(0) ack 9001 win 65535
+0 > . 9001:10001(1000) ack 1
+0 < .    1:1(0) ack 10001 win 65535

// in the next window, we simulate the losses
+0 > . 10001:11001(1000) ack 1  // drop
+0 > . 11001:12001(1000) ack 1
+0 > . 12001:13001(1000) ack 1
+0 > . 13001:14001(1000) ack 1  // drop
+0 > . 14001:15001(1000) ack 1
+0 > . 15001:16001(1000) ack 1
+0 > . 16001:17001(1000) ack 1  // drop
+0 > . 17001:18001(1000) ack 1
+0 > . 18001:19001(1000) ack 1
+0 > . 19001:20001(1000) ack 1  // drop
+0 > . 20001:21001(1000) ack 1
+0 > . 21001:22001(1000) ack 1
+0 > . 22001:23001(1000) ack 1  // drop
+0 > . 23001:24001(1000) ack 1
+0 > . 24001:25001(1000) ack 1
+0 > . 25001:26001(1000) ack 1  // drop
+0 > . 26001:27001(1000) ack 1
+0 > . 27001:28001(1000) ack 1
+0 > . 28001:29001(1000) ack 1  // drop
+0 > P. 29001:30001(1000) ack 1

// and now we validate the ordering of the retransmissions
+0.01 < .  1:1(0) ack 10001 win 65535 <sack 11001:12001, nop, nop>
+0.01 < .  1:1(0) ack 10001 win 65535 <sack 11001:13001, nop, nop>
+0.01 < .  1:1(0) ack 10001 win 65535 <sack 14001:15001, nop, nop>
* > . 10001:11001(1000) ack 1
+0.01 < .  1:1(0) ack 10001 win 65535 <sack 14001:16001, nop, nop>
+0.01 < .  1:1(0) ack 10001 win 65535 <sack 17001:18001, nop, nop>
+0.01 < .  1:1(0) ack 10001 win 65535 <sack 17001:19001, nop, nop>
+0.01 < .  1:1(0) ack 10001 win 65535 <sack 20001:21001, nop, nop>
+0.01 < .  1:1(0) ack 10001 win 65535 <sack 20001:22001, nop, nop>
* > . 13001:14001(1000) ack 1
+0.01 < .  1:1(0) ack 10001 win 65535 <sack 23001:24001, nop, nop>
* > . 16001:17001(1000) ack 1
+0.01 < .  1:1(0) ack 10001 win 65535 <sack 23001:25001, nop, nop>
* > . 19001:20001(1000) ack 1
+0.01 < .  1:1(0) ack 10001 win 65535 <sack 26001:27001, nop, nop>
* > . 22001:23001(1000) ack 1
+0.01 < .  1:1(0) ack 10001 win 65535 <sack 26001:28001, nop, nop>
* > . 25001:26001(1000) ack 1
+0.01 < .  1:1(0) ack 10001 win 65535 <sack 29001:30001, nop, nop>
* > . 28001:29001(1000) ack 1
+0.01 < .  1:1(0) ack 13001 win 65535 <sack 29001:30001, nop, nop>
+0.01 < .  1:1(0) ack 16001 win 65535 <sack 29001:30001, nop, nop>
+0.01 < .  1:1(0) ack 19001 win 65535 <sack 29001:30001, nop, nop>
+0.01 < .  1:1(0) ack 22001 win 65535 <sack 29001:30001, nop, nop>
+0.01 < .  1:1(0) ack 25001 win 65535 <sack 29001:30001, nop, nop>
+0.01 < .  1:1(0) ack 28001 win 65535 <sack 29001:30001, nop, nop>
+0.01 < .  1:1(0) ack 30001 win 65535
+1.00 close(4) = 0

`sysctl net.inet.tcp.rfc6675_pipe=0`

Then, the old RFC6675-pipe sequence:

// Testing D18624 RFC6675 pipe SACK loss recovery
// with only a single SACK option block
// returned from the receiver

--tolerance_usecs=100000

0.0 `sync` // in case of crash
+0.01 `sysctl net.inet.tcp.cc.algorithm=newreno`
+0.01 `sysctl net.inet.tcp.initcwnd_segments=10`
+0.01 `sysctl net.inet.tcp.ecn.enable=0`
+0.01 `sysctl net.inet.tcp.rfc6675_pipe=1`
+0.01 `sysctl net.inet.tcp.hostcache.purgenow=1`

// Create a listening TCP socket.
+0.50 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0.01 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0.01 setsockopt(3, SOL_SOCKET, SO_SNDBUF, [1048576], 4) = 0
+0.01 setsockopt(3, SOL_SOCKET, SO_DEBUG, [1], 4) = 0
+0.01 bind(3, ..., ...) = 0
+0.01 listen(3, 1) = 0

// Establish a TCP connection.
+0.04 < S  0:0(0) win 65535 <mss 1000, sackOK, wscale 10, eol, nop, nop >
+0.00 > S. 0:0(0) ack 1 win 65535 <mss 1460,nop,wscale 6,sackOK,eol,eol>
+0.00 <  . 1:1(0) ack 1 win 65535
+0.00 accept(3, ..., ...) = 4
+0.01 setsockopt(4, SOL_SOCKET, SO_DEBUG, [1], 4) = 0

// ACK initial window, to grow cwnd to sufficient size
1.00 write(4, ..., 30000) = 30000
+0 > .    1:1001(1000) ack 1
+0 < .    1:1(0) ack 1001 win 65535
+0 > . 1001:2001(1000) ack 1
+0 < .    1:1(0) ack 2001 win 65535
+0 > . 2001:3001(1000) ack 1
+0 < .    1:1(0) ack 3001 win 65535
+0 > . 3001:4001(1000) ack 1
+0 < .    1:1(0) ack 4001 win 65535
+0 > . 4001:5001(1000) ack 1
+0 < .    1:1(0) ack 5001 win 65535
+0 > . 5001:6001(1000) ack 1
+0 < .    1:1(0) ack 6001 win 65535
+0 > . 6001:7001(1000) ack 1
+0 < .    1:1(0) ack 7001 win 65535
+0 > . 7001:8001(1000) ack 1
+0 < .    1:1(0) ack 8001 win 65535
+0 > . 8001:9001(1000) ack 1
+0 < .    1:1(0) ack 9001 win 65535
+0 > . 9001:10001(1000) ack 1
+0 < .    1:1(0) ack 10001 win 65535

// in the next window, we simulate the losses
+0 > . 10001:11001(1000) ack 1  // drop
+0 > . 11001:12001(1000) ack 1
+0 > . 12001:13001(1000) ack 1
+0 > . 13001:14001(1000) ack 1  // drop
+0 > . 14001:15001(1000) ack 1
+0 > . 15001:16001(1000) ack 1
+0 > . 16001:17001(1000) ack 1  // drop
+0 > . 17001:18001(1000) ack 1
+0 > . 18001:19001(1000) ack 1
+0 > . 19001:20001(1000) ack 1  // drop
+0 > . 20001:21001(1000) ack 1
+0 > . 21001:22001(1000) ack 1
+0 > . 22001:23001(1000) ack 1  // drop
+0 > . 23001:24001(1000) ack 1
+0 > . 24001:25001(1000) ack 1
+0 > . 25001:26001(1000) ack 1  // drop
+0 > . 26001:27001(1000) ack 1
+0 > . 27001:28001(1000) ack 1
+0 > . 28001:29001(1000) ack 1  // drop
+0 > P. 29001:30001(1000) ack 1

// and now we validate the ordering of the retransmissions
+0.01 < .  1:1(0) ack 10001 win 65535 <sack 11001:12001, nop, nop>
+0.01 < .  1:1(0) ack 10001 win 65535 <sack 11001:13001, nop, nop>
+0.01 < .  1:1(0) ack 10001 win 65535 <sack 14001:15001, nop, nop>
* > . 10001:11001(1000) ack 1
+0.01 < .  1:1(0) ack 10001 win 65535 <sack 14001:16001, nop, nop>
+0.01 < .  1:1(0) ack 10001 win 65535 <sack 17001:18001, nop, nop>
+0.01 < .  1:1(0) ack 10001 win 65535 <sack 17001:19001, nop, nop>
+0.01 < .  1:1(0) ack 10001 win 65535 <sack 20001:21001, nop, nop>
+0.01 < .  1:1(0) ack 10001 win 65535 <sack 20001:22001, nop, nop>
+0.01 < .  1:1(0) ack 10001 win 65535 <sack 23001:24001, nop, nop>
+0.01 < .  1:1(0) ack 10001 win 65535 <sack 23001:25001, nop, nop>
+0.01 < .  1:1(0) ack 10001 win 65535 <sack 26001:27001, nop, nop>
+0.01 < .  1:1(0) ack 10001 win 65535 <sack 26001:28001, nop, nop>
+0.01 < .  1:1(0) ack 10001 win 65535 <sack 29001:30001, nop, nop>
+0.01 < .  1:1(0) ack 13001 win 65535 <sack 29001:30001, nop, nop>
* > . 13001:14001(1000) ack 1
* > . 16001:17001(1000) ack 1
+0.01 < .  1:1(0) ack 16001 win 65535 <sack 29001:30001, nop, nop>
* > . 19001:20001(1000) ack 1
* > . 22001:23001(1000) ack 1
+0.01 < .  1:1(0) ack 19001 win 65535 <sack 29001:30001, nop, nop>
* > . 25001:26001(1000) ack 1
* > . 28001:29001(1000) ack 1
+0.01 < .  1:1(0) ack 22001 win 65535 <sack 29001:30001, nop, nop>
+0.01 < .  1:1(0) ack 25001 win 65535 <sack 29001:30001, nop, nop>
+0.01 < .  1:1(0) ack 28001 win 65535 <sack 29001:30001, nop, nop>
+0.01 < .  1:1(0) ack 30001 win 65535

+1.00 close(4) = 0

`sysctl net.inet.tcp.rfc6675_pipe=0`

And finally, the change with this Patch:

// Testing D18624 RFC6675 pipe SACK loss recovery
// with only a single SACK option block
// returned from the receiver

--tolerance_usecs=100000

0.0 `sync` // in case of crash
+0.01 `sysctl net.inet.tcp.cc.algorithm=newreno`
+0.01 `sysctl net.inet.tcp.initcwnd_segments=10`
+0.01 `sysctl net.inet.tcp.ecn.enable=0`
+0.01 `sysctl net.inet.tcp.rfc6675_pipe=1`
+0.01 `sysctl net.inet.tcp.hostcache.purgenow=1`

// Create a listening TCP socket.
+0.50 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0.01 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0.01 setsockopt(3, SOL_SOCKET, SO_SNDBUF, [1048576], 4) = 0
+0.01 setsockopt(3, SOL_SOCKET, SO_DEBUG, [1], 4) = 0
+0.01 bind(3, ..., ...) = 0
+0.01 listen(3, 1) = 0

// Establish a TCP connection.
+0.04 < S  0:0(0) win 65535 <mss 1000, sackOK, wscale 10, eol, nop, nop >
+0.00 > S. 0:0(0) ack 1 win 65535 <mss 1460,nop,wscale 6,sackOK,eol,eol>
+0.00 <  . 1:1(0) ack 1 win 65535
+0.00 accept(3, ..., ...) = 4
+0.01 setsockopt(4, SOL_SOCKET, SO_DEBUG, [1], 4) = 0

// ACK initial window, to grow cwnd to sufficient size
1.00 write(4, ..., 30000) = 30000
+0 > .    1:1001(1000) ack 1
+0 < .    1:1(0) ack 1001 win 65535
+0 > . 1001:2001(1000) ack 1
+0 < .    1:1(0) ack 2001 win 65535
+0 > . 2001:3001(1000) ack 1
+0 < .    1:1(0) ack 3001 win 65535
+0 > . 3001:4001(1000) ack 1
+0 < .    1:1(0) ack 4001 win 65535
+0 > . 4001:5001(1000) ack 1
+0 < .    1:1(0) ack 5001 win 65535
+0 > . 5001:6001(1000) ack 1
+0 < .    1:1(0) ack 6001 win 65535
+0 > . 6001:7001(1000) ack 1
+0 < .    1:1(0) ack 7001 win 65535
+0 > . 7001:8001(1000) ack 1
+0 < .    1:1(0) ack 8001 win 65535
+0 > . 8001:9001(1000) ack 1
+0 < .    1:1(0) ack 9001 win 65535
+0 > . 9001:10001(1000) ack 1
+0 < .    1:1(0) ack 10001 win 65535

// in the next window, we simulate the losses
+0 > . 10001:11001(1000) ack 1  // drop
+0 > . 11001:12001(1000) ack 1
+0 > . 12001:13001(1000) ack 1
+0 > . 13001:14001(1000) ack 1  // drop
+0 > . 14001:15001(1000) ack 1
+0 > . 15001:16001(1000) ack 1
+0 > . 16001:17001(1000) ack 1  // drop
+0 > . 17001:18001(1000) ack 1
+0 > . 18001:19001(1000) ack 1
+0 > . 19001:20001(1000) ack 1  // drop
+0 > . 20001:21001(1000) ack 1
+0 > . 21001:22001(1000) ack 1
+0 > . 22001:23001(1000) ack 1  // drop
+0 > . 23001:24001(1000) ack 1
+0 > . 24001:25001(1000) ack 1
+0 > . 25001:26001(1000) ack 1  // drop
+0 > . 26001:27001(1000) ack 1
+0 > . 27001:28001(1000) ack 1
+0 > . 28001:29001(1000) ack 1  // drop
+0 > P. 29001:30001(1000) ack 1

// and now we validate the ordering of the retransmissions
+0.01 < .  1:1(0) ack 10001 win 65535 <sack 11001:12001, nop, nop>
+0.01 < .  1:1(0) ack 10001 win 65535 <sack 11001:13001, nop, nop>
+0.01 < .  1:1(0) ack 10001 win 65535 <sack 14001:15001, nop, nop>
* > . 10001:11001(1000) ack 1
+0.01 < .  1:1(0) ack 10001 win 65535 <sack 14001:16001, nop, nop>
+0.01 < .  1:1(0) ack 10001 win 65535 <sack 17001:18001, nop, nop>
+0.01 < .  1:1(0) ack 10001 win 65535 <sack 17001:19001, nop, nop>
+0.01 < .  1:1(0) ack 10001 win 65535 <sack 20001:21001, nop, nop>
+0.01 < .  1:1(0) ack 10001 win 65535 <sack 20001:22001, nop, nop>
+0.01 < .  1:1(0) ack 10001 win 65535 <sack 23001:24001, nop, nop>
+0.01 < .  1:1(0) ack 10001 win 65535 <sack 23001:25001, nop, nop>
+0.01 < .  1:1(0) ack 10001 win 65535 <sack 26001:27001, nop, nop>
+0.01 < .  1:1(0) ack 10001 win 65535 <sack 26001:28001, nop, nop>
* > . 13001:14001(1000) ack 1
+0.01 < .  1:1(0) ack 10001 win 65535 <sack 29001:30001, nop, nop>
* > . 16001:17001(1000) ack 1
+0.01 < .  1:1(0) ack 13001 win 65535 <sack 29001:30001, nop, nop>
* > . 19001:20001(1000) ack 1
* > . 22001:23001(1000) ack 1
+0.01 < .  1:1(0) ack 16001 win 65535 <sack 29001:30001, nop, nop>
* > . 25001:26001(1000) ack 1
* > . 28001:29001(1000) ack 1
+0.01 < .  1:1(0) ack 19001 win 65535 <sack 29001:30001, nop, nop>
+0.01 < .  1:1(0) ack 22001 win 65535 <sack 29001:30001, nop, nop>
+0.01 < .  1:1(0) ack 25001 win 65535 <sack 29001:30001, nop, nop>
+0.01 < .  1:1(0) ack 28001 win 65535 <sack 29001:30001, nop, nop>
+0.01 < .  1:1(0) ack 30001 win 65535

+1.00 close(4) = 0

`sysctl net.inet.tcp.rfc6675_pipe=0`

Diff Detail

Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 21682
Build 20969: arc lint + arc unit

Event Timeline

  • whitespace and comment changes
  • fixed accounting error with partial or cumulative acks
  • Another accounting oversight. RFC6675 allows a partial ack with SACK info to immediately enter fast recovery, where this could result in inappropriate tracking of sacked_bytes.

A packetdrill script to verfiy the proper operation of the 6675 pipe calculation.

Scenario: Client can only provide incomplete SACK information (1 block only), and expiriences 2 drops in one window - first one starts loss recovery, and the 2nd drop is not consecutive with the first. Without the proper accounting, this will delay the opening of the cwnd for the retransmissions, or subsequent new transmissions. .Worst case (as demonstrated in this script) is the 2nd drop happening 180° out of phase, in the middle of the window. This will only allow the original retransmission through, but not the 2nd retransmission or any additional data as cwnd won't be opened up.

With this patch, cwnd will grow, as expected, during the 2nd half of the window, considering the two dropped segments. (and with PRR, retransmissions will be "dithered" over the entire window, as necessary.

  • fixing curly bracket if branch, semantically moving check for new sack data to tcp_input
  • fixing curly bracket if branch
  • semantically moving check for new sack data to tcp_input
  • fixed wrong deref to TO struct
sys/netinet/tcp_sack.c
476

brackets in the if clause missing, sack_changed needs to be set too.

572

Move this to tcp_input:2670, as sack_changed should semantically caputer any changes to the scoreboard. Which does happen on partial / cumulative acks.

This looks like it does what is described. As I understand, the use of delivered_data will be covered by a separate review. It looks like this will slightly change the way sacked_bytes is calculated. The change is probably a good thing, but it is worth verifying (and I haven't done this yet) that the updated calculation will work correctly.

In D18624#458099, @jtl wrote:

This looks like it does what is described. As I understand, the use of delivered_data will be covered by a separate review. It looks like this will slightly change the way sacked_bytes is calculated. The change is probably a good thing, but it is worth verifying (and I haven't done this yet) that the updated calculation will work correctly.

@rrs & @lstewart : This has been in review for 6+ months. Can one of you please verify that correcting the way sacked_bytes is calculated will do no harm to CC calculations?

Thanks!

  • Merge branch 'master'
  • whitespace fix in comment
sys/netinet/tcp_input.c
2722

Add comment why this additional check for presence of a SACK block has been added (semantic of variable sack_changed now tracks scoreboard changes due to partial ACKs (without SACK blocks) too, but these should be excluded from increasing the dupthesh counter.

This improvement of the SACK scoreboard accounting is actually what Jason seems to have asked for in D3971 when the RFC6675 pipe calculation was added.

As mentioned verbally, this improvement will address slight differences of the pipe value which can occur due to incomplete tracking of the received segements due to only looking on the currently received SACK blocks in the ACK. Once more than 3 holes are present in a window, the previous RFC6675 pipe calculation would end up with a too large pipe estimate, not sending SACK retransmissions as early as allowed and suffering a higher probability of having to resort to RTO as a last resort for recovering multiple losses.

  • add comment to explain the semantic change to sack_changed

This has been in process for 15 months and should be evolving in tree rather than in a review at this point, IMHO. If no one objects to it being committed in 7 days I am asking that Richard go ahead and commit it.

It's been 6 weeks now (5 weeks after the suggested "speak up now" deadline), still require a transport sign-off, before committing...

rscheff edited the test plan for this revision. (Show Details)
  • Merge branch 'master'
This revision is now accepted and ready to land.Aug 10 2020, 8:52 PM
This revision was automatically updated to reflect the committed changes.