Page MenuHomeFreeBSD

Send CWR only on new data, as per sec. 6.1.2 of RFC3168
ClosedPublic

Authored by rscheff on Jan 25 2020, 6:15 PM.

Details

Summary

RFC3168 has a SHOULD clause to send CWR only on *new* *data* segments.

Overly conservative data receivers check this condition and ignore the
CWR flag on other packets (pure ACK and Control).

If the FreeBSD data sender sends out a pure ACK before new data, with the CWR
this can result in a race to the bottom, where the data receiver keeps
ECE latched on, and the data sender continously reducing cwnd until some
very low value.

This can be observed in particular on request/response RPC sessions, where
only minimal data is exchanged before waiting (after the TCP ACK) on the
response from the other side.

Test Plan

Will attach packetdrill script to validate that CWR is retained
until new data is being sent out.

For testing that "fresh" window probes are sent with ECT0, while repeat probes are not:

// A simple test that validates the transmission of
// new window probe data with the ECT0 codepoint
// while repeated window probes are sent without ECT

// Flush Hostcache
+0.00 `sysctl net.inet.tcp.ecn.enable=1`
+0.02 `sysctl net.inet.tcp.hostcache.purgenow=1`

// Create a connecting TCP socket.
+0.010 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0.005 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
+0.005 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0.005 setsockopt(3, SOL_SOCKET, SO_SNDBUF, [1048576], 4) = 0
+0     setsockopt(3, SOL_SOCKET, SO_DEBUG, [1], 4) = 0

// Establish a TCP connection
+0.035 connect(3, ..., ...) = -1 // EINPROGRESS

+0 >[noecn] SEW  0:0(0) win 65535 <mss 1460, nop,wscale 6,sackOK,TS val ... ecr 0>
+0 <[noecn] SE. 0:0(0) ack 1 win 65535 <mss 1000>
+0 >[noecn]  .   1:1(0) ack 1

+0.01 getsockopt(3, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
+0.01 fcntl(3, F_SETFL, O_RDWR) = 0   // set back to blocking

// Send some data from server
+0.1 write(3, ..., 5000) = 5000
+0 >[ect0] .       1:1001(1000) ack 1  // we drop this packet
+0 >[ect0] .    1001:2001(1000) ack 1
+0 >[ect0] .    2001:3001(1000) ack 1
+0 >[ect0] .    3001:4001(1000) ack 1
+0 >[ect0] P.   4001:5001(1000) ack 1

// Shrink receive window to zero
+0 <[noecn] .       1:1(0) ack 5001 win 0

// and send more data
+0 write(3, ..., 1000) = 1000

// check for window probe (persist timer)
// as this is new data, ECT0 should be set
+5~+6 >[ect0] . 5001:5002(1) ack 1
//+0 %{ print "{}\t{}".format(tcpi_last_data_recv,tcpi_rcv_space) }%
+0 <[noecn] .       1:1(0) ack 5001 win 0

// check for repeated window probe
+5~+6 >[noecn] . 5001:5002(1) ack 1
+0 <[noecn] .       1:1(0) ack 5002 win 0

//this probe should be ect0 marked again
+5~+6 >[ect0]  . 5002:5003(1) ack 1
+0 <[noecn] .       1:1(0) ack 5003 win 1

// and a regular 1-byte data segment
+0~+6 >[ect0] .     5003:5004(1) ack 1
+0 <[noecn] .       1:1(0) ack 5004 win 10
+0~+6 >[ect0]  .    5004:5014(10) ack 1
+0 <[noecn] .       1:1(0) ack 5014 win 2000
+0 >[ect0]  P.    5014:6001(987) ack 1
+0 <[noecn] .       1:1(0) ack 6001 win 65535

+0.1 write(3, ..., 1) = 1

+0 >[ect0] P.     6001:6002(1) ack 1
+0 <[noecn] .     1:1(0) ack 6002 win 65535

+0.100 close(3) = 0
+0 <[noecn] F. 1:1(0) ack 6002 win 65535
+0 >[noecn] F. 6002:6002(0) ack 1
+0 <[noecn]  . 2:2(0) ack 6003 win 65535
+0 >[noecn] F. 6002:6002(0) ack 2

// Restore default sysctls
`sysctl net.inet.tcp.ecn.enable=2`

And here the script to validate that CWR is only sent on new data, and no premature CC reaction happens after a full ACK:

--tolerance_usecs=250000

// Flush Hostcache
+0.00 `sysctl net.inet.tcp.ecn.enable=1`
//+0.02 `kldload -n cc_cubic`
+0.02 `sysctl net.inet.tcp.cc.algorithm=cubic`
+0.02 `sysctl net.inet.tcp.hostcache.purgenow=1`

// Create a listening TCP socket.
 0.100 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0.005 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0.005 bind(3, ..., ...) = 0
+0.005 listen(3, 1) = 0

// Establish a TCP connection with ECN
+0.035 <[noecn] SEW  0:0(0) win 65535 <mss 1000, sackOK, eol, eol>
+0.000 >[noecn] SE. 0:0(0) ack 1 win 65535 <...>
+0.000 <[noecn]  . 1:1(0) ack 1 win 65535
+0.000 accept(3, ..., ...) = 4

+0 %{ print "\tcwnd\twnd \tssthresh\tNXT" }%
+0 %{ print "\t{}\t{}\t{}\t{}".format(tcpi_snd_cwnd, tcpi_snd_wnd, tcpi_snd_ssthresh, tcpi_snd_nxt) }%


// Write one Initial Window of Date
+0.1 write(4, ..., 10000) = 10000
+0 >[ect0]  .     1:1001(1000) ack 1
+0.01 <[noecn] E. 1:1(0) ack 1001 win 65535
+0 >[ect0]  .  1001:2001(1000) ack 1
+0.01 <[noecn] E. 1:1(0) ack 2001 win 65535
+0 >[ect0]  .  2001:3001(1000) ack 1
+0.01 <[noecn] E. 1:1(0) ack 3001 win 65535
+0 >[ect0]  .  3001:4001(1000) ack 1
+0.01 <[noecn] E. 1:1(0) ack 4001 win 65535
+0 >[ect0]  .  4001:5001(1000) ack 1
+0.01 <[noecn] E. 1:1(0) ack 5001 win 65535
+0 >[ect0]  .  5001:6001(1000) ack 1
+0.01 <[noecn] E. 1:1(0) ack 6001 win 65535
+0 >[ect0]  .  6001:7001(1000) ack 1
+0.01 <[noecn] E. 1:1(0) ack 7001 win 65535
+0 >[ect0]  .  7001:8001(1000) ack 1
+0.01 <[noecn] E. 1:1(0) ack 8001 win 65535
+0 >[ect0]  .  8001:9001(1000) ack 1
+0.01 <[noecn] E. 1:1(0) ack 9001 win 65535
+0 >[ect0] P.  9001:10001(1000) ack 1
+0 %{ print "IW\t{}\t{}\t{}\t{}\t{}".format(tcpi_snd_cwnd, tcpi_snd_wnd, tcpi_snd_ssthresh, tcpi_snd_nxt, tcpi_snd_rexmitpack) }%

+0.01 <[noecn] E. 1:1(0) ack 10001 win 65535
+0 %{ print "ece\t{}\t{}\t{}\t{}".format(tcpi_snd_cwnd, tcpi_snd_wnd, tcpi_snd_ssthresh, tcpi_snd_nxt) }%

+0.01 <[ect0]  .     1:1001(1000) ack 10001 win 65535
+0 %{ print "d1\t{}\t{}\t{}\t{}".format(tcpi_snd_cwnd, tcpi_snd_wnd, tcpi_snd_ssthresh, tcpi_snd_nxt) }%
+0 <[ect0]    PE.  1001:2001(1000) ack 10001 win 65535
+0 %{ print "d2\t{}\t{}\t{}\t{}".format(tcpi_snd_cwnd, tcpi_snd_wnd, tcpi_snd_ssthresh, tcpi_snd_nxt) }%

+0 >[noecn]    . 10001:10001(0) ack 2001
+0 read(4, ..., 2000) = 2000

+0.01 <[ce]    E.  2001:3001(1000) ack 10001 win 65535
+0 <[ect0]     PE.  3001:4001(1000) ack 10001 win 65535

+0 >[noecn]   E. 10001:10001(0) ack 4001
+0 %{ print "3\t{}\t{}\t{}\t{}".format(tcpi_snd_cwnd, tcpi_snd_wnd, tcpi_snd_ssthresh, tcpi_snd_nxt) }%

// Test if ECE becomes unlatched on PureACK with CWR
+0.01 <[noecn] WE. 4001:4001(0) ack 10001 win 65535
+0 >[noecn]   . 10001:10001(0) ack 4001
+0 %{ print "4\t{}\t{}\t{}\t{}".format(tcpi_snd_cwnd, tcpi_snd_wnd, tcpi_snd_ssthresh, tcpi_snd_nxt) }%

+0 <[ect0]  E. 4001:5001(1000) ack 10001 win 65535
+0 <[ect0]  E. 5001:6001(1000) ack 10001 win 65535

// This pure ACK shouldn't have ECE set
+0 >[noecn]   . 10001:10001(0) ack 6001
+0 %{ print "5\t{}\t{}\t{}\t{}".format(tcpi_snd_cwnd, tcpi_snd_wnd, tcpi_snd_ssthresh, tcpi_snd_nxt) }%

// Write a single byte, validate CWR from IW Ack
+0.01 write(4, ..., 1) = 1
+0 >[ect0]  PW. 10001:10002(1) ack 6001
+0 %{ print "6\t{}\t{}\t{}\t{}".format(tcpi_snd_cwnd, tcpi_snd_wnd, tcpi_snd_ssthresh, tcpi_snd_nxt) }%


+0.100 close(4) = 0
+0 <[noecn] F. 6001:6001(0) ack 10002 win 65535
+0 >[noecn] F. 10002:10002(0) ack 6001
+0 <[noecn] . 6002:6002(0) ack 10003 win 65535
+0 >[noecn] F. 10002:10002(0) ack 6002

// Restore default sysctls
`sysctl net.inet.tcp.ecn.enable=2; sysctl net.inet.tcp.cc.algorithm=newreno`

Diff Detail

Repository
rS FreeBSD src repository
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

  • Merge branch 'master' into fix-3168-cwr
sys/netinet/tcp_output.c
1166 ↗(On Diff #67291)

I don't quite understand this conditional; I assume this is to prevent window probes (which may be retransmissions) to be ECT-marked. However, this also affects persist-timer triggered 1-byte data segments which never have been sent before.

While strictly speaking, this is a potential bug (delaying CWR erraneously, running into more than 1 window of ECE and an additional cwnd reduction), the impact should be fairly limited and may not warrant additional checks...

The output in Phabricator GUI looks all messed up, the raw diff looks fine though.

sys/netinet/tcp_output.c
1166 ↗(On Diff #67291)

Perhaps the block comment above needs to be expanded, but that should be in a seperate change, IMHO.

rscheff marked an inline comment as done.
  • fix snd_recover for ECN handling

While testing this patch in detail, I found another issue: as ECN recovery doesn't really keep the snd_una to the left, the current EXIT_RECOVERY code will leave recovery on a full ack when no data was outstanding at the time, the ece was received.

With this change, the CWR will not have been sent, and the other side - when sending additional data with ECE, will cause the code to re-enter ECN recovery, for 2 back-to-back congestion reactions.

sys/netinet/tcp_output.c
1166 ↗(On Diff #67291)

I'm trying to get a commit log from Rui Paulo, or some other validation as to what the intent has been around this condition.

  • add CWR-on-new-data related changes to RACK
  • keep comment in RACK in-sync too

What RFC3168 asks to do is to send out the CWR flag together with the byte at [snd_max+1]. Thus setting snd_recover to snd_max (before CWR could have arrived on the other end) can lead to premature exit of congrecovery - while the other end still has the ECE latched. That would result in 2 back to back congestion control reactions.

On top of that, the (new) data segment carrying the CWR also has the ECT codepoint, and could become CE-marked by the network. So, when the ACK to the data segment with the CWR arrives, and still has the ECE flag set, that indicates it was CE marked, and another congestion control reaction is necessary. Thus the explicit EXIT_CONGRECOVERY here, and the additional check (like later on in the regular processing) if the ACK did cover the snd_recover point....

sys/netinet/tcp_output.c
1166 ↗(On Diff #67291)

So, after some pondering about this: If (for whatever reason) cwnd ends up at zero, and the (persist) timer is clocking out new 1-byte payload segments, those should still be ECT-marked (and allowed to carry a CWR).

I propose to add a check like

((tp->t_flags & TF_FORCEDATA) && len == 1 && SEQ_LT(tp->snd_una, tp->snd_max)

on the basis that zero window probes don't get ACKed until the receiver window can hold them (and then, the next window probe would be a new data segment).

For a normal window probe (when there is additional data to send in the socket), only the first probe would be ECT0 marked; the byte would not get acked, and all subsequent probes would be retransmissions.

  • also mark fresh window probes to be sent with ECT
  • minor update to comments
sys/netinet/tcp_stacks/rack.c
1808 ↗(On Diff #68886)

At this stage, the congestion episode was not yet finished on an ACK covering snd_recover (both loss or ECN cc-reaction).

As per RFC guidance, only a single CC reaction per window is required; Since ECN and loss are tracked by two flags, we may enter (and reduce cwnd) independently.
However, exit is done only when th_ack >= snd_recover - and snd_recover is pulled forward by whatever happened last.

Most problematic may be non-SACK, NewReno recovery, where an ECN after loss could prolong the fast recovery episode (a subsequent loss episode may not result in another cc reaction, if overlapped by an ECN - however, there already were two cc reactions earlier).

1811 ↗(On Diff #68886)

As CWR will be sent only on the next new data segment, the congestion epoch is over at snd_max+1 with ECN, rather than when snd_max is cumulative acked as in loss recovery.

9491 ↗(On Diff #68886)

this is to address the case, where persist window probes are done using new 1-octet data segments.

I would like to see some "area of expertise" comments on this change before I can approve it as a mentor.

  • nicefy addition with whitespaces
  • merge large RACK commit
rscheff added inline comments.
sys/netinet/tcp_stacks/rack.c
13522–13523 ↗(On Diff #71504)

rS360639 had removed the simple check for window probe packets (snd_nxt == snd_max, len==1, but no forward progress).

@rrs what was the motivation for that change, how do you exclude setting ECT0 on window probes in RACK?

  • RACK doesn't use 1-byte window probes (or FORCEDATA) now
This revision is now accepted and ready to land.May 21 2020, 2:16 PM