
RFC6582 - prevent cwnd from collapsing down to 1 MSS after exiting recovery
ClosedPublic

Authored by rscheff_gmx.at on Oct 18 2018, 10:03 PM.

Details

Summary

Under adverse conditions during loss recovery:

  • limited client receive window
  • ACK thinning / ACK loss
  • application limited (insufficient data available while in recovery)

the pipe can collapse to very small levels, even down to 0 bytes.

RFC6582 is an adopted Standards Track RFC, updating RFC3782, which
addresses this issue. With this patch, FreeBSD can claim compliance
with the more modern RFC.

(see https://wiki.freebsd.org/TransportProtocols/tcp_rfc_compliance )

Test Plan

Set up a TCP client with a small receive window (compared to the BDP) so that the sender is effectively rwnd-limited; induce one data-packet loss and also thin out the returned (duplicate) ACKs. When the client has delayed ACK enabled (the default), there is a 50:50 chance that traffic resumes only after the delayed-ACK timeout (on the client).

With this patch, the client's delayed-ACK timeout should never gate the restoration of traffic when exiting loss recovery.

Diff Detail

Repository
rS FreeBSD src repository
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

rscheff_gmx.at edited the summary of this revision. Dec 15 2018, 8:40 PM
rscheff_gmx.at updated this revision to Diff 52065.

Minor comment edit
and moving to GIT/Phabricator/ARC workflow

This packetdrill script should complete without error when IW10 and the above patch are applied, for either a SACK or a non-SACK session.

The following script models the timing of the unpatched BSD13 stack, where cwnd collapses to 1 when insufficient ACKs are received during loss recovery.

Thanks for the review request.
I will test this patch in Emulab.net before I give more feedback.

Attached is a tcptrace of a real-world observed issue, where the lack of RFC6582 behavior results in cwnd shrinking down to 1 MSS, followed by a delayed-ACK timeout and congestion-avoidance growth of cwnd (1 MSS per RTT).

Note that only approximately 1/3 to 1/4 of the expected ACKs arrive at the sender.

chengc_netapp.com added a comment (edited). Jan 15 2019, 8:54 PM

I have been testing this patch against a stable/11 build. Over a 1Gb/s link with emulated 40ms RTT and (10^-4) loss rate, I use iperf from a FreeBSD node to send traffic to a 4.15.0-39-generic Ubuntu16.04 client.

40ms link delay with 0.0001 (10^-4) loss rate
ping -c 3 r1
PING r1-link1 (10.1.2.3): 56 data bytes
64 bytes from 10.1.2.3: icmp_seq=0 ttl=64 time=40.001 ms
64 bytes from 10.1.2.3: icmp_seq=1 ttl=64 time=39.882 ms
64 bytes from 10.1.2.3: icmp_seq=2 ttl=64 time=39.939 ms

iperf -c r1 -i 10 -t 60

Client connecting to r1, TCP port 5001

TCP window size: 32.8 KByte (default)

[ 3] local 10.1.2.2 port 33710 connected with 10.1.2.3 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 23.0 MBytes 19.3 Mbits/sec
[ 3] 10.0-20.0 sec 36.4 MBytes 30.5 Mbits/sec
[ 3] 20.0-30.0 sec 49.2 MBytes 41.3 Mbits/sec
[ 3] 30.0-40.0 sec 54.9 MBytes 46.0 Mbits/sec
[ 3] 40.0-50.0 sec 48.6 MBytes 40.8 Mbits/sec
[ 3] 50.0-60.0 sec 32.5 MBytes 27.3 Mbits/sec
[ 3] 0.0-60.1 sec 245 MBytes 34.2 Mbits/sec

Using siftr, I still see the single-MSS cwnd, sometimes with a 40ms delay before cwnd grows to a second MSS. The full cwnd log is attached.

The congestion control in use is newreno.

timestamp cwnd ssthresh
...
1.92838096618652 115052 70875
1.92838382720947 1448 56940 <<< single MSS
1.96786689758301 2896 56940 <<< 40ms delay
1.96786999702454 4344 56940
1.96787786483765 5792 56940
1.96788096427917 7240 56940

For SACK-enabled flows, cwnd gets set to MSS when *entering* the loss recovery (fast retransmission) phase, which I believe is what you are pointing at here (ssthresh is set to 1/2 cwnd at that very same moment). See http://bxr.su/FreeBSD/sys/netinet/tcp_input.c#2604, which is where this happens for a SACK TCP session.

Over the course of loss recovery, cwnd is supposed to grow back to ~ssthresh (which was set to beta * cwnd prior to the congestion event), at a rate of 1 MSS per ACK.

The patch fixes the case where, due to e.g. ACK thinning or ACK loss, fewer than one ACK per segment arrives at the sender, so cwnd has not grown beyond 1 MSS when *exiting* recovery.

This may show up in the siftr trace as a delay of one delayed-ACK timeout (typically ~100 ms) after the ssthresh adjustment, with cwnd growing only very slowly. However, the siftr trace does not show any indication of transmitted segments being delayed by the receiver's delayed-ACK timeout (most inter-packet delays are bursts spaced 1 RTT / 40 ms apart, a few 2 RTT / 80 ms).

I believe setting up a dramatically higher packet-loss probability on the return path (receiver -> sender), so that a high fraction of ACKs (at least 50%) is lost, is necessary to trigger the particular case this patch fixes.

Here is the output of the now-functional siftr, without and with the patch.

Note that due to the near-complete lack of ACKs in the packetdrill script, cwnd never grows and remains at 1 MSS (set to 1000 here for easy human consumption) until the RTO, which really triggers this corner case.

i,0x00000000,1547810107.990995,192.168.0.1,8080,192.0.2.1,12988,1073725440,40001,0,33553920,66000,9,6,4,1000,0,1,608,23,57576,36000,66000,0,36000,0,0,0
i,0x00000000,1547810107.991020,192.168.0.1,8080,192.0.2.1,12988,1073725440,40001,0,33553920,66000,9,6,4,1000,0,1,608,23,57576,36000,66000,0,36000,0,0,0
o,0x00000000,1547810107.991024,192.168.0.1,8080,192.0.2.1,12988,20000,1000,0,33553920,66000,9,6,4,1000,0,1,537920096,23,57576,36000,66000,0,36000,0,0,0
i,0x00000000,1547810107.991071,192.168.0.1,8080,192.0.2.1,12988,20000,1000,0,33553920,66000,9,6,4,1000,0,1,537920096,23,57576,36000,66000,0,36000,0,0,0
i,0x00000000,1547810107.992316,192.168.0.1,8080,192.0.2.1,12988,20000,1000,0,33553920,66000,9,6,4,1000,0,1,537920096,23,57576,46000,66000,0,36000,0,0,0
o,0x00000000,1547810107.992338,192.168.0.1,8080,192.0.2.1,12988,20000,1000,0,33553920,66000,9,6,4,1000,0,1,608,23,57576,10000,66000,0,0,0,0,0 # first segment after recovery (flags!)
o,0x00000000,1547810108.226675,192.168.0.1,8080,192.0.2.1,12988,2000,1000,0,33553920,66000,9,6,4,1000,0,1,8801,26,57576,10000,66000,0,1000,0,0,0 # RTO
i,0x00000000,1547810108.227856,192.168.0.1,8080,192.0.2.1,12988,2000,1000,0,33553920,66000,9,6,6,1000,0,1,8800,26,57576,10000,66000,0,1000,0,0,0

In comparison, with the patch:

i,0x00000000,1547812683.728226,192.168.0.1,8080,192.0.2.1,63512,1073725440,40001,0,33553920,66000,9,6,4,1000,24,1,608,24,57576,36000,66000,0,36000,0,0,0
i,0x00000000,1547812683.728268,192.168.0.1,8080,192.0.2.1,63512,1073725440,40001,0,33553920,66000,9,6,4,1000,24,1,608,24,57576,36000,66000,0,36000,0,0,0
i,0x00000000,1547812683.728288,192.168.0.1,8080,192.0.2.1,63512,1073725440,40001,0,33553920,66000,9,6,4,1000,24,1,608,24,57576,36000,66000,0,36000,0,0,0
o,0x00000000,1547812683.728292,192.168.0.1,8080,192.0.2.1,63512,20000,1000,0,33553920,66000,9,6,4,1000,24,1,537920096,24,57576,36000,66000,0,36000,0,0,0
i,0x00000000,1547812683.728314,192.168.0.1,8080,192.0.2.1,63512,20000,1000,0,33553920,66000,9,6,4,1000,24,1,537920096,24,57576,36000,66000,0,36000,0,0,0
i,0x00000000,1547812683.728912,192.168.0.1,8080,192.0.2.1,63512,20000,1000,0,33553920,66000,9,6,4,1000,24,1,537920096,24,57576,46000,66000,0,36000,0,0,0
o,0x00000000,1547812683.728939,192.168.0.1,8080,192.0.2.1,63512,20000,2000,0,33553920,66000,9,6,4,1000,24,1,608,24,57576,10000,66000,0,0,0,0,0 # first segment after recovery (flags!)
o,0x00000000,1547812683.728947,192.168.0.1,8080,192.0.2.1,63512,20000,2000,0,33553920,66000,9,6,4,1000,24,1,608,24,57576,10000,66000,0,1000,0,0,0 # sending 2nd segment after recovery
i,0x00000000,1547812683.829167,192.168.0.1,8080,192.0.2.1,63512,20000,2000,0,33553920,66000,9,6,4,1000,24,1,608,24,57576,10000,66000,0,2000,0,0,0 # ACK (see packetdrill script)
o,0x00000000,1547812683.829183,192.168.0.1,8080,192.0.2.1,63512,20000,4000,0,33553920,66000,9,6,4,1000,57,1,608,32,57576,8000,66000,0,0,0,0,0 # slow start (cwnd <ssthresh)
o,0x00000000,1547812683.829194,192.168.0.1,8080,192.0.2.1,63512,20000,4000,0,33553920,66000,9,6,4,1000,57,1,608,32,57576,8000,66000,0,1000,0,0,0
o,0x00000000,1547812683.829200,192.168.0.1,8080,192.0.2.1,63512,20000,4000,0,33553920,66000,9,6,4,1000,57,1,608,32,57576,8000,66000,0,2000,0,0,0
o,0x00000000,1547812683.829205,192.168.0.1,8080,192.0.2.1,63512,20000,4000,0,33553920,66000,9,6,4,1000,57,1,608,32,57576,8000,66000,0,3000,0,0,0

  • fixing trailing whitespaces
  • remove siftr patch
  • fixing trailing whitespaces

I remember @hiren and @lstewart tried to analyze and improve this and found some unintended consequences (https://reviews.freebsd.org/D8225), so it got backed out. @lstewart do you remember the details for backing it out?

Looking at D8225, that all seems to be code that runs while in loss recovery. This patch restores a sane minimum cwnd once exiting loss recovery, so I don't see how the two would be directly related.

Looks good. I think Richard can add more detail, as we recently tested this patch.

This revision is now accepted and ready to land. Jan 31 2019, 4:59 PM

Over the last two or three weeks, we have run a large number of performance regression tests including this patch, in particular against workloads with frequent app stalls (no additional data to send for about an RTO interval). That type of workload very often causes bursts to be transmitted, including self-inflicted packet drops.

This patch showed consistent improvements in throughput of approximately 1.2%, independent of the CC algorithm (NewReno or Cubic). For streaming-type, non-bursty workloads, no regressions have been observed in our testing.

  • prepare to land
This revision now requires review to proceed. Feb 5 2019, 7:51 PM
lstewart accepted this revision. Mar 28 2019, 1:58 PM
This revision is now accepted and ready to land. Mar 28 2019, 1:58 PM

Lawrence reviewed this during IETF 104; Michael volunteered to follow up with the full commit process.

This revision was automatically updated to reflect the committed changes.