
RFC6582 - prevent cwnd from collapsing down to 1 MSS after exiting recovery
Closed, Public

Authored by rscheff on Oct 18 2018, 10:03 PM.

Details

Summary

Under adverse conditions during loss recovery:

  • limited client receive window
  • ACK thinning / ACK loss
  • application limited (insufficient data while in recovery)

the pipe (amount of data in flight) can collapse to very small levels, even down to 0 bytes.

RFC6582 is an adopted standards-track RFC, updating RFC3782, that addresses this issue. With this patch, FreeBSD can claim compliance with the more modern RFC

(see https://wiki.freebsd.org/TransportProtocols/tcp_rfc_compliance).
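
For reference, the RFC6582 rule for the full ACK that ends fast recovery is roughly cwnd = min(ssthresh, max(FlightSize, SMSS) + SMSS). A minimal sketch of that calculation follows; the names are placeholders for illustration, not the actual FreeBSD code:

#include <stdint.h>

/*
 * Illustrative sketch of the RFC6582 "full acknowledgment" setting:
 *     cwnd = min(ssthresh, max(FlightSize, SMSS) + SMSS)
 * The max(FlightSize, SMSS) + SMSS term keeps cwnd from deflating
 * below roughly two segments even when almost no data is in flight.
 */
static uint32_t
rfc6582_full_ack_cwnd(uint32_t ssthresh, uint32_t flight_size, uint32_t smss)
{
	uint32_t floor_cwnd = (flight_size > smss ? flight_size : smss) + smss;

	return (floor_cwnd < ssthresh ? floor_cwnd : ssthresh);
}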

Test Plan

Use a TCP client with a small receive window (compared to the BDP), so that the sender is effectively rwnd-limited; induce the loss of one data packet and also thin out the returned (duplicate) ACKs. When the client has delayed ACK enabled (the default), there is a 50:50 chance that traffic resumes only after the delayed-ACK timeout fires (on the client).

With this patch, the client's delayed-ACK timeout should never gate the restoration of traffic when exiting loss recovery.
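
The 50:50 figure follows from standard delayed-ACK behaviour: the receiver ACKs immediately once it holds two unacknowledged full-sized segments, otherwise only when its delayed-ACK timer fires (~100 ms by default). A toy model of that interaction, purely for illustration and not part of the patch:

#include <stdio.h>

/*
 * Toy model (not part of the patch): a delayed-ACK receiver acknowledges
 * as soon as it holds two unacknowledged full-sized segments, otherwise it
 * waits for its delayed-ACK timer (assumed here to be 100 ms). Whether a
 * lone segment draws an immediate ACK therefore depends on the receiver's
 * current parity -- the 50:50 chance mentioned above -- while two sendable
 * segments always draw an immediate ACK.
 */
static int
ms_until_ack(int unacked_at_receiver, int segments_sent)
{
	return ((unacked_at_receiver + segments_sent >= 2) ? 0 : 100);
}

int
main(void)
{
	printf("1 segment, even parity: ~%d ms\n", ms_until_ack(0, 1));
	printf("1 segment, odd parity:  ~%d ms\n", ms_until_ack(1, 1));
	printf("2 segments:             ~%d ms\n", ms_until_ack(0, 2));
	return (0);
}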


Event Timeline

rscheff edited the summary of this revision.

Minor comment edit
and moving to GIT/Phabricator/ARC workflow

This packetdrill script should complete without error when IW10 and the above patch are applied, for either a SACK or a non-SACK session.

The following script models the timing of the unpatched BSD13 stack, where cwnd collapses to 1 MSS when insufficient ACKs are received during loss recovery.

Thanks for the review request.
I will test this patch in Emulab.net before I give more feedback.

Attached is a tcptrace of a real-world observed issue, where the lack of RFC6582 results in cwnd shrinking down to 1 MSS, followed by a delayed-ACK timeout and congestion-avoidance growth of cwnd (1 MSS per RTT).

Note that only approximately 1/3 to 1/4 of the expected ACKs arrive at the sender.

SACK-after-loss-annotated.png (751×1 px, 86 KB)

I have been testing this patch against a stable/11 build. Over a 1Gb/s link with emulated 40ms RTT and (10^-4) loss rate, I use iperf from a FreeBSD node to send traffic to a 4.15.0-39-generic Ubuntu16.04 client.

40ms link delay with 0.0001 (10^-4) loss rate
ping -c 3 r1
PING r1-link1 (10.1.2.3): 56 data bytes
64 bytes from 10.1.2.3: icmp_seq=0 ttl=64 time=40.001 ms
64 bytes from 10.1.2.3: icmp_seq=1 ttl=64 time=39.882 ms
64 bytes from 10.1.2.3: icmp_seq=2 ttl=64 time=39.939 ms

iperf -c r1 -i 10 -t 60

Client connecting to r1, TCP port 5001

TCP window size: 32.8 KByte (default)

[ 3] local 10.1.2.2 port 33710 connected with 10.1.2.3 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.0 sec 23.0 MBytes 19.3 Mbits/sec
[ 3] 10.0-20.0 sec 36.4 MBytes 30.5 Mbits/sec
[ 3] 20.0-30.0 sec 49.2 MBytes 41.3 Mbits/sec
[ 3] 30.0-40.0 sec 54.9 MBytes 46.0 Mbits/sec
[ 3] 40.0-50.0 sec 48.6 MBytes 40.8 Mbits/sec
[ 3] 50.0-60.0 sec 32.5 MBytes 27.3 Mbits/sec
[ 3] 0.0-60.1 sec 245 MBytes 34.2 Mbits/sec

Using siftr, I still see the single-MSS cwnd, and sometimes a 40 ms delay before the second cwnd update. The full cwnd log is attached.

The congestion control in use is newreno.

timestamp cwnd ssthresh
...
1.92838096618652 115052 70875
1.92838382720947 1448 56940 <<< single MSS
1.96786689758301 2896 56940 <<< 40ms delay
1.96786999702454 4344 56940
1.96787786483765 5792 56940
1.96788096427917 7240 56940

In D17614#402480, @chengc_netapp.com wrote:

I have been testing this patch against a stable/11 build. Over a 1Gb/s link with emulated 40ms RTT and (10^-4) loss rate, I use iperf from a FreeBSD node to send traffic to a 4.15.0-39-generic Ubuntu16.04 client.

[...]

Using siftr, I still see the single-MSS cwnd, and sometimes a 40 ms delay before the second cwnd update. The full cwnd log is attached.

The congestion control in use is newreno.

timestamp cwnd ssthresh
...
1.92838096618652 115052 70875
1.92838382720947 1448 56940 <<< single MSS
1.96786689758301 2896 56940 <<< 40ms delay
1.96786999702454 4344 56940
1.96787786483765 5792 56940
1.96788096427917 7240 56940

For SACK-enabled flows, cwnd gets set to 1 MSS when *entering* the loss recovery (fast retransmission) phase, which I believe is what you are pointing out here (ssthresh is set to 1/2 cwnd at that very same moment). See http://bxr.su/FreeBSD/sys/netinet/tcp_input.c#2604, which is where this happens for a SACK TCP session.

Over the course of loss recovery, cwnd is supposed to grow again towards ~ssthresh (which was set to beta * cwnd prior to the congestion event), at a rate of 1 MSS per ACK.

The patch addresses the case where, due to e.g. ACK thinning / ACK loss, fewer than one ACK per segment arrives at the sender, so that cwnd has not grown beyond 1 MSS when *exiting* recovery.
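
A minimal sketch of the exit-of-recovery adjustment being described, assuming a hypothetical post-recovery hook with placeholder names (this is not the committed diff):

#include <stdint.h>

/*
 * Hypothetical post-recovery hook: when leaving loss recovery with little
 * data outstanding ("pipe"), lift cwnd to max(pipe, maxseg) + maxseg
 * instead of leaving it near one segment, but never above ssthresh.
 * Placeholder names, illustrative only.
 */
static void
exit_recovery_deflate(uint32_t *cwnd, uint32_t ssthresh, uint32_t pipe,
    uint32_t maxseg)
{
	if (pipe < ssthresh)
		*cwnd = (pipe > maxseg ? pipe : maxseg) + maxseg;
	else
		*cwnd = ssthresh;
}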

This would show up in the siftr trace as a delay of one delayed-ACK timeout (typically ~100 ms) after the ssthresh adjustment, with cwnd growing only very slowly. However, the siftr trace does not show any indication of transmitted segments being delayed by the receiver's delayed-ACK timeout (most inter-packet delays are bursts and 1 RTT / 40 ms, a few are 2 RTT / 80 ms).

I believe setting up a dramatically higher packet loss probability on the return path (receiver -> sender), so that a high fraction of ACKs (at least 50%) is lost, is necessary to trigger the particular case this patch fixes.

Here is the output of the now-functional siftr, without and with the patch:

Note that, due to the near-complete lack of ACKs in the packetdrill script (chosen to really trigger this corner case), cwnd never grows and remains at 1 MSS (set to 1000 bytes here for easy human consumption) throughout, until the RTO.

i,0x00000000,1547810107.990995,192.168.0.1,8080,192.0.2.1,12988,1073725440,40001,0,33553920,66000,9,6,4,1000,0,1,608,23,57576,36000,66000,0,36000,0,0,0
i,0x00000000,1547810107.991020,192.168.0.1,8080,192.0.2.1,12988,1073725440,40001,0,33553920,66000,9,6,4,1000,0,1,608,23,57576,36000,66000,0,36000,0,0,0
o,0x00000000,1547810107.991024,192.168.0.1,8080,192.0.2.1,12988,20000,1000,0,33553920,66000,9,6,4,1000,0,1,537920096,23,57576,36000,66000,0,36000,0,0,0
i,0x00000000,1547810107.991071,192.168.0.1,8080,192.0.2.1,12988,20000,1000,0,33553920,66000,9,6,4,1000,0,1,537920096,23,57576,36000,66000,0,36000,0,0,0
i,0x00000000,1547810107.992316,192.168.0.1,8080,192.0.2.1,12988,20000,1000,0,33553920,66000,9,6,4,1000,0,1,537920096,23,57576,46000,66000,0,36000,0,0,0
o,0x00000000,1547810107.992338,192.168.0.1,8080,192.0.2.1,12988,20000,1000,0,33553920,66000,9,6,4,1000,0,1,608,23,57576,10000,66000,0,0,0,0,0 # first segment after recovery (flags!)
o,0x00000000,1547810108.226675,192.168.0.1,8080,192.0.2.1,12988,2000,1000,0,33553920,66000,9,6,4,1000,0,1,8801,26,57576,10000,66000,0,1000,0,0,0 # RTO
i,0x00000000,1547810108.227856,192.168.0.1,8080,192.0.2.1,12988,2000,1000,0,33553920,66000,9,6,6,1000,0,1,8800,26,57576,10000,66000,0,1000,0,0,0

In comparison, with the patch:

i,0x00000000,1547812683.728226,192.168.0.1,8080,192.0.2.1,63512,1073725440,40001,0,33553920,66000,9,6,4,1000,24,1,608,24,57576,36000,66000,0,36000,0,0,0
i,0x00000000,1547812683.728268,192.168.0.1,8080,192.0.2.1,63512,1073725440,40001,0,33553920,66000,9,6,4,1000,24,1,608,24,57576,36000,66000,0,36000,0,0,0
i,0x00000000,1547812683.728288,192.168.0.1,8080,192.0.2.1,63512,1073725440,40001,0,33553920,66000,9,6,4,1000,24,1,608,24,57576,36000,66000,0,36000,0,0,0
o,0x00000000,1547812683.728292,192.168.0.1,8080,192.0.2.1,63512,20000,1000,0,33553920,66000,9,6,4,1000,24,1,537920096,24,57576,36000,66000,0,36000,0,0,0
i,0x00000000,1547812683.728314,192.168.0.1,8080,192.0.2.1,63512,20000,1000,0,33553920,66000,9,6,4,1000,24,1,537920096,24,57576,36000,66000,0,36000,0,0,0
i,0x00000000,1547812683.728912,192.168.0.1,8080,192.0.2.1,63512,20000,1000,0,33553920,66000,9,6,4,1000,24,1,537920096,24,57576,46000,66000,0,36000,0,0,0
o,0x00000000,1547812683.728939,192.168.0.1,8080,192.0.2.1,63512,20000,2000,0,33553920,66000,9,6,4,1000,24,1,608,24,57576,10000,66000,0,0,0,0,0 # first segment after recovery (flags!)
o,0x00000000,1547812683.728947,192.168.0.1,8080,192.0.2.1,63512,20000,2000,0,33553920,66000,9,6,4,1000,24,1,608,24,57576,10000,66000,0,1000,0,0,0 # sending 2nd segment after recovery
i,0x00000000,1547812683.829167,192.168.0.1,8080,192.0.2.1,63512,20000,2000,0,33553920,66000,9,6,4,1000,24,1,608,24,57576,10000,66000,0,2000,0,0,0 # ACK (see packetdrill script)
o,0x00000000,1547812683.829183,192.168.0.1,8080,192.0.2.1,63512,20000,4000,0,33553920,66000,9,6,4,1000,57,1,608,32,57576,8000,66000,0,0,0,0,0 # slow start (cwnd <ssthresh)
o,0x00000000,1547812683.829194,192.168.0.1,8080,192.0.2.1,63512,20000,4000,0,33553920,66000,9,6,4,1000,57,1,608,32,57576,8000,66000,0,1000,0,0,0
o,0x00000000,1547812683.829200,192.168.0.1,8080,192.0.2.1,63512,20000,4000,0,33553920,66000,9,6,4,1000,57,1,608,32,57576,8000,66000,0,2000,0,0,0
o,0x00000000,1547812683.829205,192.168.0.1,8080,192.0.2.1,63512,20000,4000,0,33553920,66000,9,6,4,1000,57,1,608,32,57576,8000,66000,0,3000,0,0,0

  • fixing trailing whitespace
  • fixing trailing whitespace

I remember we tried to analyze and improve this between @hiren and @lstewart (https://reviews.freebsd.org/D8225), but some unintended consequences were found, so it got backed out. @lstewart, do you remember the details of why it was backed out?

Looking at D8225, that all seems to be code that runs while in loss recovery. This patch restores a sane minimum cwnd when exiting loss recovery, so I don't see how the two would be directly related.

Looks good. I think Richard can provide more details, as we recently tested this patch.

This revision is now accepted and ready to land. Jan 31 2019, 4:59 PM

Over the last two or three weeks, we have run a large number of performance regression tests including this patch, in particular against workloads with frequent app stalls (no additional data to send for about an RTO interval). That type of workload very often causes bursts to be transmitted, including self-inflicted packet drops.

This patch showed consistent improvements in throughput of approximately 1.2%, independent of the CC algorithm (NewReno or Cubic). For streaming-type workloads / non-bursty traffic, no regressions have been observed in our testing.

This revision now requires review to proceed. Feb 5 2019, 7:51 PM
This revision is now accepted and ready to land. Mar 28 2019, 1:58 PM

Lawrence reviewed this during IETF 104; Michael volunteered to follow up with the full commit process.

This revision was automatically updated to reflect the committed changes.