
Implement Cubic-specific After-Idle reaction
ClosedPublic

Authored by rscheff on Jan 25 2019, 1:10 PM.

Details

Summary

This patch addresses a very common case of frequent application stalls,
where TCP runs idle and loses its view of the network state.

According to RFC5681 section 4.1, cwnd should be reset to the
Restart Window. However, cwnd will be re-calculated by cubic as soon as
cwnd exceeds ssthresh. Without resetting the cubic epoch, that calculation
is based on the last time an actual congestion event happened - which may
be many tens of thousands of RTTs in the past. This results in excessive
jumps of cwnd or integer overflows. The observable results are huge
traffic bursts and self-inflicted drops.

Note: This patch ONLY addresses a very frequent problem case around
after-idle.

Other problematic scenarios revolve around clients signaling a small
receive window, where the session is limited by that window for a long
period. Once the client suddenly signals a larger receive window,
t_last_cong can still lie very far in the past, resulting in the same
issues as observed with after-idle sessions before this patch.

Test Plan

To invoke the problem case, the session must have experienced loss (to
pull ssthresh down to a low value). Thereafter, provided there is no
additional data loss but frequent idle periods (the sender only sends a
small amount of data, waits for more than the RTO interval, and resumes
sending), the cubic-calculated cwnd grows unbounded with each of these
episodes.

Once the sender application starts to provide large amounts of data,
a line-rate traffic burst is unavoidable as cwnd may have grown to many
MB or GB in size.

Diff Detail

Repository
rS FreeBSD src repository
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

rscheff added a subscriber: transport.
sys/netinet/cc/cc_cubic.c
212 ↗(On Diff #53189)

Doesn't this require the NewReno module to be loaded? I would write the explicit code here.

rscheff added inline comments.
sys/netinet/cc/cc_cubic.c
212 ↗(On Diff #53189)

True.

However, NewReno is the fallback for practically all modular CCs (e.g. cc_hd, cc_htcp, cc_dctcp and cc_cdg all call directly into NewReno).

Also see line 141 above.

Suggestion for an additional change: put a "MODULE_DEPEND" on cc_newreno at the end of all these CCs?

220 ↗(On Diff #53189)

excessive line

433 ↗(On Diff #53189)

missing newline

For the application-limited and client-receive-window-limited case I would like to gather some feedback on my thoughts around this:

The idea would be to detect the "edge" in CCF_CWND_LIMITED (when this flag changes from not-set to set).

Then compare the ticks since t_last_cong against the ticks-equivalent of the RTO timeout: if the last congestion event happened longer ago than that, reset t_last_cong to ticks, but only if the session was not cwnd-limited for more than one RTO interval. Reason: during the initial slow-start phase, the receiver may not ramp up the receive window quickly enough (but certainly within the idle period), and that should not immediately cause a t_last_cong reset. Similarly when the application-limited interval only lasts a short while.

Does that approach sound reasonable? Or am I missing something?

sys/netinet/cc/cc_cubic.c
212 ↗(On Diff #53189)

No, then it's fine. Just keep it as it is.

Looks good to me. Will let lstewart@ make the final decision...

sys/netinet/cc/cc_cubic.c
212 ↗(On Diff #53189)

Do we have a chance to reset ssthresh to max(ssthresh, 3/4 * cwnd) before the reset of cwnd, for NewReno and all the others?

Like Linux does here:
https://github.com/torvalds/linux/blob/master/net/ipv4/tcp_output.c#L137

At least in Linux, this follows RFC2861 and has worked for 14 years, as far back as git history can be traced (Linux-2.6.12-rc2, 2005).

sys/netinet/cc/cc_cubic.c
212 ↗(On Diff #53189)

That is an optimization we should discuss. Consider though: if the session was app- or receiver-limited without any loss, this will pull ssthresh down to 3/4 of the prior cwnd. If there was loss earlier,
ssthresh is likely to be already small (possibly smaller than 3/4 cwnd).

So, once the session resumes after idle, it would exit slow start earlier and proceed with congestion avoidance from there on, with a lower likelihood of hitting a drop event that would stop slow start.

If I read that piece correctly, it is always applied on after-idle, so it would fit better in newreno_after_idle.

  • update cwnd with w_tf only when it is larger (avoid transmission stalls)
  • dangling else; operation validated w/ packetdrill unit testing

This packetdrill script can reliably reproduce the observed issue.

During IW10, a sufficiently large cwnd (15000) has to be built up for the cubic calculations to work in meaningful time, yet small enough for the restart window to exceed ssthresh after loss recovery.
A packet loss pulls ssthresh down so that the after-idle window enters congestion avoidance (and thus cubic cwnd recalculation).

There is possibly still a small timing dependency though. Without access to a proper server, doing the writes to prime the 8 RTT samples at 1ms intervals is hopefully sufficient to get good enough RTT sample timing.

sys/netinet/cc/cc_cubic.c
167 ↗(On Diff #53189)

w_tf is sometimes smaller than the existing cwnd; do not update (shrink) cwnd in that case, to avoid transmission stalls.

For sufficiently large values of wmax, the calculation of w_tf suffers
from integer overflows in intermediate calculation steps:

rtt = 1
wmax = 831000
smss = 8960

w_tf computed with int32_t vs. uint64_t intermediates, by ticks since congestion:
cong int32_t uint64_t
2 668150 668150
3 671124 671124
4 674097 674097
5 677071 677071
6 680045 680045
7 683019 683019
8 685992 685992
9 688966 688966
10 691940 691940
11 694913 694913
12 697887 697887
13 664468 700861
14 667441 703835
15 670415 706808
16 673389 709782
17 676363 712756
18 679336 715729
19 682310 718703
20 685284 721677
21 688257 724651
22 691231 727624
23 694205 730598
24 697179 733572
25 663759 736545
26 666733 739519
27 669707 742493
28 672680 745466
29 675654 748440
30 678628 751414
31 681601 754388
32 684575 757361
33 687549 760335
34 690523 763309
35 693496 766282
36 696470 769256
37 663051 772230
38 666024 775204
39 668998 778177
40 671972 781151
41 674945 784125
42 677919 787098
43 680893 790072
44 683867 793046
45 686840 796020
46 689814 798993
47 692788 801967
48 695761 804941
49 662342 807914
50 665316 810888
51 668289 813862

  • dragging ticks_last_cong along (restricting to INT_MAX)
sys/netinet/cc/cc_cubic.c
212 ↗(On Diff #53189)

See D18982

145 ↗(On Diff #53371)

This is unlikely, but not impossible

  • prevent integer overflows of cwnd
  • sys/libkern.h has implicit typecast to (int) for min/max. Need to use ulmin/ulmax for (long)
  • fixing the limit check from D14141
sys/netinet/cc/cc_cubic.c
150 ↗(On Diff #53561)

Is "ticks - INT_MAX <= 0" ?

sys/netinet/cc/cc_cubic.c
150 ↗(On Diff #53561)

checking the relevant #includes, "ticks" seems to come from sys/kernel.h:71, where it is defined as an int. All other variables in this context are also int.

I assume ticks is incremented HZ times per second somewhere via ++, without an INT_MAX limit check at 2^31 (i.e. it rolls over to INT_MIN).

Splitting off the integer overflow changes into another Diff.

  • removed overflow and boundary checks (see D19118 for details)

Looks good to me. Will let @lstewart make the final decision.

This revision is now accepted and ready to land. Mar 25 2019, 8:30 PM

I learned that the SCE project group is looking into qualification of TCP Cubic; as R. Grimes is part of that group, I am adding him as a reviewer to validate that the functional changes to Cubic are sound.