Page MenuHomeFreeBSD

Implement Cubic-specific After-Idle reaction

Authored by rscheff on Jan 25 2019, 1:10 PM.



This patch addresses a very common case of frequent application stalls,
where TCP runs idle and looses the state of the network.

According to RFC5681 section 4.1, the cwnd should be reset to the
Restart Window. However, cwnd will be re-calculated by cubic as soon as
cwnd exceeds ssthresh. Without resetting the cubic epoch, the calculation
is based off the last time an actual congestion event happend - which may
be many tens of thousands of RTTs in the past. This results in excessive
jumps of cwnd or integer overflows. The observable result are huge
traffic burst and self-inflicted drops.

Note: This patch ONLY addresses a very frequent problem case around

Other problematic scenarios revolve around clients signaling a small
receive window, where the session is for a long period limited by that.
Once the client decides to suddenly signal a larger receive window,
the t_cong_last can still be a very long time in the past, resulting
in the same issues as observed before this patch with after-idle sessions.

Test Plan

To invoke the problem case, the session must have expirienced loss (to
pull down ssthresh to a low value). Thereafter, provided there is no
additional data loss but frequent idle periods (sender only sends a small
amount of data, waits for more than RTO interval, and resumes sending),
for each of these episodes the cubic-calculated cwnd grows unbounded.

Once the sender application starts to provide large amounts of data,
a line-rate traffic burst is unavoidable as cwnd may have grown to many
MB or GB in size.

Diff Detail

rS FreeBSD src repository
Automatic diff as part of commit; lint not applicable.
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

rscheff added a subscriber: transport.
212 ↗(On Diff #53189)

Doesn't this require the NewReno module to be loaded? I would write the explicit code here.

rscheff added inline comments.
212 ↗(On Diff #53189)


However NewReno is the fallback for actually all modular CCs (e.g. cc_hd, cc_htcp, cc_dctcp, cc_cdg all have direct calls out to NewReno).

Also see line 141 above.

> suggest an additional change, to put a "MODULE_DEPEND" for cc_newreno at the end of all these cc's?

220 ↗(On Diff #53189)

excessive line

433 ↗(On Diff #53189)

missing newline

For the application and client receive window limited case I would like to gather some feedback on my thoughs around this:

The idea would be to detect the "egde" in CCF_CWND_LIMITED (when this flag changes from not-set to set).

Then check the ticks since t_lastcong with the ticks-equivalent of the RTO timeout - if the last congestion event happend longer in the past, then reset t_last_cong to ticks if the session was not cwnd limited for more than one RTO interval... Reason: during the initial slow start phase, the receiver may not ramp up the receive window quickly enough (but certainly within idle period) and that should not immediately cause t_cong_reset. Similar when the application limited interval only lasts a short while...

Does that approach sound reasonable? Or am I missing something?

212 ↗(On Diff #53189)

No, then it's fine. Just keep it as it is.

Looks good to me. Will let lstewart@ the final decision...

212 ↗(On Diff #53189)

Do we have a change to reset ssthresh to max(ssthresh, 3/4 * cwnd) before the reset of cwnd for Newreno and all?

Like the Linux did here:

At least in Linux, it follows RFC2861 and worked for 14 years since git history can be tracked in Linux-2.6.12-rc2, 2005.

212 ↗(On Diff #53189)

That is an optimization we should discuss. Consider though: if the session was app- or receiver-limited, without any loss, this will pull down ssthresh to 3/4 the prior cwnd. If there was loss earlier,
ssthresh is likely to be already small (possibly smaller than 3/4 cwnd).

So, once the session resumes after idle, it would exit slow-start earlier, and proceed with congestion avoidance from there on, with a lower likelihood of hitting a drop event to cause slow start to stop.

If I read that piece correct, it's applied always when after_idle, so would fit better in newreno_after_idle.

  • update cwnd with w_tf only when it is larger (avoid transmission stalls)
  • dangling else; operation validated w/ packetdrill unit testing

This packetdrill script can reliably reproduce the observed issue.

During IW10, a sufficiently large cwnd (15000) has to be build up for cubic calculations to work in meaningful time, but small enough for the restart window to exceed ssthresh after loss recovery.
A packet loss pulls down ssthresh so the after-idle window will enter congestion avoidance (thus cubic cwnd recalculation).

There is possibly still a small timing dependency though. Without access to a proper server, doing the writes to prime the 8 RTT samples at 1ms intervals is hopefully sufficient to get good enough RTT sample timing.

167 ↗(On Diff #53189)

w_tf is sometimes smaller than the existing cwnd. not updating cwnd (shrinking) to avoid transmission stalls

For sufficiently large values of wmax, the calculation in w_tf suffers
for integer overflows in intermediate calculation steps

rtt = 1
wmax = 831000
smss = 8960

since w_tf
cong int32_t uint64_t
2 668150 668150
3 671124 671124
4 674097 674097
5 677071 677071
6 680045 680045
7 683019 683019
8 685992 685992
9 688966 688966
10 691940 691940
11 694913 694913
12 697887 697887
13 664468 700861
14 667441 703835
15 670415 706808
16 673389 709782
17 676363 712756
18 679336 715729
19 682310 718703
20 685284 721677
21 688257 724651
22 691231 727624
23 694205 730598
24 697179 733572
25 663759 736545
26 666733 739519
27 669707 742493
28 672680 745466
29 675654 748440
30 678628 751414
31 681601 754388
32 684575 757361
33 687549 760335
34 690523 763309
35 693496 766282
36 696470 769256
37 663051 772230
38 666024 775204
39 668998 778177
40 671972 781151
41 674945 784125
42 677919 787098
43 680893 790072
44 683867 793046
45 686840 796020
46 689814 798993
47 692788 801967
48 695761 804941
49 662342 807914
50 665316 810888
51 668289 813862

  • dragging ticks_last_cong along (restricting to INT_MAX)
212 ↗(On Diff #53189)

See D18982

145 ↗(On Diff #53371)

This is unlikely, but not impossible

  • prevent integer overflows of cwnd
  • sys/libkern.h has implicit typecast to (int) for min/max. Need to use ulmin/ulmax for (long)
  • fixing the limit check from D14141
150 ↗(On Diff #53561)

Is "ticks - INT_MAX <= 0" ?

150 ↗(On Diff #53561)

checking the relevant #includes, "ticks" seems to come from sys/kernel.h:71, where it is defined as an int. All other variables in this context are also int.

I assume ticks is somewhere incremented HZ times per second somewhere by ++, without a INT_MAX limits check at 2^31 (e.g. rolls over to INT_MIN).

Splitting off the interger overflow changes into another Diff.

  • removed overflow and boundary checks (see D19118 for details)

Looks good to me. Will let @lstewart make the final decision.

This revision is now accepted and ready to land.Mar 25 2019, 8:30 PM

I learned that the SCE project group is looking into qualification of TCP Cubic; As R. Grimes is part of that group, adding him as a reviewer, to validate the functional changes to Cubic are sound.