ECN and CUBIC do not play well together in very congested networks
ClosedPublic
Actions

Authored by rscheff on Jan 24 2020, 9:14 PM.

Details

Reviewers

lstewart
cc
tuexen
rgrimes
rrs
kbowling
jtl
hiren

Group Reviewers

transport

Commits

rS361029: MFC r360491: Introduce a lower bound of 2 MSS to TCP Cubic.
rS360491: Introduce a lower bound of 2 MSS to TCP Cubic.

Summary

It was observed, that in very congested networks (multiple CE marks
per RTT, for multiple (>10) windows (RTTs), cwnd would collapse down to 1 byte
with cubic.

Furthermore, the stack would start transmitting only using the persist timer,
which has another implication that the newly sent data is erraneously sent out
without the ECT0 codepoint.

Apparently, some (linux?) clients do not process the TCP ECN header flags, unless
the IP ECN codepoint is either ECT0, ECT1 or CE. Thus the slowly sent out new
data with IP ECN codepoint "Not ECT", and TCP ECN header flag "CWR" does not
unlatch the clients ECE flag, resulting in a deadlock where cwnd remains for
lengthy periods (~5 min) at 1 byte, before the receiver eventually clears ECE
and normal cwnd processing can resume.

In short, when cubic cc enters a new round of CC_ECN, the function
cubic_update_ssthresh effectively sets cwnd. But that function does NOT have
a lower bound. Thus cwnd can actually become as small as 1 Byte.

This conditional in tcp_output here:

!((tp->t_flags & TF_FORCEDATA) && len == 1))

should probably prevent RTO- or Persist Timer-triggered transmissions
(window probes (?) etc) from having the ECT0 codepoint set.

I suspect it wouldn't be necessary to fix that conditional though, if
cwnd is prevented from collapsing to less than 2 bytes...

Finally, there was an oversight with a (typecasting in kernel) max
instead of ulmax assignment, where cwnd was also not properly checked
for a lower bound.

Test Plan

See attached packetdrill scripts for a repro of the issue, and verification of the fix

Diff Detail

Repository

rS FreeBSD src repository - subversion

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

rscheff created this revision.Jan 24 2020, 9:14 PM

Harbormaster completed remote builds in B28901: Diff 67264.Jan 24 2020, 9:14 PM

Test script to provoke the issue:

test-cwnd-collapse.pkt6 KBDownload

And this script validates that cwnd doesn't collapse below 2 MSS:

test-cwnd-fixed-collapse.pkt5 KBDownload

Why are you enforcing a lower bond of two MSS instead of one MSS?

https://tools.ietf.org/html/rfc6582 Appendix B gives the rationale for cwnd to have a lower bound of 2 MSS (addresses the potential additional delay due to a running delayed ACK timeout on the receiver side).

Note that during RTO (exteme congestion), the cwnd can still be reduced to 1 MSS.

tuexen added inline comments.Jan 28 2020, 8:16 AM

sys/netinet/cc/cc_cubic.c
371 ↗	(On Diff #67264)	Don't you want to use max instead of ulmax? snd_cwnd is of type uint32_t.
437 ↗	(On Diff #67264)	Shouldn't this be uint32_t?
448 ↗	(On Diff #67264)	If ssthresh is of type uint32_t, no cast is needed.
450 ↗	(On Diff #67264)	Isn't max here what you want to use instead of ulmax?

use int32_t (kernel) max instead of ulmax

Harbormaster completed remote builds in B29055: Diff 67532.Jan 30 2020, 8:38 PM

addressed all of Michael's comments

cc added inline comments.Feb 24 2020, 6:33 PM

sys/netinet/cc/cc_cubic.c
369–371 ↗	(On Diff #67532)	Does your test cover this line of change or was the old line reproduced?

Herald added a subscriber: melifaro. · View Herald TranscriptFeb 24 2020, 6:33 PM

rscheff added inline comments.Feb 25 2020, 11:58 AM

sys/netinet/cc/cc_cubic.c
369–371 ↗	(On Diff #67532)	This line fixes two issues: Reduction of cwnd on continously set ECE flag from the client down to1 byte (probably should have been 1 MSS, but no multiplication by MSS in the checked-in code finally), to have a lower bound of 2 MSS (in line with rfc6582, see appendix B for the rationale) and an overflow if max_cwnd is close to its maximum (INT_MAX/2) and the adjustment by cubic_shift could result in an integer overflow. The former was unit-tested using packetdrill. The latter was not actually observed in traces so far, but addressed for completeness and correctness (without explicitly testing for it).