Page MenuHomeFreeBSD

ECN and CUBIC do not play well together in very congested networks
ClosedPublic

Authored by rscheff on Jan 24 2020, 9:14 PM.
Tags
None
Referenced Files
F107253649: D23353.diff
Sat, Jan 11, 7:00 PM
F107201886: D23353.diff
Sat, Jan 11, 2:41 PM
Unknown Object (File)
Sun, Dec 22, 7:39 PM
Unknown Object (File)
Sun, Dec 15, 2:46 AM
Unknown Object (File)
Sun, Dec 15, 12:02 AM
Unknown Object (File)
Nov 16 2024, 1:35 AM
Unknown Object (File)
Nov 15 2024, 10:22 PM
Unknown Object (File)
Oct 4 2024, 4:27 PM
Subscribers

Details

Summary

It was observed, that in very congested networks (multiple CE marks
per RTT, for multiple (>10) windows (RTTs), cwnd would collapse down to 1 byte
with cubic.

Furthermore, the stack would start transmitting only using the persist timer,
which has another implication that the newly sent data is erraneously sent out
without the ECT0 codepoint.

Apparently, some (linux?) clients do not process the TCP ECN header flags, unless
the IP ECN codepoint is either ECT0, ECT1 or CE. Thus the slowly sent out new
data with IP ECN codepoint "Not ECT", and TCP ECN header flag "CWR" does not
unlatch the clients ECE flag, resulting in a deadlock where cwnd remains for
lengthy periods (~5 min) at 1 byte, before the receiver eventually clears ECE
and normal cwnd processing can resume.

In short, when cubic cc enters a new round of CC_ECN, the function
cubic_update_ssthresh effectively sets cwnd. But that function does NOT have
a lower bound. Thus cwnd can actually become as small as 1 Byte.

This conditional in tcp_output here:

!((tp->t_flags & TF_FORCEDATA) && len == 1))

should probably prevent RTO- or Persist Timer-triggered transmissions
(window probes (?) etc) from having the ECT0 codepoint set.

I suspect it wouldn't be necessary to fix that conditional though, if
cwnd is prevented from collapsing to less than 2 bytes...

Finally, there was an oversight with a (typecasting in kernel) max
instead of ulmax assignment, where cwnd was also not properly checked
for a lower bound.

Test Plan

See attached packetdrill scripts for a repro of the issue, and verification of the fix

Diff Detail

Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 29055
Build 27010: arc lint + arc unit

Event Timeline

Test script to provoke the issue:

And this script validates that cwnd doesn't collapse below 2 MSS:

Why are you enforcing a lower bond of two MSS instead of one MSS?

https://tools.ietf.org/html/rfc6582 Appendix B gives the rationale for cwnd to have a lower bound of 2 MSS (addresses the potential additional delay due to a running delayed ACK timeout on the receiver side).

Note that during RTO (exteme congestion), the cwnd can still be reduced to 1 MSS.

sys/netinet/cc/cc_cubic.c
371

Don't you want to use max instead of ulmax? snd_cwnd is of type uint32_t.

437

Shouldn't this be uint32_t?

448

If ssthresh is of type uint32_t, no cast is needed.

450

Isn't max here what you want to use instead of ulmax?

use int32_t (kernel) max instead of ulmax

addressed all of Michael's comments

sys/netinet/cc/cc_cubic.c
369–371

Does your test cover this line of change or was the old line reproduced?

sys/netinet/cc/cc_cubic.c
369–371

This line fixes two issues: Reduction of cwnd on continously set ECE flag from the client down to1 byte (probably should have been 1 MSS, but no multiplication by MSS in the checked-in code finally), to have a lower bound of 2 MSS (in line with rfc6582, see appendix B for the rationale) and an overflow if max_cwnd is close to its maximum (INT_MAX/2) and the adjustment by cubic_shift could result in an integer overflow.

The former was unit-tested using packetdrill.

The latter was not actually observed in traces so far, but addressed for completeness and correctness (without explicitly testing for it).

This revision is now accepted and ready to land.Apr 30 2020, 10:06 AM
This revision was automatically updated to reflect the committed changes.