
TCP: Send full Initial Window when Timestamps are in use
ClosedPublic

Authored by rscheff on Sep 19 2020, 1:33 AM.

Details

Summary

While investigating a semi-related issue, an off-by-one
issue was uncovered when TCP timestamps (RFC 7323) are in
use.

In D18940, all initial window calculations were
consolidated, and the dynamic tcp_maxseg() function was
used to correctly arrive at the congestion window to use.

However, the fast path in tcp_output tries to send out
full segments and avoids sending partial segments by
comparing against the static t_maxseg variable, which
does not take any TCP header options into consideration.

Thus the tcp_output code considers the last full-size
segment (including TCP options) to be too short and
refrains from sending it, resulting in an off-by-one
number of segments sent during the initial window.

Test Plan

Set up a packetdrill script to negotiate the TCP
timestamp option, and then send at least one full
initial window. Prior to the patch, n-1 segments
are sent when timestamps are in use, but n segments
are sent without timestamps. With this fix, n
segments are sent in both cases.
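A sketch of such a packetdrill scenario follows; the addresses, timing, window values, and option fields are illustrative only and not the actual script used in this review:

```
// Hypothetical sketch: negotiate TS on the handshake, write one IW10
// worth of data (10 * 1448 bytes), and expect all 10 segments out.
0    socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0   bind(3, ..., ...) = 0
+0   listen(3, 1) = 0

+0   < S 0:0(0) win 65535 <mss 1460,sackOK,TS val 100 ecr 0,nop,wscale 7>
+0   > S. 0:0(0) ack 1 <...>
+.1  < . 1:1(0) ack 1 win 1024 <nop,nop,TS val 200 ecr 100>
+0   accept(3, ..., ...) = 4

+0   write(4, ..., 14480) = 14480
// Before the fix, only 9 segments (1:13033) appear; the 10th is withheld.
+0   > . 1:1449(1448) ack 1 <...>
// ... further 1448-byte segments ...
+0   > P. 13033:14481(1448) ack 1 <...>
```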

Diff Detail

Repository
rS FreeBSD src repository
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Packetdrill script to validate correct IW behavior. However, the 1 / 4 / 1 / 2 / 2 segment transmit sequence around the end of the loss recovery episode appears odd?

  • adding statistics and validating approach
  • cleaning up branch
  • add "faster" path for majority of (bulk) transmissions

    Verified whether reusing the optlen calculated for the last segment of a bulk tcp_output run was "good enough", as that did address the IW10 off-by-one with timestamps issue.

    Performed a few iperf3 tests across the public internet against various targets. Notable observation: not only in the packetdrill script but also in real traffic, a slight variance in timing could be observed (a delay in sending a segment that would have been valid to send one tcp_output call earlier). In traffic testing this happened on sessions with higher loss rates, and not addressing it could negatively affect TCP performance.

    However, the vast majority of transmissions work with the original "len >= t_maxseg" check, so it is kept as the first conditional, serving as the fast path.

    The "len+optlen" conditional was the initial approach, fixing the IW10 & TS off-by-one observation.

    Furthermore "len+TSoptLen" is doing a trinary check for the two currently supported flags, which would have TCP options always present in the header (2nd conditional to jump to send added in this Diff).

    Finally the "other reason" is a catch-all for all the other unrelated conditionals leading tcp_output to actually transmit a segment.

    Also found that, with -O2, the ternary actually compiles into branchless code (without compiler optimization, the emitted binary has branches). The ternary also improves readability over multiplying a boolean result by a constant, which would always result in branchless assembly but is less readable.

    TCP connection count by state:
      204365  len >= t_maxseg
       22652  other reason to send
          33  len+optlen >= t_maxseg
         773  len+TSoptlen >= t_maxseg

Event: September 2020 Bugathon

sys/netinet/tcp.h
83 ↗(On Diff #77235)

Since this is user visible and PAD is a generic name, collisions can occur. Can you use a more specific name, like TCP_OPTLEN_PAD or so?

92 ↗(On Diff #77235)

What are these whitespace changes about? Could they be done separately?

  • renaming macro to reduce collision risk

Another question: what about RACK and BBR? Do they need a similar change?

In my verification testing, there was no difference in RACK (didn't test BBR) with or without the TS option. Fundamentally, RACK is a packet-oriented stack, while the base stack is octet-oriented, thus some odd interactions at times. The only observation I did make with RACK and IW10 was that it would burst 10 segments, and then effectively send 2 or 3 more new data segments with some delay in between them (TLS)?, before falling back to RTO recovery.

But these t_maxseg vs. tcp_maxseg() side effects would be expected to show up in the base stack.

This revision is now accepted and ready to land. Sep 20 2020, 8:00 PM