
TCP: Send full Initial Window when Timestamps are in use
ClosedPublic

Authored by rscheff on Sep 19 2020, 1:33 AM.

Details

Summary

While investigating a semi-related issue, an off-by-one
issue was uncovered when TCP timestamps (RFC 7323) are in
use.

In D18940, all initial window calculations were
consolidated, and the dynamic tcp_maxseg() function was
used to correctly arrive at the congestion window to use.

However, the fast path in tcp_output tries to send out
full segments and avoids sending partial segments by
comparing against the static t_maxseg variable, which
does not take any TCP header options into consideration.

Thus the tcp_output code considers the last full-size
segment (including TCP options) to be too short and
refrains from sending it, resulting in an off-by-one
number of segments sent during the initial window.

Test Plan

Set up a packetdrill script to negotiate the TCP
timestamp option, and then send at least one full
initial window. Prior to the patch, n-1 segments
are sent when timestamps are in use, but n segments
are sent without timestamps. With this fix, n
segments are sent in both cases.
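A sketch of such a packetdrill scenario follows; the addresses, timing, window values, and option fields are illustrative only and not the actual script used in this review:

```
// Hypothetical sketch: negotiate TS on the handshake, write one IW10
// worth of data (10 * 1448 bytes), and expect all 10 segments out.
0    socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0   bind(3, ..., ...) = 0
+0   listen(3, 1) = 0

+0   < S 0:0(0) win 65535 <mss 1460,sackOK,TS val 100 ecr 0,nop,wscale 7>
+0   > S. 0:0(0) ack 1 <...>
+.1  < . 1:1(0) ack 1 win 1024 <nop,nop,TS val 200 ecr 100>
+0   accept(3, ..., ...) = 4

+0   write(4, ..., 14480) = 14480
// Before the fix, only 9 segments (1:13033) appear; the 10th is withheld.
+0   > . 1:1449(1448) ack 1 <...>
// ... further 1448-byte segments ...
+0   > P. 13033:14481(1448) ack 1 <...>
```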

Diff Detail

Repository
rS FreeBSD src repository
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Packetdrill script to validate correct IW behavior. However, the 1 / 4 / 1 / 2 / 2 segment transmit sequence around the end of the loss recovery episode appears odd?

  • adding statistics and validating approach
  • cleaning up branch
  • add "faster" path for majority of (bulk) transmissions

    Verified whether reusing the optlen calculated for the last segment of a bulk tcp_output run was "good enough", as that did address the IW10 off-by-one with timestamps issue.

    Performed a few iperf3 tests across the public internet against various targets. Notable observation: not only in the packetdrill script but also in real traffic, a slight variance in timing could be observed (a delay in sending a segment that would have been valid to send one tcp_output call earlier). In traffic testing this happened on sessions with higher loss rates, and not addressing it could negatively affect TCP performance.

    However, the vast majority of transmissions work with the original "len >= t_maxseg" check, so it is kept as the first conditional, serving as the fast path.

    The "len+optlen" conditional was the initial approach, fixing the IW10 & TS off-by-one observation.

    Furthermore "len+TSoptLen" is doing a trinary check for the two currently supported flags, which would have TCP options always present in the header (2nd conditional to jump to send added in this Diff).

    Finally the "other reason" is a catch-all for all the other unrelated conditionals leading tcp_output to actually transmit a segment.

    Also found that, with -O2, the ternary actually compiles into branchless code (without compiler optimization, the emitted binary has branches). The ternary also improves readability over multiplying a boolean result by a constant, which would always result in branchless assembly but is less readable.

    TCP connection count by state:
      204365  len >= t_maxseg
       22652  other reason to send
          33  len+optlen >= t_maxseg
         773  len+TSoptlen >= t_maxseg

Event: September 2020 Bugathon

sys/netinet/tcp.h
83 ↗(On Diff #77235)

Since this is user visible and PAD is a generic name, collisions can occur. Can you use a more specific name, like TCP_OPTLEN_PAD or so?

92 ↗(On Diff #77235)

What are these whitespace changes about? Could they be done separately?

  • renaming macro to reduce collision risk

Another question: what about RACK and BBR? Do they need a similar change?

In my verification testing, there was no difference in RACK (didn't test BBR) with or without the TS option. Fundamentally, RACK is a packet-oriented stack, while the base stack is octet-oriented, thus some odd interactions at times. The only observation I did make with RACK and IW10 was that it would burst 10 segments, and then effectively send 2 or 3 more new data segments with some delay in between them (TLS)?, before falling back to RTO recovery.

But these t_maxseg vs. tcp_maxseg() side effects would be expected to show up in the base stack.

This revision is now accepted and ready to land. Sep 20 2020, 8:00 PM