Phase 2 to add Proportional Rate Reduction (RFC6937) to FreeBSD
Needs Review · Public

Authored by rscheff_gmx.at on Jan 18 2019, 4:45 PM.

Details

Reviewers
hiren
Group Reviewers
transport
Summary

Using Aris Angelogiannopoulos' patch from 2013 as a base, and
making use of the improved, single-pass accounting in the
SACK scoreboard (D18624). Also switched from sack_newdata
to snd_recover (D18811).

May need to revisit the pipe calculations throughout when
the RFC 6675 pipe calculation is added as the default.

Diff Detail

Lint
Lint OK
Unit
No Unit Test Coverage
Build Status
Buildable 22030
Build 21260: arc lint + arc unit

Event Timeline

rscheff_gmx.at created this revision. Jan 18 2019, 4:45 PM

Are you aware of the RACK stack in /usr/src/sys/netinet/tcp_stacks/? It probably does everything you want

Yes, I am.

However, adding gradual improvements to the well-known, packet-counting base stack is less problematic over here than switching to an entirely different stack with possibly significantly different behaviour that relies on high-speed timers. At least over here, validating the single-session throughput of RACK will be a major obstacle (RACK was tested by its contributors to scale to 10 Gbps while servicing a couple hundred/thousand sessions, not a single session).

While the jury is still out on which stack can deliver better single-stream throughput at very high bandwidth and very low latency (one shallow-buffered ToR switch), why not further improve the base stack and make it less prone to self-inflicted packet loss (see the rationale for rate-halving and PRR at the IETF)? Packet-counting / ACK-clocked stacks possibly deliver better throughput in these environments.

PRR addresses two major issues (a sketch of the per-ACK computation follows the list):
a) Self-inflicted packet losses during loss recovery, which can easily lead to an RTO. The reason is that for the first half of the window, RFC 3517 will not send any segments, and for the second half of the window - provided sufficient ACKs (one per delivered segment) are returned - it continues to transmit at the original rate, likely before the overloaded buffer has drained enough.
b) If insufficient ACKs are returned - e.g. multiple packet losses, ACK thinning by the network, or (this is what triggered my work here) a buggy GRO/TCP receiver interaction returning only one ACK per superframe while the receiver holds discontiguous (out-of-order) data - RFC 3517 will shrink the final cwnd more than it needs to, in the worst case down to 1 MSS. This happens when more than 50% of the ACKs during the loss episode don't make it back to the sender (or, in my case, when only roughly every third data segment gets ACKed while data is out of order).
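
For reference, here is a minimal sketch of the per-ACK sndcnt computation from RFC 6937 (PRR with the Slow Start Reduction Bound). This is not the code in this diff; the struct and function names are purely illustrative:

```c
#include <stdint.h>

/*
 * Illustrative RFC 6937 state; names are hypothetical.  At the start of
 * recovery the caller sets ssthresh from the congestion control algorithm,
 * recover_fs to the flight size (snd_nxt - snd_una), and zeroes the
 * prr_delivered / prr_out counters.
 */
struct prr_state {
	uint32_t ssthresh;	/* target cwnd at the end of recovery (bytes) */
	uint32_t recover_fs;	/* flight size when recovery started (bytes) */
	uint32_t prr_delivered;	/* bytes reported delivered so far */
	uint32_t prr_out;	/* bytes (re)transmitted so far */
};

/*
 * Per-ACK computation: delivered_data is the amount of data newly
 * cumulatively ACKed or SACKed by this ACK, pipe is the RFC 6675 estimate
 * of bytes still outstanding.  Returns how many bytes may be sent now;
 * the caller adds whatever it actually sends to prr_out.
 */
static uint32_t
prr_sndcnt(struct prr_state *prr, uint32_t delivered_data, uint32_t pipe,
    uint32_t maxseg)
{
	uint64_t scaled;
	int64_t cnt;
	uint32_t limit, sndcnt;

	prr->prr_delivered += delivered_data;

	if (pipe > prr->ssthresh) {
		/*
		 * Proportional Rate Reduction: send ssthresh/recover_fs
		 * bytes for every byte newly delivered (rounded up), minus
		 * what was already sent; clamp at zero, since whole
		 * segments may overshoot the running allowance.
		 */
		scaled = (uint64_t)prr->prr_delivered * prr->ssthresh;
		cnt = (int64_t)((scaled + prr->recover_fs - 1) /
		    prr->recover_fs) - prr->prr_out;
		return (cnt > 0 ? (uint32_t)cnt : 0);
	}

	/*
	 * Slow Start Reduction Bound: grow pipe back toward ssthresh, but
	 * by no more than one maxseg beyond the data reported delivered,
	 * to avoid a burst after ACK loss or ACK thinning.
	 */
	cnt = (int64_t)prr->prr_delivered - prr->prr_out;
	if (cnt < (int64_t)delivered_data)
		cnt = delivered_data;
	limit = (uint32_t)cnt + maxseg;
	sndcnt = prr->ssthresh - pipe;	/* non-negative: pipe <= ssthresh */
	return (sndcnt < limit ? sndcnt : limit);
}
```

Because the allowance is driven by the amount of data reported as delivered rather than by the number of returning ACKs, missing or thinned ACKs only delay transmissions instead of permanently shrinking cwnd, and the sender keeps clocking out data during the first half of the window instead of going silent.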

I worked with rrs to suss out some bugs in this use case when it was new, at job-1. I recently tested it to 40 Gbps single flow on my own pedestrian hardware (1st gen Xeon E3, ixl on one side, Chelsio T5 on the other) and the CPU usage didn't stand out to me. Yes, there will be more timer soft interrupts, but there are actually a lot of branching and data movement optimizations in the RACK stack that also help this single-flow case. I'm not necessarily saying don't improve the default stack, just that it would probably be a good thing for your customers if you had the totality of the RACK stack improvements in there.