Page MenuHomeFreeBSD

Phase 2 to add Proportional Rate Reduction (RFC6937) to FreeBSD
Needs ReviewPublic

Authored by rscheff on Jan 18 2019, 4:45 PM.

Details

Summary

Using Aris Angelogiannopoulos patch from 2013 as base, and
facilitating the improved, single pass accounting in the
SACK scoreboard (D18624). Also switched from sack_newdata
to snd_recover (D18811).

May need to revisit the pipe calculations throughout when
adding RFC6675 pipe calculations as default.

Diff Detail

Lint
Lint OK
Unit
No Unit Test Coverage
Build Status
Buildable 33580
Build 30832: arc lint + arc unit

Event Timeline

Are you aware of the RACK stack in /usr/src/sys/netinet/tcp_stacks/? It probably does everything you want

Are you aware of the RACK stack in /usr/src/sys/netinet/tcp_stacks/? It probably does everything you want

Yes, I am;

However, adding gradual improvements to the known packet-counting base stack is less problematic over here than switching to an entire different stack with possibly significant different behaviour, that relies on high-speed timers. At least over here, validating the single-session throughput of RACK will be a major obstacle (RACK was tested by the contributors to scale to 10Gbps, while servicing a couple hundred/thousand sessions, not a single session).

While the jury is out which stack can deliver better singel-stream at very high bandwidth and very low latency (1 TOR shallow-buffer switch), why not further improve the base stack and make it less prone to self-inflicted packet loss (see Rationale for Rate-halving and PRR @ IETF). Packet-counting / ACK-clocked stacks possibly deliver better throughput in these environments.

PRR adresses two major issues:
a) self-inflicted packet losses during loss recovery, which can easily lead to RTO. Reason here is, that for the 1st half of the window, 3517 will not send any segments, and for the 2nd half of the window - provided sufficient ACKs (one per delivered segment) are returned - continues to transmit at the original rate - likely before the overloaded buffer got drained enough.
b) if insufficient ACKs are returned, e.g. multiple packet losses, or ACK thinning by the network, or (this is what triggered my work her) a buggy GRO TCP receiver interaction returning one ACK per superframe only, while the receiver has discontinous data (OOO), 3517 will shrink the final cwnd more than it needs to - in the worst case, down to 1 MSS. This will happen, when more than 50% of the ACKs during the loss episode don't make it back to the sender (or, in my case, ~ every 3rd data segment gets ACKs, while OOO).

I worked with rrs to suss out some bugs on this use case when it was new at job-1. I recently tested it to 40gbps single flow on my own pedestrian hardware (1st gen Xeon e3, ixl on one side chelsio T5 on the other) and the CPU usage didn't stand out to me. Yes there will be more timer soft interrupts but there are actually a lot of branching and date movement optimizations in the RACK stack that also help this single flow case. I'm not necessarily saying don't improve the default stack, just that it would probably be a good thing for your customers if you had the totality of the RACK stack improvements in there.

This is a visual code read only review, I have not compiled or tested the code.

sys/netinet/tcp_input.c
2595

Mixing white space fixes makes more lines of code to review. White space fixes should just be committed individually and ASAP.

2628

Duplicated code from below copied into above should be reduciable by use of complicated if expression, which would then more closely match the comment above.

if (V_tcp_do_prf || tp->t_flags & TF_SACK_PERMIT) {

  • minimal diff for prr only

Note that the above diff will *not* compile by itself, as it is dependent on D18624. It should therefore only be applies after the necessary changes to improve the SACK scoreboard accounting and supporting code there.

This patch needs additional unit testing, to verify that this patch does what it claims to do, and have no detrimental side effects.

  • Update Diff to Head to apply cleanly
  • use dynamic tcp_maxseg(tp) instead of t_maxseg in PRR calculations

Given the way things have gone, I'd like to see this in FreeBSD 13. I don't have the time to test and reason through sample streams with this change to give a formal approval, so consider this a concept ACK.

FWIW: I asked Liang Tian <l.tian.email@gmail.com> to verify in his special test scenario if this implementation of PRR performs as expected.

The environment simulated would drop every other, or every 3rd segment, and also thin out (drop) ACKs on the return path. With normal 3517 SACK loss recovery, this would quickly lead to the base stack "running out" of ACKs to clock out new/retransmitted data. Ultimately, one or more RTOs would be required to recover in that situation.

Liang reported (as internal testing by Aris Angelogiannopoulos during his master thesis in 2013) that this patch will clock out data "appropriately" (still bundled together, when ACK thinning occurs, so RACK + pacing would still be superior), and avoid RTOs as expected.

Discussed the next steps around this Diff in todays call:

  • the PRR functionality is off by default (no change expected in the base stack behavior, unless explicitly enabled)
  • the functionality has been validated to work as expected in the scenario this is designed to address (ACK thinning during SACK loss recovery)

Unless anyone objects in the next 2 weeks (3rd Dec), I will go ahead in committing this Diff, with the defaults turned off.

Following this:

  • 6937bis is to document a heuristic allowing the automatic toggling between the two PRR modes, based on information in the ACKs
  • Once the heuristic works satisfactory, and the 13.0 release has been forked, I propose to change the default of the PRR mechanism to on.

@rscheff I'm not a fan of 13 shipping with this off if there are no known issues. The default stack is what most large and small freebsd send heavy users are on, and many larger commercial users are unaware of tuning parameters or the alternate stacks, so putting our best foot forward will help retain these users in the timeframe of 13.0. What do you think about toggling it on right before 13 branches, and it can be disabled if any PRs arise during the release process?

Well, there are "known issues" insofar, as that PRR-SS (the non-conservative variant) is known to sometimes perform poorly with token bucket rate limiters - where the PRR_conservative mechanism is a better fit.

Short of the heuristic promised to be documented in the -bis RFC, I would think we could enable PRR in conservative mode and gather some feedback, deciding what defaults we want to ship 13.0 with.

(In todays transport call, we (@jtl and I) discussed exactly this, but erred on the more cautious commit,default off, and change default after 13.0 released...)

However, I know of only "artificial" tests (packetdrill, the master thesis scenarios from Aris in 2013, and the extreme ack thinning scenario by Liang Tang) looking at the (correct) microscopic behavior of this code; Having said this, PRR is not expected to perform worse than RFC3517 SACK loss recovery (pending any unfound bugs so far).

If people run into bugs with it, disabling is straightforward and we can rapidly diff the default to off...

My opinion is based on previous stack changes where there are 3 common cases I observed:

  1. A general improvement; ship it! Everyone benefits.
  2. A fatal error like connection drops or kernel panics; revert it! Since your internal testing has been performed, this kind of thing can still happen and it's ok to use HEAD to gather final testing and feedback, and back out if something was missed because it is unknown.
  3. A subtle performance issue but no impact to connectivity; fix it during the branching period or issue an EN. This doesn't necessarily fall onto you, for instance if it is a peculiar use case it might be an opportunity for someone else to diagnose and get exposed to the stack and help out. If it can be totally isolated with the sysctl, it is a very simple workaround for special cases.

Without sufficient end user testing it seems like this would fall into 1) with a minority chance of 3). I think using HEAD to test will help determine whether that is true and should be used to guide the enablement or disablement for -RELEASE.