Support estimated RTT for receive buffer auto resizing
ClosedPublic

Authored by smh on Feb 19 2017, 1:21 AM.

Details

Summary

In order to ensure we get good performance on high capacity links with anything more than a few ms of latency, when the sender doesn't support / negotiate timestamps, we need the auto resizing of the receive buffer to support estimated RTT.

This is a first draft attempt at that. It works by using performing the resize check once every smoothed RTT (t_srtt).

It also currently includes some temporary dtrace probes to help monitor the results using the following dtrace script:

#!/usr/sbin/dtrace -s

#pragma D option quiet
#pragma D option switchrate=25hz

int last[int];
int last_check[int];

dtrace:::BEGIN
{
       printf(" %12s %12s %9s %7s %9s %9s %9s %9s %12s %12s %12s\n",
           "dt(us)", "dt_check(us)", "rtt_check", "srtt", "newsize", "ticks", "rfbuf_ts", "rtttime", "ack", "seq", "rtseq");
}

tcp:::receive-autoresize
/(args[4]->tcp_sport == 80 || args[4]->tcp_sport == 443) && args[5] != 4/
{
       elapsed = (timestamp - last[args[1]->cs_cid]) / 1000;
       printf(" %12d %12s %9d %7d %9d %9d %9d %9d %12d %12d %12d\n", elapsed, "n/a",
           args[5], args[3]->tcps_srtt, args[0], args[2], args[3]->tcps_rfbuf_ts, args[3]->tcps_rtttime, args[4]->tcp_ack, args[4]->tcp_seq, args[3]->tcps_rtseq);
       last[args[1]->cs_cid] = timestamp;
}

tcp:::receive-autoresize
/(args[4]->tcp_sport == 80 || args[4]->tcp_sport == 443) && args[5] == 4/
{
       elapsed = (timestamp - last[args[1]->cs_cid]) / 1000;
       elapsed_check = (timestamp - last_check[args[1]->cs_cid]) / 1000;
       printf(" %12d %12d %9d %7d %9d %9d %9d %9d %12d %12d %12d\n", elapsed, elapsed_check,
           args[5], args[3]->tcps_srtt, args[0], args[2], args[3]->tcps_rfbuf_ts, args[3]->tcps_rtttime, args[4]->tcp_ack, args[4]->tcp_seq, args[3]->tcps_rtseq);
       last[args[1]->cs_cid] = timestamp;
       last_check[args[1]->cs_cid] = timestamp;
}

Initial test results look good with S3 downloads at ~17ms latency on a 1Gbps connection jumping from ~3MB/s to ~80MB/s.

Test Plan

Download files from Amazon S3 (which doesn't negotiate timestamps) and confirm our download speeds are as good as or better than Linux performing the same download.

Diff Detail

Repository
rS FreeBSD src repository
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.
smh retitled this revision from to Support estimated RTT for receive buffer auto resizing.Feb 19 2017, 1:21 AM
smh updated this object.
smh edited the test plan for this revision. (Show Details)
smh updated this object.Feb 19 2017, 1:22 AM
smh edited the test plan for this revision. (Show Details)
smh added a comment.Feb 19 2017, 1:39 AM

For reference fastpath version of this hasn't been done, which need also need the same changes to:

tcp_stacks/fastpath.c
smh updated this object.Feb 19 2017, 1:43 AM
smh updated this revision to Diff 25409.Feb 20 2017, 9:08 AM

Disabled reset receive buffer auto scaling when not in bulk receive mode, which gives an extra 20% performance increase, bring it closer to Linux.

I've left the code commented out for now and added a dtrace probe, while I invesitage the interaction with this point in the code, but the performance improvement definitely indicates this is not the optimal thing to do.

smh added inline comments.Feb 20 2017, 9:13 AM
sys/netinet/tcp_input.c
1891 ↗(On Diff #25409)

Given that timestamps or estimates are used to calculate srtt is there any need for separate code paths here or is it adding complexity with no real gain?

smh added a reviewer: hiren.Feb 22 2017, 9:17 AM

Apologies for the delay in getting to this.

So, to go back to first principles for a second, the basic goal of RX sockbuf autoscaling is to avoid the window advertised to the sender becoming the limiting factor in the sender's ability to fully utilise the pipe. The sender's theoretically ideal window at any given moment will be == the path's BDP, which the receiver makes no attempt to infer/measure, so the current algorithm takes the simplistic approach of only ensuring that the largest window that can be advertised (== RX sb_hiwat) is at least 1/8 or ~12% greater than the connection's largest ever measured send window (subject to the various autoscaling and sockbuf limits). The sender's window is measured approximately once per RTT by counting bytes received over the current RTT. The algorithm makes no attempt to deal with other causes of lost transmission opportunity (e.g. if the receiving app is not adequately servicing the RX sockbuf every RTT), which arguably should be improved given the goal of this feature.

Regarding the use of SRTT to measure send window: given the imprecise requirements for the current algorithm, I suspect it's entirely reasonable to ditch the dual RTT measurement code paths and use SRTT exclusively even though it's less correct to do so. If SRTT is larger than raw RTT, we'll have a longer measurement period and likely count more bytes, which may cause a premature increase. Conversely, we may delay an increase if SRTT < raw. Neither is a show stopper.

Regarding bulk receive mode reset: it seems that perhaps the whole autoscaling code block is in the wrong place stuffed into the header prediction "fast path". Bottom line is that to trigger an increase, the connection has to be in "bulk receive mode" on account of the send window being measured as > 7/8 RX sockbuf. I also can't think of a reason why bi-directional connections shouldn't also scale their RX sockbuf.

As an aside, the "to.to_tsecr - tp->rfbuf_ts < hz" check in the TOF_TS path is completely bogus for connections with legitimate RTTs >1s (which would stand to benefit a lot from RX sockbuf scaling).

smh updated this revision to Diff 26412.Mar 19 2017, 12:29 PM

Eliminated timestamp specific code path resulting in a simplified yet still effective single SRTT path.

Implemented changes for fastpath.

Removed temporary Dtrace probes.

smh added a comment.EditedMar 19 2017, 12:37 PM

Updated dtrace:

dtrace
#!/usr/sbin/dtrace -s

#pragma D option quiet
#pragma D option switchrate=25hz

int last[int];
int last_check[int];

dtrace:::BEGIN
{
       printf(" %12s %7s %9s %9s %9s %12s %12s %12s\n",
           "dt(us)",  "srtt", "newsize", "rfbuf_ts", "rtttime", "ack", "seq", "rtseq");
}

tcp:::receive-autoresize
/(args[4]->tcp_sport == 80 || args[4]->tcp_sport == 443)/
{
       elapsed = (timestamp - last[args[1]->cs_cid]) / 1000;
       printf(" %12d %7d %9d %9d %9d %12d %12d %12d\n", elapsed,
           args[3]->tcps_srtt, args[5], args[3]->tcps_rfbuf_ts, args[3]->tcps_rtttime, args[4]->tcp_ack, args[4]->tcp_seq, args[3]->tcps_rtseq);
       last[args[1]->cs_cid] = timestamp;
}
smh marked an inline comment as done.Mar 19 2017, 12:38 PM
smh updated this revision to Diff 26416.Mar 19 2017, 12:43 PM

Corrected dtrace autoresize mbuf parameter definition.

smh added a comment.EditedMar 19 2017, 12:50 PM

Thanks for the feedback Lawrence, based on that I've updated to use just the SRTT check, added the fastpath version and generally cleaned up.

Hopefully this captures everything, putting it in a good place to merge?

emaste added a subscriber: emaste.Mar 24 2017, 3:51 PM
smh updated this revision to Diff 26885.Mar 31 2017, 4:27 PM

Extract common logic to tcp_autorcvbuf as suggested by lstewart.

smh added a comment.Apr 7 2017, 5:56 PM

Lawrence is now happy with this in is current form, so just a prod to see if I get any comments for or against from transport members / gnn?

gnn accepted this revision.Apr 7 2017, 6:00 PM

I like the use of DTrace as well. Please add the script as another commit.

This revision is now accepted and ready to land.Apr 7 2017, 6:00 PM
lstewart added inline comments.Apr 7 2017, 7:26 PM
sys/netinet/tcp_input.c
1532 ↗(On Diff #26885)

Note that t_srtt is in units of hz ticks scaled by TCP_RTT_SHIFT, and tcp_ts_getticks() returns (confusingly) "timestamp ticks" which are fixed at 1ms i.e. if hz != 1000, the units of left and right hand sides differ. You need to tweak one of the sides accordingly.

smh updated this revision to Diff 27246.Apr 8 2017, 10:28 PM

Convert timestamp to ticks to ensure that comparisons are correct for
hz != 1000.

This revision now requires review to proceed.Apr 8 2017, 10:28 PM
smh marked an inline comment as done.Apr 8 2017, 10:29 PM
gnn accepted this revision.Apr 9 2017, 10:45 PM
This revision is now accepted and ready to land.Apr 9 2017, 10:45 PM
This revision was automatically updated to reflect the committed changes.