Details

Reviewers

hiren
lstewart
gnn

Group Reviewers

transport

Commits

rS316676: Use estimated RTT for receive buffer auto resizing instead of timestamps

Summary

In order to ensure we get good performance on high capacity links with anything more than a few ms of latency, when the sender doesn't support / negotiate timestamps, we need the auto resizing of the receive buffer to support estimated RTT.

This is a first draft attempt at that. It works by using performing the resize check once every smoothed RTT (t_srtt).

It also currently includes some temporary dtrace probes to help monitor the results using the following dtrace script:

#!/usr/sbin/dtrace -s

#pragma D option quiet
#pragma D option switchrate=25hz

int last[int];
int last_check[int];

dtrace:::BEGIN
{
       printf(" %12s %12s %9s %7s %9s %9s %9s %9s %12s %12s %12s\n",
           "dt(us)", "dt_check(us)", "rtt_check", "srtt", "newsize", "ticks", "rfbuf_ts", "rtttime", "ack", "seq", "rtseq");
}

tcp:::receive-autoresize
/(args[4]->tcp_sport == 80 || args[4]->tcp_sport == 443) && args[5] != 4/
{
       elapsed = (timestamp - last[args[1]->cs_cid]) / 1000;
       printf(" %12d %12s %9d %7d %9d %9d %9d %9d %12d %12d %12d\n", elapsed, "n/a",
           args[5], args[3]->tcps_srtt, args[0], args[2], args[3]->tcps_rfbuf_ts, args[3]->tcps_rtttime, args[4]->tcp_ack, args[4]->tcp_seq, args[3]->tcps_rtseq);
       last[args[1]->cs_cid] = timestamp;
}

tcp:::receive-autoresize
/(args[4]->tcp_sport == 80 || args[4]->tcp_sport == 443) && args[5] == 4/
{
       elapsed = (timestamp - last[args[1]->cs_cid]) / 1000;
       elapsed_check = (timestamp - last_check[args[1]->cs_cid]) / 1000;
       printf(" %12d %12d %9d %7d %9d %9d %9d %9d %12d %12d %12d\n", elapsed, elapsed_check,
           args[5], args[3]->tcps_srtt, args[0], args[2], args[3]->tcps_rfbuf_ts, args[3]->tcps_rtttime, args[4]->tcp_ack, args[4]->tcp_seq, args[3]->tcps_rtseq);
       last[args[1]->cs_cid] = timestamp;
       last_check[args[1]->cs_cid] = timestamp;
}

Initial test results look good with S3 downloads at ~17ms latency on a 1Gbps connection jumping from ~3MB/s to ~80MB/s.

Test Plan

Download files from Amazon S3 (which doesn't negotiate timestamps) and confirm our download speeds are as good as or better than Linux performing the same download.

Diff Detail

Repository

rS FreeBSD src repository - subversion

Lint

Lint Passed

Unit

No Test Coverage

Build Status

Buildable 7578
Build 7730: arc lint + arc unit

Event Timeline

smh updated this revision to Diff 25370.Feb 19 2017, 1:21 AM

smh retitled this revision from to Support estimated RTT for receive buffer auto resizing.

smh updated this object.

smh edited the test plan for this revision. (Show Details)

Herald added reviewers: gnn, transport. · View Herald TranscriptFeb 19 2017, 1:21 AM

Herald added a subscriber: imp. · View Herald Transcript

smh updated this object.Feb 19 2017, 1:22 AM

smh edited the test plan for this revision. (Show Details)

smh edited edge metadata.

For reference fastpath version of this hasn't been done, which need also need the same changes to:

tcp_stacks/fastpath.c

smh updated this object.Feb 19 2017, 1:43 AM

lstewart added a subscriber: lstewart.Feb 20 2017, 8:45 AM

Disabled reset receive buffer auto scaling when not in bulk receive mode, which gives an extra 20% performance increase, bring it closer to Linux.

I've left the code commented out for now and added a dtrace probe, while I invesitage the interaction with this point in the code, but the performance improvement definitely indicates this is not the optimal thing to do.

Herald edited edge metadata. · View Herald TranscriptFeb 20 2017, 9:08 AM

lstewart added a reviewer: lstewart.Feb 20 2017, 9:10 AM

smh added inline comments.Feb 20 2017, 9:13 AM

sys/netinet/tcp_input.c
1891	Given that timestamps or estimates are used to calculate srtt is there any need for separate code paths here or is it adding complexity with no real gain?

smh added a reviewer: hiren.Feb 22 2017, 9:17 AM

kbowling added a subscriber: kbowling.Mar 13 2017, 7:02 AM

Apologies for the delay in getting to this.

So, to go back to first principles for a second, the basic goal of RX sockbuf autoscaling is to avoid the window advertised to the sender becoming the limiting factor in the sender's ability to fully utilise the pipe. The sender's theoretically ideal window at any given moment will be == the path's BDP, which the receiver makes no attempt to infer/measure, so the current algorithm takes the simplistic approach of only ensuring that the largest window that can be advertised (== RX sb_hiwat) is at least 1/8 or ~12% greater than the connection's largest ever measured send window (subject to the various autoscaling and sockbuf limits). The sender's window is measured approximately once per RTT by counting bytes received over the current RTT. The algorithm makes no attempt to deal with other causes of lost transmission opportunity (e.g. if the receiving app is not adequately servicing the RX sockbuf every RTT), which arguably should be improved given the goal of this feature.

Regarding the use of SRTT to measure send window: given the imprecise requirements for the current algorithm, I suspect it's entirely reasonable to ditch the dual RTT measurement code paths and use SRTT exclusively even though it's less correct to do so. If SRTT is larger than raw RTT, we'll have a longer measurement period and likely count more bytes, which may cause a premature increase. Conversely, we may delay an increase if SRTT < raw. Neither is a show stopper.

Regarding bulk receive mode reset: it seems that perhaps the whole autoscaling code block is in the wrong place stuffed into the header prediction "fast path". Bottom line is that to trigger an increase, the connection has to be in "bulk receive mode" on account of the send window being measured as > 7/8 RX sockbuf. I also can't think of a reason why bi-directional connections shouldn't also scale their RX sockbuf.

As an aside, the "to.to_tsecr - tp->rfbuf_ts < hz" check in the TOF_TS path is completely bogus for connections with legitimate RTTs >1s (which would stand to benefit a lot from RX sockbuf scaling).

Eliminated timestamp specific code path resulting in a simplified yet still effective single SRTT path.

Implemented changes for fastpath.

Removed temporary Dtrace probes.

Herald edited edge metadata. · View Herald TranscriptMar 19 2017, 12:29 PM

Updated dtrace:

dtrace
#!/usr/sbin/dtrace -s

#pragma D option quiet
#pragma D option switchrate=25hz

int last[int];
int last_check[int];

dtrace:::BEGIN
{
       printf(" %12s %7s %9s %9s %9s %12s %12s %12s\n",
           "dt(us)",  "srtt", "newsize", "rfbuf_ts", "rtttime", "ack", "seq", "rtseq");
}

tcp:::receive-autoresize
/(args[4]->tcp_sport == 80 || args[4]->tcp_sport == 443)/
{
       elapsed = (timestamp - last[args[1]->cs_cid]) / 1000;
       printf(" %12d %7d %9d %9d %9d %12d %12d %12d\n", elapsed,
           args[3]->tcps_srtt, args[5], args[3]->tcps_rfbuf_ts, args[3]->tcps_rtttime, args[4]->tcp_ack, args[4]->tcp_seq, args[3]->tcps_rtseq);
       last[args[1]->cs_cid] = timestamp;
}

smh marked an inline comment as done.Mar 19 2017, 12:38 PM

Corrected dtrace autoresize mbuf parameter definition.

Thanks for the feedback Lawrence, based on that I've updated to use just the SRTT check, added the fastpath version and generally cleaned up.

Hopefully this captures everything, putting it in a good place to merge?

Harbormaster completed remote builds in B8132: Diff 26412.Mar 19 2017, 1:05 PM

Harbormaster completed remote builds in B8135: Diff 26416.Mar 19 2017, 1:41 PM

emaste added a subscriber: emaste.Mar 24 2017, 3:51 PM

Extract common logic to tcp_autorcvbuf as suggested by lstewart.

Harbormaster completed remote builds in B8418: Diff 26885.Mar 31 2017, 4:27 PM

Harbormaster completed remote builds in B8418: Diff 26885.Mar 31 2017, 5:05 PM

Lawrence is now happy with this in is current form, so just a prod to see if I get any comments for or against from transport members / gnn?

I like the use of DTrace as well. Please add the script as another commit.

This revision is now accepted and ready to land.Apr 7 2017, 6:00 PM

lstewart added inline comments.Apr 7 2017, 7:26 PM

sys/netinet/tcp_input.c
1532	Note that t_srtt is in units of hz ticks scaled by TCP_RTT_SHIFT, and tcp_ts_getticks() returns (confusingly) "timestamp ticks" which are fixed at 1ms i.e. if hz != 1000, the units of left and right hand sides differ. You need to tweak one of the sides accordingly.

Convert timestamp to ticks to ensure that comparisons are correct for
hz != 1000.

This revision now requires review to proceed.Apr 8 2017, 10:28 PM

Harbormaster completed remote builds in B8622: Diff 27246.Apr 8 2017, 10:28 PM

smh marked an inline comment as done.Apr 8 2017, 10:29 PM

Harbormaster completed remote builds in B8622: Diff 27246.Apr 8 2017, 11:14 PM

gnn accepted this revision.Apr 9 2017, 10:45 PM

This revision is now accepted and ready to land.Apr 9 2017, 10:45 PM

Closed by commit rS316676: Use estimated RTT for receive buffer auto resizing instead of timestamps (authored by smh). · Explain WhyApr 10 2017, 8:19 AM

This revision was automatically updated to reflect the committed changes.

Support estimated RTT for receive buffer auto resizing
ClosedPublic
Actions

Details

Diff Detail

Event Timeline

Revision Contents
Changeset List

Diff 25409

sys/netinet/in_kdtrace.h

sys/netinet/in_kdtrace.c

sys/netinet/tcp_input.c

sys/netinet/tcp_output.c

Support estimated RTT for receive buffer auto resizingClosedPublicActions

Details

Diff Detail

Event Timeline

Revision ContentsChangeset List

Diff 25409

sys/netinet/in_kdtrace.h

sys/netinet/in_kdtrace.c

sys/netinet/tcp_input.c

sys/netinet/tcp_output.c

Support estimated RTT for receive buffer auto resizing
ClosedPublic
Actions

Revision Contents
Changeset List