
Add support for higher resolution timestamps
Needs Review · Public

Authored by mmacy on May 7 2018, 7:58 AM.

Details

Reviewers
rrs
lstewart
rstone
jeff
Group Reviewers
transport
Summary

Low latency TCP work from ISLN taken from rstone's branch and updated against HEAD. Probably does not work as is, but need to get the ball rolling on discussing the design and validation.

Convert TCP timestamps to use a scaled sbintime derived from the TSC.
See the following paper for background:
Safe and Effective Fine-grained TCP Retransmissions for Datacenter Communication

https://www.cs.cmu.edu/afs/cs.cmu.edu/Web/People/ekrevat/docs/SIGCOMMIncast.pdf
NB: this is a first iteration. It still needs to be extended to support unsynchronized TSCs, and the MD bits need to be refactored more appropriately.
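As a rough illustration of deriving an sbintime from a cycle counter, here is a minimal userland sketch. sbintime_t is FreeBSD's 32.32 fixed-point time type; the function name cycles_to_sbt and the assumption of a single, constant, known counter frequency below 2^32 Hz are illustrative only and are not taken from the patch. It also ignores the unsynchronized-TSC case mentioned above.

```c
#include <assert.h>
#include <stdint.h>

/* 32.32 fixed-point seconds, matching the layout of FreeBSD's sbintime_t. */
typedef int64_t sbintime_t;

/*
 * Hypothetical helper: convert a raw cycle count to sbintime.
 * Assumes one constant counter frequency (freq_hz) shared by all CPUs.
 */
static sbintime_t
cycles_to_sbt(uint64_t cycles, uint64_t freq_hz)
{
	uint64_t secs, rem;

	/* Keep (rem << 32) below from overflowing 64 bits. */
	assert(freq_hz > 0 && freq_hz < (UINT64_C(1) << 32));

	secs = cycles / freq_hz;	/* whole seconds */
	rem = cycles % freq_hz;		/* leftover cycles, < freq_hz */

	/* Whole seconds must fit in the signed upper 32 bits. */
	assert(secs < (UINT64_C(1) << 31));

	/* Upper 32 bits: seconds. Lower 32 bits: binary fraction of a second. */
	return (sbintime_t)((secs << 32) | ((rem << 32) / freq_hz));
}
```

Splitting the count into whole seconds and a remainder avoids a 128-bit intermediate; a real implementation would more likely precompute a scale factor once per frequency change.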

Test Plan

Compare the incast mitigation effects of sub-millisecond RTO versus legacy tick-based timestamps.

Correctness TBD.

Diff Detail

Repository
rS FreeBSD src repository
Lint
Lint Skipped
Unit
Unit Tests Skipped

Event Timeline

mmacy created this revision. May 7 2018, 7:58 AM
mmacy edited the summary of this revision. May 7 2018, 7:59 AM
mmacy edited the test plan for this revision.
mmacy added a subscriber: mjoras.
jtl added a subscriber: jtl. May 7 2018, 2:17 PM

How does this interact with the low-latency, high-precision timestamp option being discussed at the IETF?

See https://tools.ietf.org/html/draft-wang-tcpm-low-latency-opt-00 and https://www.ietf.org/proceedings/97/slides/slides-97-tcpm-tcp-options-for-low-latency-00.pdf .

mmacy added a comment. Edited May 7 2018, 6:10 PM
In D15337#323183, @jtl wrote:

How does this interact with the low-latency, high-precision timestamp option being discussed at the IETF?
See https://tools.ietf.org/html/draft-wang-tcpm-low-latency-opt-00 and https://www.ietf.org/proceedings/97/slides/slides-97-tcpm-tcp-options-for-low-latency-00.pdf .

@jtl I was under the impression that that discussion had stalled. The RFC you point at expired on December 10, 2017. Is there a mailing list where this is being discussed or do I need to mail them individually?

But the answer is, it doesn't. Currently you have to have a sanely connected data center to optimally interoperate. I would very much like to be able to negotiate away delack.

However, the problem as stated is somewhat exaggerated. 100ms is well within the exponential backoff of the current value of TCP_MAXRTSHIFT (12) for RTT values down to 10us.
(RTO == RTT + 4*RTTVAR => minimum calculable RTO == 5*RTT; 5*10us * 2^11 == 102400us == ~100ms, and the cumulative sum of the RTOs through that shift is ~200ms)
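The backoff arithmetic can be checked with a small sketch. It assumes the RTO doubles on every retransmit, which is a simplification (the stack's real backoff table may cap the multiplier), and the helper names and the MAXRTSHIFT macro are illustrative stand-ins rather than kernel symbols.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors the TCP_MAXRTSHIFT value (12) quoted in the discussion. */
#define MAXRTSHIFT 12

/* RTO in microseconds after `shift` doublings of the base RTO. */
static uint64_t
rto_at_shift_us(uint64_t base_rto_us, unsigned shift)
{
	assert(shift <= MAXRTSHIFT);
	return base_rto_us << shift;
}

/* Total time waited across shifts 0..last_shift inclusive. */
static uint64_t
cumulative_rto_us(uint64_t base_rto_us, unsigned last_shift)
{
	uint64_t total = 0;

	for (unsigned s = 0; s <= last_shift; s++)
		total += rto_at_shift_us(base_rto_us, s);
	return total;
}
```

With a 10us RTT and a base RTO of 5*RTT = 50us, rto_at_shift_us(50, 11) gives 102400us (about 100ms) and cumulative_rto_us(50, 11) gives 204750us (about 200ms).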

Delayed ACKs can lead to many spurious retransmits but not an actual reset. This problem is further lessened on Linux which does not do delayed acknowledgements for the first 16 packets of slow start - which I believe is the window during which this is most likely.

In principle, causing spurious retransmits will mess up the size of the congestion window, but if the inter-packet gap is substantially greater than the RTT you should be recalculating cwnd anyway. At ISLN we did not find DCTCP particularly helpful because it took too long to converge. In the data center, steady-state pipe availability is the exception, not the norm.

Thanks.

jtl added a comment. May 7 2018, 6:26 PM
In D15337#323183, @jtl wrote:

How does this interact with the low-latency, high-precision timestamp option being discussed at the IETF?
See https://tools.ietf.org/html/draft-wang-tcpm-low-latency-opt-00 and https://www.ietf.org/proceedings/97/slides/slides-97-tcpm-tcp-options-for-low-latency-00.pdf .

@jtl I was under the impression that that discussion had stalled. The RFC you point at expired on December 10, 2017. Is there a mailing list where this is being discussed or do I need to mail them individually?

If discussed publicly, it would probably be on the TCPM working group's IETF list. The last discussion I see there was August 2017. However, I highly suspect the work is still active, just not ready for the next round of public proposal. You could either try pinging the authors privately or sending a message to the TCPM mailing list. If they knew others were trying to solve this problem (and even considering adding code to an open-source OS), I suspect they may become more motivated to circulate an updated proposal.

But the answer is, it doesn't.

I think it would be ideal to align with whatever it looks like TCPM will standardize, assuming there is some reasonable prospect that will occur in the near-ish future.

mmacy added a comment. May 7 2018, 6:33 PM
In D15337#323285, @jtl wrote:
In D15337#323183, @jtl wrote:

How does this interact with the low-latency, high-precision timestamp option being discussed at the IETF?
See https://tools.ietf.org/html/draft-wang-tcpm-low-latency-opt-00 and https://www.ietf.org/proceedings/97/slides/slides-97-tcpm-tcp-options-for-low-latency-00.pdf .

@jtl I was under the impression that that discussion had stalled. The RFC you point at expired on December 10, 2017. Is there a mailing list where this is being discussed or do I need to mail them individually?

If discussed publicly, it would probably be on the TCPM working group's IETF list. The last discussion I see there was August 2017. However, I highly suspect the work is still active, just not ready for the next round of public proposal. You could either try pinging the authors privately or sending a message to the TCPM mailing list. If they knew others were trying to solve this problem (and even considering adding code to an open-source OS), I suspect they may become more motivated to circulate an updated proposal.

Thanks. I will follow up with TCPM mailing list and the authors and post any updates here.

But the answer is, it doesn't.

I think it would be ideal to align with whatever it looks like TCPM will standardize, assuming there is some reasonable prospect that will occur in the near-ish future.

Again, although an important semi-dependency, the actual _code changes_ for negotiating delack would be largely orthogonal to the work here.

mmacy added a subscriber: seanc. May 7 2018, 7:50 PM
rstone added a comment. May 7 2018, 8:06 PM

FWIW, I had a short email conversation with the writers of that draft about a year ago and they went radio silent pretty quickly. I assumed that they weren't working on it anymore.

mmacy added a comment. Edited May 7 2018, 8:13 PM

FWIW, I had a short email conversation with the writers of that draft about a year ago and they went radio silent pretty quickly. I assumed that they weren't working on it anymore.

@mjoras told me about that. I'm going to consider their work purely as a "nice to have" until I see something concrete. My interests for the review center around validation and minimizing any negative performance impact. For the former I'll need more input from @lstewart.

To minimize performance impact I think I'll want to have two modes, "legacy" where timestamps are at tick granularity and "high-res" where they're at 5us granularity. The trick will be moving between the two. However, first we need to see how much it actually matters.
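A minimal sketch of how the two modes might share one conversion, differing only in the timestamp rate. The 200000 Hz rate follows from the 5us granularity mentioned above; the 1000 Hz legacy rate, the macro names, and the function sbt_to_tcp_ts are assumptions made for illustration, not code from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* 32.32 fixed-point seconds, matching the layout of FreeBSD's sbintime_t. */
typedef int64_t sbintime_t;
#define SBT_1S		((sbintime_t)1 << 32)

#define TS_HZ_LEGACY	1000u	/* assumed tick rate (hz = 1000) */
#define TS_HZ_HIRES	200000u	/* 5us per timestamp unit */

/*
 * Hypothetical helper: scale an sbintime to a 32-bit TCP timestamp
 * counting ts_hz units per second. The result wraps modulo 2^32.
 */
static uint32_t
sbt_to_tcp_ts(sbintime_t sbt, uint32_t ts_hz)
{
	uint64_t secs, frac;

	assert(sbt >= 0);
	secs = (uint64_t)sbt >> 32;		/* whole seconds */
	frac = (uint64_t)sbt & 0xffffffffu;	/* binary fraction */

	/* Scale both parts separately so neither product overflows. */
	return (uint32_t)(secs * ts_hz + ((frac * ts_hz) >> 32));
}
```

At 5us granularity a 32-bit timestamp wraps roughly every 21475 seconds (about six hours), compared with weeks at millisecond ticks, so any mode switch would also need to account for the different wrap periods.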

rstone added a comment. May 7 2018, 8:16 PM

The message that I got at BSDCan last year is that they wanted to see at least a full IETF draft before these changes could be merged in, which is why my work on it stalled. Has people's position on this changed?

mmacy added a comment. May 7 2018, 8:21 PM
This comment was removed by mmacy.
mmacy added a comment. Edited May 9 2018, 11:47 PM

The message that I got at BSDCan last year is that they wanted to see at least a full IETF draft before these changes could be merged in, which is why my work on it stalled. Has people's position on this changed?

Hi Matthew,

Thanks for the information.
To provide a status update on draft-wang-tcpm-low-latency-opt-00: as described in the draft, the initial use of the low latency option was to communicate the min RTO between both ends of the connection in order to help determine the delayed ACK timeout. However, after receiving feedback from the IETF, we decided to experiment with proactive probing instead of the additional TCP option. So the plan to use the low latency option is currently dropped.

Thanks.

That’s that. We’re not going to wait on an RFC that is not being pursued.

jeff added a comment. May 13 2018, 12:15 AM

My feeling is that ticks is unlikely to go any faster on general purpose kernels and some technique like this is inevitable as we continue to scale link performance. Some slight extra CPU time is a good trade-off for also eliminating weird rounding conditions and scaling factors. Overall I support this work going forward.

I do believe we should separate as much as possible the higher resolution timers from any change in behavior. I also believe that this patch may be missing other bug fixes we did to TCP before this code was released. I have asked rstone to verify.

sys/netinet/tcp_input.c
345

It might be nice to change the unit of this variable so we're not multiplying it out everywhere. We should use the opportunity created by the churn to standardize on a unit elsewhere as well.

hiren added a subscriber: hiren. May 17 2018, 6:59 PM
mmacy added a comment. Jun 12 2018, 3:38 AM

@rstone do you think you might be able to share the work that you or Jeff did on the frontend to adaptively lower the minRTO if the pipe was full?

@rstone do you think you might be able to share the work that you or Jeff did on the frontend to adaptively lower the minRTO if the pipe was full?

I don't recall us doing anything like that, unless you mean one of these commits?

ec38141c117dd Improve recovery time from loss in a fashion similar to Linux by scaling up to ssthresh by acked bytes.
1a40d03bb7cdc Lower the minimum timeout via three mechanisms. The absolute minimum is lowered, the smoothed rtt is no longer clamped by the minimum which prevents rttvar from normalizing to minrtt, and the rexmit slop is avoided in cases that should not trigger delayed acks.
54563d7adb027 Use the maximum recent cwnd when calculating the expected samples to prevent ack timing on retransmits with small window sizes from artificially lowering our average rtt. Use a different clamp value for minimum measured rtt since observing too many values in too close of proximity can artificially lower rttvar.
efbe4b9d4b366 Use an absolute timestamp to detect bad retransmits rather than the rto. When we have a very long rto we can erroneously consider a valid retransmit invalid which causes us to incorrectly restore snd_una to snd_max and attempt to continue transmitting.
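
To illustrate the idea in the last commit message, here is a hedged sketch of bad-retransmit detection against an absolute deadline instead of the current RTO. The struct, the function names, and the choice of half the smoothed RTT as the window are all assumptions for illustration; the commit itself is not shown in this review and may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-connection state; not the kernel's struct tcpcb. */
struct rxt_state {
	uint32_t bad_rxt_deadline;	/* absolute time, timestamp units */
	bool	 prev_valid;		/* saved cwnd/ssthresh can be restored */
};

/* Record an absolute deadline when the first retransmit is sent. */
static void
rxt_arm(struct rxt_state *rs, uint32_t now, uint32_t srtt)
{
	assert(rs != NULL);
	/* Assumed window: half the smoothed RTT, independent of the RTO. */
	rs->bad_rxt_deadline = now + srtt / 2;
	rs->prev_valid = true;
}

/*
 * An ACK arriving before the deadline came too soon to be a response
 * to the retransmit, so the retransmit is treated as spurious.
 * The signed difference keeps the comparison correct across wraparound.
 */
static bool
rxt_was_spurious(const struct rxt_state *rs, uint32_t ack_time)
{
	return rs->prev_valid &&
	    (int32_t)(ack_time - rs->bad_rxt_deadline) < 0;
}
```

Because the deadline is fixed when the retransmit is sent, a very long RTO cannot widen the window in which a valid retransmit is misclassified.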