Page MenuHomeFreeBSD

Follow rfc6675 and maintain sacked bytes to calculate correct inflight/pipe calculation.
ClosedPublic

Authored by hiren on Oct 21 2015, 11:16 PM.

Details

Summary

Randall and I have been poking at different ways to improve FreeBSD
tcp's reaction to loss. One of the major issue we found is that we do
not use information provided by SACK efficiently even though we do keep
the SACK scoreboard well in shape. Knowing amount of data in flight can
be really crucial and can help us use available capacity of the path
more efficiently. We currently do not have an accurate way of knowing
this information.

For example, inside tcp_do_segment(), while processing duplicate acks,
we try to compute amount of data inflight with:
awnd = (tp->snd_nxt - tp->snd_fack) + tp->sackhint.sack_bytes_rexmit;

Which is incorrect as it doesn't take into account whats been already
sacked by the receiver.
There are definitely other places in the stack where we do this
incorrectly.

RFC 6675 provides guidance on how to implement calculations for
bytes in flight at any point in time. Randall and I came to a conclusion
that following can provide us inflight information almost(!) accurately
with least amount of code changes:

pipe = snd_max - snd_una - sackhint.sacked_bytes + sackhint.sack_bytes_rexmit;

here,
snd_max: highest sequence number sent
snd_una: lowest sequence number sent but not yet cumulatively acked
sacked_bytes: total bytes sacked by receiver reported via SACK holes
sack_bytes_rexmit: total bytes retransmitted from holes in this recovery
period

Only missing piece in FreeBSD is sackd_bytes. This is basically total
bytes sacked by the receiver and it can be extracted from SACK holes
reported by the receiver. The approach we've decided to take is pretty
simple: we already process each ACK with sack holes in
tcp_sack_doack() and extract sack blocks out of it. We'd now also track
this new variable there which keeps track of total sacked bytes
reported.

The downside with this approach is:
There is no persistent information about sacked bytes. We recalculate
it every time we get an ACK with sack holes in it. So if, for any
reason, receiver decides to drop sack info than we get incorrect
value for inflight. This may be also true when there are more holes but
receiver can only report 3 at a time.

Diff Detail

Repository
rS FreeBSD src repository
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

hiren updated this revision to Diff 9599.Oct 21 2015, 11:16 PM
hiren retitled this revision from to Follow rfc6675 and maintain sacked bytes to calculate correct inflight/pipe calculation..
hiren updated this object.
hiren edited the test plan for this revision. (Show Details)
hiren added reviewers: transport, network.
hiren updated this object.Oct 21 2015, 11:17 PM
hiren removed a subscriber: imp.

This formula:

pipe = snd_max - snd_una - sackhint.sacked_bytes + sackhint.sack_bytes_rexmit;

Assumes holes are still in flight, and adds rexmit which is a subset of holes, double counting those bytes.

An example of why this is a problem:

Send 10 packets. SACK for 5. retransmit 5. pipe is computed as 10, reality is 5.

What we really want, I think, is to count both sacked bytes and holes together as being no longer in flight... meaning there is no real reason to distinguish between holes and sack'd segments. It also simplifies the formula:

pipe = snd_max - ((sackhint.last_sack_ack > snd_una) ? sackhint.last_sack_ack : snd_una) + sackhint.sack_bytes_rexmit

last_sack_ack appears to suffer from the same problem that sack_bytes_rexmit does: if there are too many discrete sequences to ack, the largest one might be missing and therefore last_sack_ack would be lower than expected.

hiren added a comment.EditedOct 22 2015, 1:54 AM

This formula:
pipe = snd_max - snd_una - sackhint.sacked_bytes + sackhint.sack_bytes_rexmit;
Assumes holes are still in flight, and adds rexmit which is a subset of holes, double counting those bytes.

Yes, holes are not lost yet. They exist somewhere in the network and just haven't reached the receiver. And that's why they need to be included in pipe calculation.
Now, I agree that we don't deflate pipe when we do retransmits. And this makes the approach conservative (i.e. keeps pipe large and thus let's us send less data) which I prefer opposed to what you are proposing which could burst the n/w when there is even slight reordering happening.

Quoting rfc6675:

Set a "pipe" variable to the number of outstanding octets
currently "in the pipe"; this is the data which has been
sent by the TCP sender but for which no cumulative or
selective acknowledgment has been received and the data has
not been determined to have been dropped in the network.
It is assumed that the data is still traversing the network
path.

rfc6675 verbatim implementation is a bigger effort than what I'd like to do for this pass.

First, an explanation of the formula. This is the key equation, the missing assumption in my original email:

sackhint.last_sack_ack = snd_una + sackhint.sacked_bytes + sack_holes_bytes

Original formula:

pipe = snd_max - snd_una - sackhint.sacked_bytes + sackhint.sack_bytes_rexmit

Also subtract holes from pipe:

pipe = snd_max - snd_una - sackhint.sacked_bytes - sack_holes_bytes + sackhint.sack_bytes_rexmit

Prepare for substitution of the first equation:

pipe = snd_max - (snd_una + sackhint.sacked_bytes + sack_holes_bytes) + sackhint.sack_bytes_rexmit

sackhint.last_sack_ack = snd_max - sackhint.last_sack_ack + sackhint.sack_bytes_rexmit

In short, if last_sack_ack is valid... i.e. if SACK is in play, use that instead of snd_una.

In D3971#82718, @hiren wrote:

This formula:
pipe = snd_max - snd_una - sackhint.sacked_bytes + sackhint.sack_bytes_rexmit;
Assumes holes are still in flight, and adds rexmit which is a subset of holes, double counting those bytes.

Yes, holes are not lost yet. They exist somewhere in the network and just haven't reached the receiver. And that's why they need to be included in pipe calculation.

They are *considered* somewhere in the network until 3 dupacks (or retransmit timer), per RFC. In reality, for the common case of in-order processing, the holes represent segments that are no longer in the network.

Now, I agree that we don't deflate pipe when we do retransmits. And this makes the approach conservative (i.e. keeps pipe large and thus let's us send less data) which I prefer opposed to what you are proposing which could burst the n/w when there is even slight reordering happening.

I do see the concern here, I'll agree.

Quoting rfc6675:

Set a "pipe" variable to the number of outstanding octets
currently "in the pipe"; this is the data which has been
sent by the TCP sender but for which no cumulative or
selective acknowledgment has been received and the data has
not been determined to have been dropped in the network.
It is assumed that the data is still traversing the network
path.

rfc6675 verbatim implementation is a bigger effort than what I'd like to do for this pass.

Which implies a future pass, sounds good to me.

hiren added a comment.Oct 26 2015, 4:46 PM

Thanks @jason_eggnet.com for the review.

I'd appreciate if someone else can review/comment.

rrs added a subscriber: rrs.Oct 26 2015, 9:27 PM

Hiren:

To me this looks good. You are properly computing the pipe per RFC 6675, and yes
that does mean you double count (at times) some of the data as being in-flight. The
RFC actually specifically states that :-)

So I say yes commit this :-)

In D3971#83592, @rrs wrote:

Hiren:
To me this looks good. You are properly computing the pipe per RFC 6675, and yes
that does mean you double count (at times) some of the data as being in-flight. The
RFC actually specifically states that :-)
So I say yes commit this :-)

Thanks Randall!

I'll commit this later tonight or tomorrow.

This revision was automatically updated to reflect the committed changes.