TCP Stacks, Improve rack to better handle reordering
ClosedPublic

Authored by rrs on Nov 19 2025, 7:41 PM.

Details

Reviewers

tuexen
rrs

Group Reviewers

transport

Summary

A recent LRO mis-queuing bug in the igb driver (and a few other drivers) showed that rack handles reordering
acceptably, and better than the base stack, thanks to its reordering protections, but there was still room for improvement.
When a series of packets is completely mis-ordered, the acks often arrive shortly after you have
entered recovery and retransmitted the first of the packets indicated in the sack stream. Then the cum-ack
arrives, acking basically all of those packets. If you look at the time from when you sent the retransmission to when the
ack came back, you can quickly determine that the ack was not for what you just transmitted but for the
original transmission, and that the recovery entry was completely false. Dropping out of recovery, you can then restore the
congestion state and continue on your way. The dup-acks that also arrive help increase your reordering window,
which makes you less likely to repeat the scenario.
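
As an illustration only, the timing check described above could be sketched as follows. This is not the code from this revision: the function name, its parameters, and the simple "less than one minimum RTT" threshold are all assumptions made for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not taken from the rack sources): decide whether a
 * cumulative ACK covering a retransmitted segment was really triggered by
 * the original transmission.
 *
 * rxt_time_us: time the retransmission was sent, in microseconds
 * ack_time_us: time the cumulative ACK arrived, in microseconds
 * min_rtt_us:  lowest RTT measured on the connection, in microseconds
 */
static bool
ack_is_for_original(uint64_t rxt_time_us, uint64_t ack_time_us,
    uint32_t min_rtt_us)
{
	uint64_t elapsed;

	/* Inconsistent timestamps: make no claim about the ACK. */
	if (ack_time_us < rxt_time_us)
		return (false);
	elapsed = ack_time_us - rxt_time_us;
	/*
	 * Assumption: an ACK for the retransmission itself cannot arrive
	 * sooner than one minimum RTT after the retransmission was sent.
	 * An earlier ACK must therefore belong to the original send.
	 */
	return (elapsed < min_rtt_us);
}
```

With a 50 ms minimum RTT, an ack arriving 5 ms after the retransmission would be classified as belonging to the original send, while one arriving 60 ms later would not. A real implementation would likely need a more careful threshold than this sketch uses.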

Test Plan

There is a first test you can do with a packetdrill script, which is included below. But a far better test
is to set up the igb bug and run traffic between a lab in Germany and the US. I did this and have a nice
clean BBlog of the fixes in action (with a very low retransmission rate as a result). It is available on request if you
are interested.

Here is the packetdrill script that Michael Tuexen created, with maybe a tweak from me :)

--ip_version=ipv4

0.0000 `kldload -n tcp_rack`
+0.0000 `kldload -n cc_newreno`
+0.0000 `sysctl kern.timecounter.alloweddeviation=0`

+0.0000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0.0000 setsockopt(3, IPPROTO_TCP, TCP_LOG, [4], 4) = 0
+0.0000 setsockopt(3, IPPROTO_TCP, TCP_FUNCTION_BLK, {function_set_name="rack", pcbcnt=0}, 36) = 0

+0.0000 setsockopt(3, IPPROTO_TCP, TCP_CONGESTION, "newreno", 8) = 0
+0.0000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0.0000 bind(3, ..., ...) = 0
+0.0000 listen(3, 1) = 0
+0.0000 < S 0:0(0) win 65535 <mss 1000,sackOK,eol,eol>
+0.0000 > S. 0:0(0) ack 1 win 65535 <mss 1460,sackOK,eol,eol>
+0.0500 < . 1:1(0) ack 1 win 65535
+0.0000 accept(3, ..., ...) = 4
+0.00 setsockopt(4, IPPROTO_TCP, TCP_LOG, [4], 4) = 0
+0.0000 close(3) = 0
// Trigger an initial RTT measurement of 50ms.
+0.0000 send(4, ..., 1000, 0) = 1000
+0.0000 > P. 1:1001(1000) ack 1 win 65535
+0.0500 < . 1:1(0) ack 1001 win 65535
// Send 4 full sized frames
+0.5000 send(4, ..., 4000, 0) = 4000
+0.0000 > . 1001:2001(1000) ack 1 win 65535
+0.0000 > . 2001:3001(1000) ack 1 win 65535
+0.0000 > . 3001:4001(1000) ack 1 win 65535
+0.0000 > P. 4001:5001(1000) ack 1 win 65535
// After an RTT, get acks for the fourth, third, and second segments.
+0.0500 < . 1:1(0) ack 1001 win 65535 <nop,nop,sack 4001:5001>
+0.0000 < . 1:1(0) ack 1001 win 65535 <nop,nop,sack 3001:5001>
+0.0000 < . 1:1(0) ack 1001 win 65535 <nop,nop,sack 2001:5001>
// Retransmit the missing segment after the reordering window has passed.
+0.0125 > . 1001:2001(1000) ack 1 win 65535
+0.0500 < . 1:1(0) ack 5001 win 65535
+1.0000 < F. 1:1(0) ack 5001 win 65535
+0.0000 > . 5001:5001(0) ack 2 win 65535
+0.0000 close(4) = 0
+0.0000 > F. 5001:5001(0) ack 2 win 65535
+0.0500 < . 2:2(0) ack 5002 win 65535

Event Timeline

rrs created this revision. Nov 19 2025, 7:41 PM

Herald added 1 blocking reviewer(s): transport. Nov 19 2025, 7:41 PM

Herald added subscribers: glebius, melifaro.

rrs requested review of this revision. Nov 19 2025, 7:41 PM

rrs accepted this revision. Dec 11 2025, 4:57 PM

This revision is now accepted and ready to land. Dec 11 2025, 4:57 PM

How does this algorithm relate to the reordering tolerance of RACK defined by step 4 of RFC 8985?

In D53832#1237840, @tuexen wrote:

How does this algorithm relate to the reordering tolerance of RACK defined by step 4 of RFC 8985?

Step 4 (page 14) of the RFC is implemented in rack, which means the reordering window stretches out as you
see dup-acks. This new algorithm is in addition to that. The case made most noticeable by the LRO
bug mentioned is the scenario where no dup-acks have been received yet, so the reordering window is too
small. You then end up retransmitting that one segment, which knocks you out of slow start and puts you in recovery. Immediately
after doing that you get the ACK for the original send. This means you have left slow start for no good reason, because of the smaller reordering
window. What this new update does is recognize this condition, and that you should not have entered recovery, and undo it so you
go back to slow start. The later-arriving dup-acks (coming in behind this segment, for example) will increase your reordering window
as in step 4, so hopefully you won't end up here again. However, in testing between Florida and Germany I found that this issue happened
(with the LRO bug on) a lot more than initially, probably due to the large bandwidth-delay product and largish buffers. Having this algorithm
in addition to step 4 meant we could stay in slow start and only leave it when we encountered a real loss event.
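
To make the "undo" step concrete, here is a minimal sketch of saving the congestion state on recovery entry and restoring it once the entry is judged false. The structure, field names, and function names are hypothetical and heavily simplified. They are not the rack data structures, and the window halving on entry is only a NewReno-style placeholder.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified congestion state; not the rack structures. */
struct cc_state {
	uint32_t cwnd;
	uint32_t ssthresh;
	bool in_recovery;
	/* Snapshot taken when recovery is entered. */
	uint32_t prev_cwnd;
	uint32_t prev_ssthresh;
	bool have_snapshot;
};

/* Enter recovery, remembering the state we may need to restore. */
static void
enter_recovery(struct cc_state *cc)
{
	cc->prev_cwnd = cc->cwnd;
	cc->prev_ssthresh = cc->ssthresh;
	cc->have_snapshot = true;
	/* Assumption: NewReno-style halving stands in for the real response. */
	cc->ssthresh = cc->cwnd / 2;
	cc->cwnd = cc->ssthresh;
	cc->in_recovery = true;
}

/* Recovery entry was false: put the saved congestion state back. */
static void
undo_false_recovery(struct cc_state *cc)
{
	if (!cc->in_recovery || !cc->have_snapshot)
		return;
	cc->cwnd = cc->prev_cwnd;
	cc->ssthresh = cc->prev_ssthresh;
	cc->in_recovery = false;
	cc->have_snapshot = false;
}

/* Slow start here simply means: not in recovery and cwnd below ssthresh. */
static bool
in_slow_start(const struct cc_state *cc)
{
	return (!cc->in_recovery && cc->cwnd < cc->ssthresh);
}
```

In this sketch a connection that was in slow start before `enter_recovery()` is back in slow start after `undo_false_recovery()`, which is the behavior described above. How the reordering window grows afterwards is left out.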