
TCP Stacks, Improve rack to better handle reordering
Accepted, Public

Authored by rrs on Nov 19 2025, 7:41 PM.

Details

Reviewers
tuexen
rrs
Group Reviewers
transport
Summary

A recent bug in the igb driver (and a few others) caused LRO to mis-queue packets. Rack handled this acceptably,
and better than the base stack, thanks to its reordering protections, but there was still room for improvement.
When a series of packets is completely misordered, the acks often arrive shortly after you have
entered recovery and retransmitted the first of the packets indicated in the SACK stream. Then the cum-ack
arrives, acking essentially all of those packets. Comparing the time the retransmission was sent with the time the
ack came back quickly shows that the ack was not for what you just transmitted but for the original
transmission, so the recovery entry was completely false. You can then drop out of recovery, restore the
congestion state, and continue on your way. The dup-acks that also arrive help increase the reordering window,
which makes a repeat of the scenario less likely.
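
The timing check described above can be sketched roughly as follows. This is an illustration only, not the code in this revision: the function name `ack_is_for_original` and the use of the minimum RTT as the threshold are assumptions made for the example, and the actual rack implementation may use a different comparison.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper (not taken from tcp_rack): decide whether a
 * cumulative ACK that covers a retransmitted segment could have been
 * triggered by that retransmission.
 *
 * Assumption for this sketch: an ACK that arrives less than one
 * minimum RTT after the retransmission left the host must be a
 * response to the original transmission.
 */
static bool
ack_is_for_original(uint64_t rexmit_time_us, uint64_t ack_time_us,
    uint64_t min_rtt_us)
{
	uint64_t elapsed_us;

	/* The ACK cannot be processed before the retransmission was sent. */
	assert(ack_time_us >= rexmit_time_us);

	/* Time between sending the retransmission and seeing the ACK. */
	elapsed_us = ack_time_us - rexmit_time_us;

	/* Too fast for a full round trip: the original was acked. */
	return (elapsed_us < min_rtt_us);
}
```

For example, with a 50 ms minimum RTT, an ACK seen 2 ms after the retransmission would be classified as acking the original send, while one seen 60 ms later would not.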

Test Plan

A first test can be done with a packetdrill script, attached below. A far better test
is to set up the igb bug and run traffic between a lab in Germany and one in the US. I did this and have a nice
clean BBlog of the fixes in action (with a very low retransmission rate as a result). It is available on request if you
are interested.

Here is the packetdrill script that Michael Tuexen created, with maybe a tweak from me :)

--ip_version=ipv4

0.0000`kldload -n tcp_rack`
+0.0000`kldload -n cc_newreno`
+0.0000`sysctl kern.timecounter.alloweddeviation=0`

+0.0000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0.0000 setsockopt(3, IPPROTO_TCP, TCP_LOG, [4], 4) = 0
+0.0000 setsockopt(3, IPPROTO_TCP, TCP_FUNCTION_BLK, {function_set_name="rack", pcbcnt=0}, 36) = 0

+0.0000 setsockopt(3, IPPROTO_TCP, TCP_CONGESTION, "newreno", 8) = 0
+0.0000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0.0000 bind(3, ..., ...) = 0
+0.0000 listen(3, 1) = 0
+0.0000 < S 0:0(0) win 65535 <mss 1000,sackOK,eol,eol>
+0.0000 > S. 0:0(0) ack 1 win 65535 <mss 1460,sackOK,eol,eol>
+0.0500 < . 1:1(0) ack 1 win 65535
+0.0000 accept(3, ..., ...) = 4
+0.0000 setsockopt(4, IPPROTO_TCP, TCP_LOG, [4], 4) = 0
+0.0000 close(3) = 0
// Trigger an initial RTT measurement of 50ms.
+0.0000 send(4, ..., 1000, 0) = 1000
+0.0000 > P. 1:1001(1000) ack 1 win 65535
+0.0500 < . 1:1(0) ack 1001 win 65535
// Send 4 full sized frames.
+0.5000 send(4, ..., 4000, 0) = 4000
+0.0000 > . 1001:2001(1000) ack 1 win 65535
+0.0000 > . 2001:3001(1000) ack 1 win 65535
+0.0000 > . 3001:4001(1000) ack 1 win 65535
+0.0000 > P. 4001:5001(1000) ack 1 win 65535
// After an RTT get an ack for the fourth, third, and second segment.
+0.0500 < . 1:1(0) ack 1001 win 65535 <nop,nop,sack 4001:5001>
+0.0000 < . 1:1(0) ack 1001 win 65535 <nop,nop,sack 3001:5001>
+0.0000 < . 1:1(0) ack 1001 win 65535 <nop,nop,sack 2001:5001>
// Retransmit the missing segment after the reordering window has passed.
+0.0125 > . 1001:2001(1000) ack 1 win 65535
+0.0500 < . 1:1(0) ack 5001 win 65535
+1.0000 < F. 1:1(0) ack 5001 win 65535
+0.0000 > . 5001:5001(0) ack 2 win 65535
+0.0000 close(4) = 0
+0.0000 > F. 5001:5001(0) ack 2 win 65535
+0.0500 < . 2:2(0) ack 5002 win 65535

Diff Detail

Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

rrs requested review of this revision. Nov 19 2025, 7:41 PM
This revision is now accepted and ready to land. Thu, Dec 11, 4:57 PM

How does this algorithm relate to the reordering tolerance of RACK defined by step 4 of RFC 8985?


Step 4 (page 14) of the RFC is implemented in rack, which means the reordering window stretches out as you
see dup-acks. This new algorithm is in addition to that. In the case made most noticeable by the LRO
bug mentioned above, no dup-acks have been received yet, so the reordering window is too small.
You then end up retransmitting that one segment, which knocks you out of slow start and puts you in recovery. Immediately
after doing that you get the ACK for the first send. This means you have left slow start for no good reason, because of the
too-small reordering window. What this new update does is recognize this condition, conclude that you should not have entered
recovery, and undo it so you go back to slow start. The later-arriving dup-acks (coming in behind this segment, for example)
will increase your reordering window as in step 4, so hopefully you won't end up here again. However, in testing between Florida
and Germany I found that this issue happened (with the LRO bug on) a lot more often than just initially, probably due to the large
bandwidth-delay product and largish buffers. Having this algorithm in addition to step 4 let us stay in slow start and leave it
only when we encountered a real loss event.
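
To make the "undo" step concrete, here is a minimal sketch of saving the congestion state on recovery entry and restoring it once the entry is judged false. The struct and function names are hypothetical and are not the identifiers used in tcp_rack, and the halving on entry is just a NewReno-style placeholder assumed for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified congestion state for illustration. */
struct cc_state {
	uint32_t cwnd;		/* congestion window, bytes */
	uint32_t ssthresh;	/* slow-start threshold, bytes */
	bool in_recovery;
};

/* Snapshot taken when recovery is entered. */
struct cc_saved {
	uint32_t cwnd;
	uint32_t ssthresh;
	bool valid;
};

/* Enter recovery, remembering the pre-recovery values first. */
static void
enter_recovery(struct cc_state *cc, struct cc_saved *saved)
{
	saved->cwnd = cc->cwnd;
	saved->ssthresh = cc->ssthresh;
	saved->valid = true;

	/* Placeholder NewReno-style reduction (an assumption). */
	cc->ssthresh = cc->cwnd / 2;
	cc->cwnd = cc->ssthresh;
	cc->in_recovery = true;
}

/*
 * Undo a recovery entry that turned out to be false. Returns true if
 * state was restored; with cwnd back below ssthresh the connection is
 * in slow start again.
 */
static bool
undo_false_recovery(struct cc_state *cc, struct cc_saved *saved)
{
	if (!cc->in_recovery || !saved->valid)
		return (false);

	cc->cwnd = saved->cwnd;
	cc->ssthresh = saved->ssthresh;
	cc->in_recovery = false;
	saved->valid = false;	/* a snapshot is used at most once */
	return (true);
}
```

The snapshot is invalidated after one use so that a later, genuine loss event cannot be undone by mistake.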