Page MenuHomeFreeBSD

tcp: Rack might retransmit forever.
ClosedPublic

Authored by rrs on Oct 26 2021, 1:46 PM.
Tags
None
Referenced Files
Unknown Object (File)
Wed, Oct 22, 2:42 AM
Unknown Object (File)
Sat, Oct 18, 8:01 PM
Unknown Object (File)
Sat, Oct 18, 6:50 PM
Unknown Object (File)
Tue, Oct 14, 12:01 AM
Unknown Object (File)
Tue, Oct 14, 12:00 AM
Unknown Object (File)
Tue, Oct 14, 12:00 AM
Unknown Object (File)
Mon, Oct 13, 10:30 AM
Unknown Object (File)
Fri, Oct 10, 1:38 AM
Subscribers

Details

Summary

If we get a Sacked peer with an MTU change we can retransmit forever if the
last bytes are sacked and the client goes away (think power off). Then we
never see the end condition and continually retransmit.

Test Plan

Create a pkt drill script that does the mtu change and sacks the last bytes
and then stops talking. We can then monitor to make sure we only retransmit one time.

This packet drill script shows the problem without the patch, it will not get a TLP but get a retransmit of the first packet
(which would in theory loop forever).

// Copyright (c) 2018 Randall Stewart
// All rights reserved.
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions
// are met:
// 1. Redistributions of source code must retain the above copyright
//    notice, this list of conditions and the following disclaimer.
// 2. Redistributions in binary form must reproduce the above copyright
//    notice, this list of conditions and the following disclaimer in the
//    documentation and/or other materials provided with the distribution.
//
// THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
// ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
// ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
// FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
// DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
// OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
// HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
// LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
// OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
// SUCH DAMAGE.
//

--ip_version=ipv4

 0.00 `kldload -n tcp_bbr tcp_rack`
+0.00 `sysctl -w net.inet.tcp.hostcache.purgenow=1`
+0.00 `sysctl -w net.inet.tcp.syncookies_only=0`
+0.00 `sysctl -w net.inet.tcp.syncookies=1`
+0.00 `sysctl -w net.inet.tcp.rfc1323=1`
+0.00 `sysctl -w net.inet.tcp.sack.enable=1`
+0.00 `sysctl -w net.inet.tcp.ecn.enable=2`
// Create a TCP endpoint in the ESTABLISHED state.
+0.00 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0.00 fcntl(3, F_GETFL) = 0x02 (flags O_RDWR)
+0.00 fcntl(3, F_SETFL, O_RDWR | O_NONBLOCK) = 0
+0.00 setsockopt(3, IPPROTO_TCP, TCP_LOG, [4], 4) = 0
+0.00 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)
+0.00 > S  0:0(0) win 65535 <mss 1460,nop,wscale 8,sackOK,TS val 100 ecr 0>
+0.10 < S. 0:0(0) ack 1 win 32767 <mss 1440,sackOK,TS val 400 ecr 100>
+0.00 >  . 1:1(0) ack 1 win 65535 <nop,nop,TS val 200 ecr 400>
// Change to rack_latest 
+0.00 setsockopt(3, IPPROTO_TCP, TCP_NODELAY, [1], 4) = 0
+0.00 setsockopt(3, IPPROTO_TCP, TCP_FUNCTION_BLK, {function_set_name="rack", pcbcnt=0}, 36) = 0
+0.00 send(3, ..., 1000, 0) = 1000
+0.00 > P. 1:1001(1000) ack 1 win 65535 <nop,nop,TS val 250 ecr 400>
+.10 <  . 1:1(0) ack 1001 win 32000 <nop, nop, TS val 500 ecr 250>
+0.00 send(3, ..., 1000, 0) = 1000
+0.00 > P. 1001:2001(1000) ack 1 win 65535 <nop,nop,TS val 300 ecr 500>
+.10 <  . 1:1(0) ack 2001 win 32000 <nop, nop, TS val 600 ecr 300>
+0.10 send(3, ..., 9700, 0) = 9700
*  > . 2001:3429(1428) ack 1 win 65535 <nop,nop,TS val 400 ecr 600>
*  > . 3429:4857(1428) ack 1 win 65535 <nop,nop,TS val 500 ecr 600>
*  > . 4857:6285(1428) ack 1 win 65535 <nop,nop,TS val 600 ecr 600>
*  > . 6285:7713(1428) ack 1 win 65535 <nop,nop,TS val 700 ecr 600>
*  > . 7713:9141(1428) ack 1 win 65535 <nop,nop,TS val 800 ecr 600>
*  > . 9141:10569(1428) ack 1 win 65535 <nop,nop,TS val 900 ecr 600>
*  > P. 10569:11701(1132) ack 1 win 65535 <nop,nop,TS val 1000 ecr 600>
+0.00 < [2001:3429(1428)] icmp unreachable frag_needed mtu 1200
+0.00 < . 1:1(0) ack 2001 win 32000 <nop, nop, TS val 700 ecr 300, nop, nop, sack 10569:11701> 
*  > . 2001:3149(1148) ack 1 win 65535 <nop,nop,TS val 1100 ecr 600>
*  > . 3149:3429(280) ack 1 win 65535 <nop,nop,TS val 1200 ecr 700>

*  > . 3429:4577(1148) ack 1 win 65535 <nop,nop,TS val 1300 ecr 700>
*  > . 4577:4857(280) ack 1 win 65535 <nop,nop,TS val 1400 ecr 700>
*  > . 4857:6005(1148) ack 1 win 65535 <nop,nop,TS val 1400 ecr 700>
*  > . 6005:6285(280) ack 1 win 65535 <nop,nop,TS val 1400 ecr 700>
*  > . 6285:7433(1148) ack 1 win 65535 <nop,nop,TS val 1500 ecr 700>
*  > . 7433:7713(280) ack 1 win 65535 <nop,nop,TS val 1500 ecr 700>
*  > . 7713:8861(1148) ack 1 win 65535 <nop,nop,TS val 1600 ecr 700>
*  > . 8861:9141(280) ack 1 win 65535 <nop,nop,TS val 1600 ecr 700>
*  > . 9141:10289(1148) ack 1 win 65535 <nop,nop,TS val 1700 ecr 700>
*  > . 10289:10569(280) ack 1 win 65535 <nop,nop,TS val 1900 ecr 700>
*  > . 10289:10569(280) ack 1 win 65535 <nop,nop,TS val 2000 ecr 700>
+.10 <  . 1:1(0) ack 11701 win 32000 <nop, nop, TS val 800 ecr 2000>
// Tear it down.
+0.00 close(3) = 0
+0.00 > F. 11701:11701(0) ack 1 win 65535 <nop,nop,TS val 2100 ecr 800>
+0.10 < F. 1:1(0) ack 11702 win 32767 <nop,nop,TS val 900 ecr 2100>
+0.00 > . 11702:11702(0) ack 2 win 65535 <nop,nop,TS val 2200 ecr 900>

Diff Detail

Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

rrs requested review of this revision.Oct 26 2021, 1:46 PM

Are you saying that you only retransmit once if the peer goes away? Could you share the packetdrill script?

Are you saying that you only retransmit once if the peer goes away? Could you share the packetdrill script?

Its posted above in the test plan section. Basically the "must_retransmit" flag is set with SACK enabled. And in
theory everything in flight needs to be retransmitted since the MTU changed. So we go through doing that
and expect that to happen. So while retransmitting (or maybe before retransmitting) you get a SACK
(probably because the last piece was the tail end of the SB and thus we sent a small enough piece
to fit in the smaller MTU without knowing it). So what we have is the last bytes SACKed and the
must_retransmit (without this fix) is still set, so we resend it over and over and over again if
we don't get ack's back from the peer (the peer goes silent). And that retransmit due to the
must_retran flag never stops.

What instead should be happening is the entire non-acked window *is* resent. But then a
TLP timer should be running and the must_retran flag cleared. Then the TLP timer should
fire followed by a TLP resend of the last chunk.

Without the patch using the pkt-drill script will fail because what is retransmitted is the
beginning (of the whole window of data). Of course when the script sees this it resets
and fails.

With the patch what will happen is we get a TLP, and then the pkt drill script acks all data
and closes things up.

I suspect this would be a good candidate script to go into your regression test packet drill scripts for
rack. Note you may want to clean the script up a bit (take out the BB tracing for example).

OK, I did some testing. I agree, this patch fixes the issue and improves the situation. Therefore I'm approving it.

However, I would expect that the RACK stack on response to the PTB message triggers the sending of TCP segments all containing 1148 bytes, not pairs of TCP segments containing 1148 bytes and 280 bytes of payload. If you think, improving that is a separate issue, go ahead and commit this.

The base stack behaves as I expect. This is illustrated by the following script (the ICMP message triggers the sending of all fragments,
therefore the incoming SACK segment is not relevant:

// Copyright (c) 2018 Randall Stewart
// All rights reserved.
//
// Redistribution and use in source and binary forms, with or without
// modification, are permitted provided that the following conditions
// are met:
// 1. Redistributions of source code must retain the above copyright
//    notice, this list of conditions and the following disclaimer.
// 2. Redistributions in binary form must reproduce the above copyright
//    notice, this list of conditions and the following disclaimer in the
//    documentation and/or other materials provided with the distribution.
//
// THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND
// ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
// IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
// ARE DISCLAIMED.  IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE
// FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
// DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
// OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
// HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
// LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
// OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
// SUCH DAMAGE.
//

--ip_version=ipv4

 0.00 `kldload -n tcp_bbr tcp_rack`
+0.00 `sysctl -w net.inet.tcp.hostcache.purgenow=1`
+0.00 `sysctl -w net.inet.tcp.syncookies_only=0`
+0.00 `sysctl -w net.inet.tcp.syncookies=1`
+0.00 `sysctl -w net.inet.tcp.rfc1323=1`
+0.00 `sysctl -w net.inet.tcp.sack.enable=1`
+0.00 `sysctl -w net.inet.tcp.ecn.enable=2`
+0.00 `sysctl -w kern.ipc.maxsockbuf=2097152`
// Create a TCP endpoint in the ESTABLISHED state.
+0.00 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0.00 fcntl(3, F_GETFL) = 0x02 (flags O_RDWR)
+0.00 fcntl(3, F_SETFL, O_RDWR | O_NONBLOCK) = 0
+0.00 setsockopt(3, IPPROTO_TCP, TCP_LOG, [4], 4) = 0
+0.00 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)
+0.00 > S      0:0(0)                  win 65535 <mss 1460,nop,wscale 6,sackOK,TS val 100 ecr 0>
+0.10 < S.     0:0(0)        ack     1 win 32767 <mss 1440,sackOK,TS val 400 ecr 100>
+0.00 >  .     1:1(0)        ack     1 win 65535 <nop,nop,TS val  200 ecr  400>
// Change to rack_latest 
+0.00 setsockopt(3, IPPROTO_TCP, TCP_NODELAY, [1], 4) = 0
+0.00 setsockopt(3, IPPROTO_TCP, TCP_FUNCTION_BLK, {function_set_name="freebsd", pcbcnt=0}, 36) = 0
+0.00 send(3, ..., 1000, 0) = 1000
+0.00 > P.     1:1001(1000)  ack     1 win 65535 <nop,nop,TS val  200 ecr  400>
+0.10 <  .     1:1(0)        ack  1001 win 32000 <nop,nop,TS val  500 ecr  200>
+0.00 send(3, ..., 1000, 0) = 1000
+0.00 > P.  1001:2001(1000)  ack     1 win 65535 <nop,nop,TS val  300 ecr  500>
+0.10 <  .     1:1(0)        ack  2001 win 32000 <nop,nop,TS val  600 ecr  300>
+0.10 send(3, ..., 9700, 0) = 9700
+0.00 >  .  2001:3429(1428)  ack     1 win 65535 <nop,nop,TS val  500 ecr  600>
+0.00 >  .  3429:4857(1428)  ack     1 win 65535 <nop,nop,TS val  500 ecr  600>
+0.00 >  .  4857:6285(1428)  ack     1 win 65535 <nop,nop,TS val  500 ecr  600>
+0.00 >  .  6285:7713(1428)  ack     1 win 65535 <nop,nop,TS val  500 ecr  600>
+0.00 >  .  7713:9141(1428)  ack     1 win 65535 <nop,nop,TS val  500 ecr  600>
+0.00 >  .  9141:10569(1428) ack     1 win 65535 <nop,nop,TS val  500 ecr  600>
+0.00 > P. 10569:11701(1132) ack     1 win 65535 <nop,nop,TS val  500 ecr  600>
+0.00 < [2001:3429(1428)] icmp unreachable frag_needed mtu 1200
+0.00 <  .     1:1(0)        ack  2001 win 32000 <nop,nop,TS val  700 ecr  500,nop,nop,sack 10569:11701>
+0.00 >  .  2001:3149(1148)  ack     1 win 65535 <nop,nop,TS val  500 ecr  600>
+0.00 >  .  3149:4297(1148)  ack     1 win 65535 <nop,nop,TS val  500 ecr  600>
+0.00 >  .  4297:5445(1148)  ack     1 win 65535 <nop,nop,TS val  500 ecr  600>
+0.00 >  .  5445:6593(1148)  ack     1 win 65535 <nop,nop,TS val  500 ecr  600>
+0.00 >  .  6593:7741(1148)  ack     1 win 65535 <nop,nop,TS val  500 ecr  600>
+0.00 >  .  7741:8889(1148)  ack     1 win 65535 <nop,nop,TS val  500 ecr  600>
+0.00 >  .  8889:10037(1148) ack     1 win 65535 <nop,nop,TS val  500 ecr  600>
+0.00 >  . 10037:11185(1148) ack     1 win 65535 <nop,nop,TS val  500 ecr  600>
+0.00 > P. 11185:11701(516)  ack     1 win 65535 <nop,nop,TS val  500 ecr  600>
+0.10 <  .     1:1(0)        ack 11701 win 32000 <nop,nop,TS val  800 ecr  500>
// Tear it down.
+0.00 close(3) = 0
+0.00 > F. 11701:11701(0)    ack     1 win 65535 <nop,nop,TS val  600 ecr  800>
+0.10 < F.     1:1(0)        ack 11702 win 32767 <nop,nop,TS val  900 ecr  600>
+0.00 >  . 11702:11702(0)    ack     2 win 65535 <nop,nop,TS val  700 ecr  900>
This revision is now accepted and ready to land.Oct 29 2021, 9:05 PM

The base stack does not maintain a sendmap. So what happens is you have
1448 byte segments .. you now must make them fit.. 1148.. so it splits each
sendmap entry into 2 segments...

If TSO was on, which it is not for packet drill, then we would have sent
a Big segment telling TSO to split it into 1448 byte segments, then
after the MTU change we would have sent single segments of the 1148 byte size.

Changing the sendmap behavior (for non-TSO) would mean doing a combination of all sendmap entries
into one big one.. which is far more work then I want to do in this commit.

It is something that might be worth doing, we can discuss this next monday when we chat.

This revision was automatically updated to reflect the committed changes.