If we get a Sacked peer with an MTU change we can retransmit forever if the
last bytes are sacked and the client goes away (think power off). Then we
never see the end condition and continually retransmit.
Details
- Reviewers
tuexen glebius - Group Reviewers
transport - Commits
- rG141a53cd58cd: tcp: Rack might retransmit forever.
Create a pkt drill script that does the mtu change and sacks the last bytes
and then stops talking. We can then monitor to make sure we only retransmit one time.
This packet drill script shows the problem without the patch, it will not get a TLP but get a retransmit of the first packet
(which would in theory loop forever).
// Copyright (c) 2018 Randall Stewart // All rights reserved. // // Redistribution and use in source and binary forms, with or without // modification, are permitted provided that the following conditions // are met: // 1. Redistributions of source code must retain the above copyright // notice, this list of conditions and the following disclaimer. // 2. Redistributions in binary form must reproduce the above copyright // notice, this list of conditions and the following disclaimer in the // documentation and/or other materials provided with the distribution. // // THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND // ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE // IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE // ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE // FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL // DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS // OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) // HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT // LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY // OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF // SUCH DAMAGE. // --ip_version=ipv4 0.00 `kldload -n tcp_bbr tcp_rack` +0.00 `sysctl -w net.inet.tcp.hostcache.purgenow=1` +0.00 `sysctl -w net.inet.tcp.syncookies_only=0` +0.00 `sysctl -w net.inet.tcp.syncookies=1` +0.00 `sysctl -w net.inet.tcp.rfc1323=1` +0.00 `sysctl -w net.inet.tcp.sack.enable=1` +0.00 `sysctl -w net.inet.tcp.ecn.enable=2` // Create a TCP endpoint in the ESTABLISHED state. +0.00 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 +0.00 fcntl(3, F_GETFL) = 0x02 (flags O_RDWR) +0.00 fcntl(3, F_SETFL, O_RDWR | O_NONBLOCK) = 0 +0.00 setsockopt(3, IPPROTO_TCP, TCP_LOG, [4], 4) = 0 +0.00 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress) +0.00 > S 0:0(0) win 65535 <mss 1460,nop,wscale 8,sackOK,TS val 100 ecr 0> +0.10 < S. 0:0(0) ack 1 win 32767 <mss 1440,sackOK,TS val 400 ecr 100> +0.00 > . 1:1(0) ack 1 win 65535 <nop,nop,TS val 200 ecr 400> // Change to rack_latest +0.00 setsockopt(3, IPPROTO_TCP, TCP_NODELAY, [1], 4) = 0 +0.00 setsockopt(3, IPPROTO_TCP, TCP_FUNCTION_BLK, {function_set_name="rack", pcbcnt=0}, 36) = 0 +0.00 send(3, ..., 1000, 0) = 1000 +0.00 > P. 1:1001(1000) ack 1 win 65535 <nop,nop,TS val 250 ecr 400> +.10 < . 1:1(0) ack 1001 win 32000 <nop, nop, TS val 500 ecr 250> +0.00 send(3, ..., 1000, 0) = 1000 +0.00 > P. 1001:2001(1000) ack 1 win 65535 <nop,nop,TS val 300 ecr 500> +.10 < . 1:1(0) ack 2001 win 32000 <nop, nop, TS val 600 ecr 300> +0.10 send(3, ..., 9700, 0) = 9700 * > . 2001:3429(1428) ack 1 win 65535 <nop,nop,TS val 400 ecr 600> * > . 3429:4857(1428) ack 1 win 65535 <nop,nop,TS val 500 ecr 600> * > . 4857:6285(1428) ack 1 win 65535 <nop,nop,TS val 600 ecr 600> * > . 6285:7713(1428) ack 1 win 65535 <nop,nop,TS val 700 ecr 600> * > . 7713:9141(1428) ack 1 win 65535 <nop,nop,TS val 800 ecr 600> * > . 9141:10569(1428) ack 1 win 65535 <nop,nop,TS val 900 ecr 600> * > P. 10569:11701(1132) ack 1 win 65535 <nop,nop,TS val 1000 ecr 600> +0.00 < [2001:3429(1428)] icmp unreachable frag_needed mtu 1200 +0.00 < . 1:1(0) ack 2001 win 32000 <nop, nop, TS val 700 ecr 300, nop, nop, sack 10569:11701> * > . 2001:3149(1148) ack 1 win 65535 <nop,nop,TS val 1100 ecr 600> * > . 3149:3429(280) ack 1 win 65535 <nop,nop,TS val 1200 ecr 700> * > . 3429:4577(1148) ack 1 win 65535 <nop,nop,TS val 1300 ecr 700> * > . 4577:4857(280) ack 1 win 65535 <nop,nop,TS val 1400 ecr 700> * > . 4857:6005(1148) ack 1 win 65535 <nop,nop,TS val 1400 ecr 700> * > . 6005:6285(280) ack 1 win 65535 <nop,nop,TS val 1400 ecr 700> * > . 6285:7433(1148) ack 1 win 65535 <nop,nop,TS val 1500 ecr 700> * > . 7433:7713(280) ack 1 win 65535 <nop,nop,TS val 1500 ecr 700> * > . 7713:8861(1148) ack 1 win 65535 <nop,nop,TS val 1600 ecr 700> * > . 8861:9141(280) ack 1 win 65535 <nop,nop,TS val 1600 ecr 700> * > . 9141:10289(1148) ack 1 win 65535 <nop,nop,TS val 1700 ecr 700> * > . 10289:10569(280) ack 1 win 65535 <nop,nop,TS val 1900 ecr 700> * > . 10289:10569(280) ack 1 win 65535 <nop,nop,TS val 2000 ecr 700> +.10 < . 1:1(0) ack 11701 win 32000 <nop, nop, TS val 800 ecr 2000> // Tear it down. +0.00 close(3) = 0 +0.00 > F. 11701:11701(0) ack 1 win 65535 <nop,nop,TS val 2100 ecr 800> +0.10 < F. 1:1(0) ack 11702 win 32767 <nop,nop,TS val 900 ecr 2100> +0.00 > . 11702:11702(0) ack 2 win 65535 <nop,nop,TS val 2200 ecr 900>
Diff Detail
- Lint
Lint Skipped - Unit
Tests Skipped
Event Timeline
Are you saying that you only retransmit once if the peer goes away? Could you share the packetdrill script?
Its posted above in the test plan section. Basically the "must_retransmit" flag is set with SACK enabled. And in
theory everything in flight needs to be retransmitted since the MTU changed. So we go through doing that
and expect that to happen. So while retransmitting (or maybe before retransmitting) you get a SACK
(probably because the last piece was the tail end of the SB and thus we sent a small enough piece
to fit in the smaller MTU without knowing it). So what we have is the last bytes SACKed and the
must_retransmit (without this fix) is still set, so we resend it over and over and over again if
we don't get ack's back from the peer (the peer goes silent). And that retransmit due to the
must_retran flag never stops.
What instead should be happening is the entire non-acked window *is* resent. But then a
TLP timer should be running and the must_retran flag cleared. Then the TLP timer should
fire followed by a TLP resend of the last chunk.
Without the patch using the pkt-drill script will fail because what is retransmitted is the
beginning (of the whole window of data). Of course when the script sees this it resets
and fails.
With the patch what will happen is we get a TLP, and then the pkt drill script acks all data
and closes things up.
I suspect this would be a good candidate script to go into your regression test packet drill scripts for
rack. Note you may want to clean the script up a bit (take out the BB tracing for example).
OK, I did some testing. I agree, this patch fixes the issue and improves the situation. Therefore I'm approving it.
However, I would expect that the RACK stack on response to the PTB message triggers the sending of TCP segments all containing 1148 bytes, not pairs of TCP segments containing 1148 bytes and 280 bytes of payload. If you think, improving that is a separate issue, go ahead and commit this.
The base stack behaves as I expect. This is illustrated by the following script (the ICMP message triggers the sending of all fragments,
therefore the incoming SACK segment is not relevant:
// Copyright (c) 2018 Randall Stewart // All rights reserved. // // Redistribution and use in source and binary forms, with or without // modification, are permitted provided that the following conditions // are met: // 1. Redistributions of source code must retain the above copyright // notice, this list of conditions and the following disclaimer. // 2. Redistributions in binary form must reproduce the above copyright // notice, this list of conditions and the following disclaimer in the // documentation and/or other materials provided with the distribution. // // THIS SOFTWARE IS PROVIDED BY THE AUTHOR AND CONTRIBUTORS ``AS IS'' AND // ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE // IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE // ARE DISCLAIMED. IN NO EVENT SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE // FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL // DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS // OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) // HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT // LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY // OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF // SUCH DAMAGE. // --ip_version=ipv4 0.00 `kldload -n tcp_bbr tcp_rack` +0.00 `sysctl -w net.inet.tcp.hostcache.purgenow=1` +0.00 `sysctl -w net.inet.tcp.syncookies_only=0` +0.00 `sysctl -w net.inet.tcp.syncookies=1` +0.00 `sysctl -w net.inet.tcp.rfc1323=1` +0.00 `sysctl -w net.inet.tcp.sack.enable=1` +0.00 `sysctl -w net.inet.tcp.ecn.enable=2` +0.00 `sysctl -w kern.ipc.maxsockbuf=2097152` // Create a TCP endpoint in the ESTABLISHED state. +0.00 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3 +0.00 fcntl(3, F_GETFL) = 0x02 (flags O_RDWR) +0.00 fcntl(3, F_SETFL, O_RDWR | O_NONBLOCK) = 0 +0.00 setsockopt(3, IPPROTO_TCP, TCP_LOG, [4], 4) = 0 +0.00 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress) +0.00 > S 0:0(0) win 65535 <mss 1460,nop,wscale 6,sackOK,TS val 100 ecr 0> +0.10 < S. 0:0(0) ack 1 win 32767 <mss 1440,sackOK,TS val 400 ecr 100> +0.00 > . 1:1(0) ack 1 win 65535 <nop,nop,TS val 200 ecr 400> // Change to rack_latest +0.00 setsockopt(3, IPPROTO_TCP, TCP_NODELAY, [1], 4) = 0 +0.00 setsockopt(3, IPPROTO_TCP, TCP_FUNCTION_BLK, {function_set_name="freebsd", pcbcnt=0}, 36) = 0 +0.00 send(3, ..., 1000, 0) = 1000 +0.00 > P. 1:1001(1000) ack 1 win 65535 <nop,nop,TS val 200 ecr 400> +0.10 < . 1:1(0) ack 1001 win 32000 <nop,nop,TS val 500 ecr 200> +0.00 send(3, ..., 1000, 0) = 1000 +0.00 > P. 1001:2001(1000) ack 1 win 65535 <nop,nop,TS val 300 ecr 500> +0.10 < . 1:1(0) ack 2001 win 32000 <nop,nop,TS val 600 ecr 300> +0.10 send(3, ..., 9700, 0) = 9700 +0.00 > . 2001:3429(1428) ack 1 win 65535 <nop,nop,TS val 500 ecr 600> +0.00 > . 3429:4857(1428) ack 1 win 65535 <nop,nop,TS val 500 ecr 600> +0.00 > . 4857:6285(1428) ack 1 win 65535 <nop,nop,TS val 500 ecr 600> +0.00 > . 6285:7713(1428) ack 1 win 65535 <nop,nop,TS val 500 ecr 600> +0.00 > . 7713:9141(1428) ack 1 win 65535 <nop,nop,TS val 500 ecr 600> +0.00 > . 9141:10569(1428) ack 1 win 65535 <nop,nop,TS val 500 ecr 600> +0.00 > P. 10569:11701(1132) ack 1 win 65535 <nop,nop,TS val 500 ecr 600> +0.00 < [2001:3429(1428)] icmp unreachable frag_needed mtu 1200 +0.00 < . 1:1(0) ack 2001 win 32000 <nop,nop,TS val 700 ecr 500,nop,nop,sack 10569:11701> +0.00 > . 2001:3149(1148) ack 1 win 65535 <nop,nop,TS val 500 ecr 600> +0.00 > . 3149:4297(1148) ack 1 win 65535 <nop,nop,TS val 500 ecr 600> +0.00 > . 4297:5445(1148) ack 1 win 65535 <nop,nop,TS val 500 ecr 600> +0.00 > . 5445:6593(1148) ack 1 win 65535 <nop,nop,TS val 500 ecr 600> +0.00 > . 6593:7741(1148) ack 1 win 65535 <nop,nop,TS val 500 ecr 600> +0.00 > . 7741:8889(1148) ack 1 win 65535 <nop,nop,TS val 500 ecr 600> +0.00 > . 8889:10037(1148) ack 1 win 65535 <nop,nop,TS val 500 ecr 600> +0.00 > . 10037:11185(1148) ack 1 win 65535 <nop,nop,TS val 500 ecr 600> +0.00 > P. 11185:11701(516) ack 1 win 65535 <nop,nop,TS val 500 ecr 600> +0.10 < . 1:1(0) ack 11701 win 32000 <nop,nop,TS val 800 ecr 500> // Tear it down. +0.00 close(3) = 0 +0.00 > F. 11701:11701(0) ack 1 win 65535 <nop,nop,TS val 600 ecr 800> +0.10 < F. 1:1(0) ack 11702 win 32767 <nop,nop,TS val 900 ecr 600> +0.00 > . 11702:11702(0) ack 2 win 65535 <nop,nop,TS val 700 ecr 900>
The base stack does not maintain a sendmap. So what happens is you have
1448 byte segments .. you now must make them fit.. 1148.. so it splits each
sendmap entry into 2 segments...
If TSO was on, which it is not for packet drill, then we would have sent
a Big segment telling TSO to split it into 1448 byte segments, then
after the MTU change we would have sent single segments of the 1148 byte size.
Changing the sendmap behavior (for non-TSO) would mean doing a combination of all sendmap entries
into one big one.. which is far more work then I want to do in this commit.
It is something that might be worth doing, we can discuss this next monday when we chat.