Page MenuHomeFreeBSD

TCP can be subject to Sack Attacks lets fix this issue.
ClosedPublic

Authored by rrs on Apr 22 2024, 2:18 PM.
Tags
None
Referenced Files
F105588704: D44903.diff
Wed, Dec 18, 12:37 AM
F105508032: D44903.id137896.diff
Tue, Dec 17, 1:46 AM
Unknown Object (File)
Fri, Dec 13, 2:40 AM
Unknown Object (File)
Mon, Dec 9, 1:19 AM
Unknown Object (File)
Wed, Dec 4, 7:30 AM
Unknown Object (File)
Tue, Dec 3, 12:37 AM
Unknown Object (File)
Fri, Nov 29, 11:12 AM
Unknown Object (File)
Tue, Nov 26, 4:49 PM
Subscribers

Details

Reviewers
rscheff
tuexen
Group Reviewers
transport
Summary

There is a type of attack that a TCP peer can launch on a connection. This is for sure in Rack or BBR and probably even the default stack if it uses lists in sack processing. The idea of the attack is that the attacker is driving you to look at 100's of sack blocks that only update 1 byte. So for example if you have 1 - 10,000 bytes outstanding the attacker sends in something like:

ACK 0 SACK(1-512) SACK(1024 - 1536), SACK(2048-2536), SACK(4096 - 4608), SACK(8192-8704)

This first sack looks fine but then the attacker sends

ACK 0 SACK(1-512) SACK(1025 - 1537), SACK(2049-2537), SACK(4097 - 4609), SACK(8193-8705)
ACK 0 SACK(1-512) SACK(1027 - 1539), SACK(2051-2539), SACK(4099 - 4611), SACK(8195-8707)
...

These blocks are making you hunt across your linked list and split things up so that you have an entry for every other byte. Has your list grows you spend more and more CPU running through the lists. The idea here is the attacker chooses entries as far apart as possible that make you run through the list. This example is small but in theory if the window is open to say 1Meg you could end up with 100's of thousands link list entries.

To combat this we introduce three things.

  1. when the peer requests a very small MSS we stop processing SACK's from them. This prevents a malicious peer from just using a small MSS to do the same thing.
  2. Any time we get a sack block, we use the sack-filter to remove sacks that are smaller than the smallest v4 mss (minus 40 for max TCP options) unless it ties up to snd_max (since that is legal). All other sacks in theory should be at least an MSS. If we get such an attacker that means we basically start skipping all but MSS sized Sacked blocks.
  3. The sack filter used to throw away data when its bounds were exceeded, instead now we increase its size to 15 and then throw away sack's if the filter gets over-run to prevent the malicious attacker from over-running the sack filter and thus we start to process things anyway.

The default stack will need to start using the sack-filter which we have talked about in past conference calls to take full advantage of the protections offered by it (and reduce cpu consumption when processing sacks).

After this set of changes is in rack can drop its SAD detection completely

Test Plan

We can test a lot of the sack_filter changes in user space (which I have yet to do so there may be updates coming to this). Then I will also develop some packet drill scripts which I will insert here which can be used in conjunction with BBlogs that things are working as we expect i.e. normal sacks work properly and illicit ones are ignored.

Ok I have validated now by compiling sack_fliter in user space and playing with all the corner cases and the general. It nicely blocks single byte sacks of most flavors unless we are compressing the map. I have refined the end tests too so you can't attack at the end using 1 byte at a time that spans to the sndmax.

Still have yet to build the pkt_drill script that will be here to test the low mss case i.e. to validate we just turn sack off. Will do that early next week.

Ok at this point here are 4 packet drill scripts 2 for server side and 2 for client side they validate that we properly ignore sacks on small MSS and properly handle sacks on normal sized MSS note that the validation of the sack filter was done in user space by compiling the sack_filter.c and running various scenarios to validate ignoring small sacks.
This completes testing so I think we are ready to go.

test_sf1.pkt

--ip_version=ipv4

+0.00 `sysctl -w net.inet.tcp.hostcache.purgenow=1`
+0.00 `sysctl -w net.inet.tcp.syncookies_only=0`
+0.00 `sysctl -w net.inet.tcp.syncookies=1`
+0.00 `sysctl -w net.inet.tcp.rfc1323=1`
+0.00 `sysctl -w net.inet.tcp.sack.enable=1`
+0.00 `sysctl -w net.inet.tcp.ecn.enable=2`
// Create a TCP endpoint in the ESTABLISHED state.
+0.00 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0.00 fcntl(3, F_GETFL) = 0x02 (flags O_RDWR)
+0.00 fcntl(3, F_SETFL, O_RDWR | O_NONBLOCK) = 0
+0.00 setsockopt(3, IPPROTO_TCP, TCP_LOG, [4], 4) = 0
+0.00 setsockopt(3, IPPROTO_TCP, TCP_FUNCTION_BLK, {function_set_name="rack", pcbcnt=0}, 36) = 0
+0.00 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)
+0.00 > S  0:0(0) win 65535 <mss 1460,nop,wscale 8,sackOK,TS val 100 ecr 0>
+0.10 < S. 0:0(0) ack 1 win 32767 <mss 1428,sackOK, nop, nop>
+0.00 >  . 1:1(0) ack 1 win 65535
+0.00 send(3, ..., 1428, 0) = 1428
+0.00 > P. 1:1429(1428) ack 1 win 65535
+.10 <  . 1:1(0) ack 1429 win 32000 
+0.10 send(3, ..., 11424, 0) = 11424
*  > . 1429:2857(1428) ack 1 win 65535
*  > . 2857:4285(1428) ack 1 win 65535 
+.00 <  . 1:1(0) ack 4285 win 32000 
*  > . 4285:5713(1428) ack 1 win 65535 
*  > . 5713:7141(1428) ack 1 win 65535 
+.00 <  . 1:1(0) ack 7141 win 32000 
*  > . 7141:8569(1428) ack 1 win 65535 
*  > . 8569:9997(1428) ack 1 win 65535 
+.00 <  . 1:1(0) ack 9997 win 32000 
*  > . 9997:11425(1428) ack 1 win 65535
*  > P. 11425:12853(1428) ack 1 win 65535
+.00 <  . 1:1(0) ack 12853 win 32000 
+0.10 send(3, ..., 11424, 0) = 11424
*  > . 12853:14281 (1428) ack 1 win 65535
*  > . 14281:15709 (1428) ack 1 win 65535
+.00 <  . 1:1(0) ack  15709 win 32000 
*  > . 15709:17137 (1428) ack 1 win 65535
*  > . 17137:18565 (1428) ack 1 win 65535
+.00 <  . 1:1(0) ack  18565 win 32000 
*  > . 18565:19993 (1428) ack 1 win 65535
*  > . 19993:21421 (1428) ack 1 win 65535
+.00 <  . 1:1(0) ack 21421 win 32000 
*  > . 21421:22849 (1428) ack 1 win 65535
*  > P. 22849:24277 (1428) ack 1 win 65535
+.00 <  . 1:1(0) ack 24277 win 32000 
+0.10 send(3, ..., 11424, 0) = 11424
*  > . 24277:25705 (1428) ack 1 win 65535
*  > . 25705:27133 (1428) ack 1 win 65535
*  > . 27133:28561 (1428) ack 1 win 65535
*  > . 28561:29989 (1428) ack 1 win 65535
*  > . 29989:31417 (1428) ack 1 win 65535
*  > . 31417:32845 (1428) ack 1 win 65535
*  > . 32845:34273 (1428) ack 1 win 65535
*  > P. 34273:35701 (1428) ack 1 win 65535
+0.0 <  . 1:1(0) ack 24277 win 32000 <nop, nop, sack 25705:27133>
+0.0 <  . 1:1(0) ack 24277 win 32000 <nop, nop, sack 28561:29989 25705:27133>
+0.0 <  . 1:1(0) ack 24277 win 32000 <nop, nop, sack 28561:31417 25705:27133>
+0.0 <  . 1:1(0) ack 24277 win 32000 <nop, nop, sack 28561:32845 25705:27133>
+0.0 <  . 1:1(0) ack 24277 win 32000 <nop, nop, sack 28561:34273 25705:27133>
+0.0 <  . 1:1(0) ack 24277 win 32000 <nop, nop, sack 28561:35701 25705:27133>
*  > . 24277:25705 (1428) ack 1 win 65535
*  > . 27133:28561 (1428) ack 1 win 65535
+0.0 <  . 1:1(0) ack 35701 win 32000 
+1.100 close(3) = 0
+0.00 > F. 35701:35701 (0) ack 1 win 65535
+0.00 < F. 1:1(0) ack 35702 win 40002
+0.00 > . 35702:35702 (0) ack 2 win 65535

test_sf2.pkt

--ip_version=ipv4

+0.00 `sysctl -w net.inet.tcp.hostcache.purgenow=1`
+0.00 `sysctl -w net.inet.tcp.syncookies_only=0`
+0.00 `sysctl -w net.inet.tcp.syncookies=1`
+0.00 `sysctl -w net.inet.tcp.rfc1323=1`
+0.00 `sysctl -w net.inet.tcp.sack.enable=1`
+0.00 `sysctl -w net.inet.tcp.ecn.enable=2`
// Create a TCP endpoint in the ESTABLISHED state.
+0.00 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0.00 fcntl(3, F_GETFL) = 0x02 (flags O_RDWR)
+0.00 fcntl(3, F_SETFL, O_RDWR | O_NONBLOCK) = 0
+0.00 setsockopt(3, IPPROTO_TCP, TCP_LOG, [4], 4) = 0
+0.00 setsockopt(3, IPPROTO_TCP, TCP_FUNCTION_BLK, {function_set_name="rack", pcbcnt=0}, 36) = 0
+0.00 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)
+0.00 > S  0:0(0) win 65535 <mss 1460,nop,wscale 8,sackOK,TS val 100 ecr 0>
+0.10 < S. 0:0(0) ack 1 win 32767 <mss 400,sackOK, nop, nop>
+0.00 >  . 1:1(0) ack 1 win 65535
+0.00 send(3, ..., 400, 0) = 400
+0.00 > P. 1:401(400) ack 1 win 65535
+.10 <  . 1:1(0) ack 401 win 32000 
+0.10 send(3, ..., 4000, 0) = 4000
*  > . 401:801(400) ack 1 win 65535
*  > . 801:1201(400) ack 1 win 65535 
+.02 <  . 1:1(0) ack 1201 win 32000 
*  > . 1201:1601(400) ack 1 win 65535 
*  > . 1601:2001(400) ack 1 win 65535 
+.02 <  . 1:1(0) ack 2001 win 32000 
*  > . 2001:2401(400) ack 1 win 65535 
*  > . 2401:2801(400) ack 1 win 65535 
+.02 <  . 1:1(0) ack 2801 win 32000 
*  > . 2801:3201(400) ack 1 win 65535
*  > . 3201:3601(400) ack 1 win 65535
+.02 <  . 1:1(0) ack 2801 win 32000 
*  > . 3601:4001(400) ack 1 win 65535
*  > P. 4001:4401(400) ack 1 win 65535
+.02 <  . 1:1(0) ack 4401 win 32000 
+0.10 send(3, ..., 4000, 0) = 4000
*  > . 4401:4801 (400) ack 1 win 65535
*  > . 4801:5201 (400) ack 1 win 65535
*  > . 5201:5601 (400) ack 1 win 65535
*  > . 5601:6001 (400) ack 1 win 65535
*  > . 6001:6401 (400) ack 1 win 65535
*  > . 6401:6801 (400) ack 1 win 65535
*  > . 6801:7201 (400) ack 1 win 65535
*  > . 7201:7601 (400) ack 1 win 65535
*  > . 7601:8001 (400) ack 1 win 65535
*  > P. 8001:8401 (400) ack 1 win 65535
+0.0 <  . 1:1(0) ack 4401 win 32000 <nop, nop, sack 4801:5201>
+0.0 <  . 1:1(0) ack 4401 win 32000 <nop, nop, sack 5601:6001 4801:5201>
+0.0 <  . 1:1(0) ack 4401 win 32000 <nop, nop, sack 5601:6401 4801:5201>
+0.0 <  . 1:1(0) ack 4401 win 32000 <nop, nop, sack 5601:6801 4801:5201>
+0.0 <  . 1:1(0) ack 4401 win 32000 <nop, nop, sack 5601:7201 4801:5201>
+0.0 <  . 1:1(0) ack 4401 win 32000 <nop, nop, sack 5601:8001 4801:5201>
+0.0 <  . 1:1(0) ack 4401 win 32000 <nop, nop, sack 5601:8401 4801:5201>
*  > . 4401:4801 (400) ack 1 win 65535
+0.0 <  . 1:1(0) ack 4801 win 32000 <nop, nop, sack 5601:8401>
+0.0 <  . 1:1(0) ack 4801 win 32000 <nop, nop, sack 5601:8401>
*  > . 4801:5201 (400) ack 1 win 65535
+0.0 <  . 1:1(0) ack 4801 win 32000 <nop, nop, sack 5601:8401>
*  > . 5201:5601 (400) ack 1 win 65535
+0.0 <  . 1:1(0) ack 8401 win 32000 
+1.100 close(3) = 0
+0.00 > F. 8401:8401 (0) ack 1 win 65535
+0.00 < F. 1:1(0) ack 8402 win 40002
+0.00 > . 8402:8402 (0) ack 2 win 65535

test_sf3.pkt

+0.00 `sysctl -w net.inet.tcp.sack.enable=1`
+0.00 `sysctl -w net.inet.tcp.hostcache.purgenow=1`
// Create a TCP endpoint in the ESTABLISHED state.
+0.00 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0.00 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0.00 bind(3, ..., ...) = 0
+0.00 listen(3, 1) = 0
// Establish the connection
+0.00 < S   0:0(0) win 65535 <mss 1428,sackOK,eol,eol>
+0.00 > S.  0:0(0) ack 1 win 65535 <mss 1460,sackOK,eol,eol>
+0.01 <  .  1:1(0) ack 1 win 65535
+0.0 accept(3, ..., ...) = 4
+0.00 close(3) = 0
+0.00 setsockopt(4, IPPROTO_TCP, TCP_LOG, [4], 4) = 0
+0.00 setsockopt(4, IPPROTO_TCP, TCP_FUNCTION_BLK, {function_set_name="rack", pcbcnt=0}, 36) = 0
+0.00 send(4, ..., 1428, 0) = 1428
+0.00 > P. 1:1429(1428) ack 1 win 65535
+.10 <  . 1:1(0) ack 1429 win 32000 
+0.10 send(4, ..., 11424, 0) = 11424
*  > . 1429:2857(1428) ack 1 win 65535
*  > . 2857:4285(1428) ack 1 win 65535 
+.00 <  . 1:1(0) ack 4285 win 32000 
*  > . 4285:5713(1428) ack 1 win 65535 
*  > . 5713:7141(1428) ack 1 win 65535 
+.00 <  . 1:1(0) ack 7141 win 32000 
*  > . 7141:8569(1428) ack 1 win 65535 
*  > . 8569:9997(1428) ack 1 win 65535 
+.00 <  . 1:1(0) ack 9997 win 32000 
*  > . 9997:11425(1428) ack 1 win 65535
*  > P. 11425:12853(1428) ack 1 win 65535
+.00 <  . 1:1(0) ack 12853 win 32000 
+0.10 send(4, ..., 11424, 0) = 11424
*  > . 12853:14281 (1428) ack 1 win 65535
*  > . 14281:15709 (1428) ack 1 win 65535
+.00 <  . 1:1(0) ack  15709 win 32000 
*  > . 15709:17137 (1428) ack 1 win 65535
*  > . 17137:18565 (1428) ack 1 win 65535
+.00 <  . 1:1(0) ack  18565 win 32000 
*  > . 18565:19993 (1428) ack 1 win 65535
*  > . 19993:21421 (1428) ack 1 win 65535
+.00 <  . 1:1(0) ack 21421 win 32000 
*  > . 21421:22849 (1428) ack 1 win 65535
*  > P. 22849:24277 (1428) ack 1 win 65535
+.00 <  . 1:1(0) ack 24277 win 32000 
+0.10 send(4, ..., 11424, 0) = 11424
*  > . 24277:25705 (1428) ack 1 win 65535
*  > . 25705:27133 (1428) ack 1 win 65535
*  > . 27133:28561 (1428) ack 1 win 65535
*  > . 28561:29989 (1428) ack 1 win 65535
*  > . 29989:31417 (1428) ack 1 win 65535
*  > . 31417:32845 (1428) ack 1 win 65535
*  > . 32845:34273 (1428) ack 1 win 65535
*  > P. 34273:35701 (1428) ack 1 win 65535
+0.0 <  . 1:1(0) ack 24277 win 32000 <nop, nop, sack 25705:27133>
+0.0 <  . 1:1(0) ack 24277 win 32000 <nop, nop, sack 28561:29989 25705:27133>
+0.0 <  . 1:1(0) ack 24277 win 32000 <nop, nop, sack 28561:31417 25705:27133>
+0.0 <  . 1:1(0) ack 24277 win 32000 <nop, nop, sack 28561:32845 25705:27133>
+0.0 <  . 1:1(0) ack 24277 win 32000 <nop, nop, sack 28561:34273 25705:27133>
+0.0 <  . 1:1(0) ack 24277 win 32000 <nop, nop, sack 28561:35701 25705:27133>
*  > . 24277:25705 (1428) ack 1 win 65535
*  > . 27133:28561 (1428) ack 1 win 65535
+0.0 <  . 1:1(0) ack 35701 win 32000 
+1.100 close(4) = 0
+0.00 > F. 35701:35701 (0) ack 1 win 65535
+0.00 < F. 1:1(0) ack 35702 win 40002
+0.00 > . 35702:35702 (0) ack 2 win 65535

test_sf4.pkt

--ip_version=ipv4
+0.00 `sysctl -w net.inet.tcp.sack.enable=1`
+0.00 `sysctl -w net.inet.tcp.hostcache.purgenow=1`
// Create a TCP endpoint in the ESTABLISHED state.
+0.00 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0.00 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
+0.00 bind(3, ..., ...) = 0
+0.00 listen(3, 1) = 0
// Establish the connection
+0.00 < S   0:0(0) win 65535 <mss 400,sackOK,eol,eol>
+0.00 > S.  0:0(0) ack 1 win 65535 <mss 1460,sackOK,eol,eol>
+0.01 <  .  1:1(0) ack 1 win 65535
+0.0 accept(3, ..., ...) = 4
+0.00 close(3) = 0
+0.00 setsockopt(4, IPPROTO_TCP, TCP_LOG, [4], 4) = 0
+0.00 setsockopt(4, IPPROTO_TCP, TCP_FUNCTION_BLK, {function_set_name="rack", pcbcnt=0}, 36) = 0
+0.00 send(4, ..., 400, 0) = 400
+0.00 > P. 1:401(400) ack 1 win 65535
+.10 <  . 1:1(0) ack 401 win 32000 
+0.10 send(4, ..., 4000, 0) = 4000
*  > . 401:801(400) ack 1 win 65535
*  > . 801:1201(400) ack 1 win 65535 
+.02 <  . 1:1(0) ack 1201 win 32000 
*  > . 1201:1601(400) ack 1 win 65535 
*  > . 1601:2001(400) ack 1 win 65535 
+.02 <  . 1:1(0) ack 2001 win 32000 
*  > . 2001:2401(400) ack 1 win 65535 
*  > . 2401:2801(400) ack 1 win 65535 
+.02 <  . 1:1(0) ack 2801 win 32000 
*  > . 2801:3201(400) ack 1 win 65535
*  > . 3201:3601(400) ack 1 win 65535
+.02 <  . 1:1(0) ack 2801 win 32000 
*  > . 3601:4001(400) ack 1 win 65535
*  > P. 4001:4401(400) ack 1 win 65535
+.02 <  . 1:1(0) ack 4401 win 32000 
+0.10 send(4, ..., 4000, 0) = 4000
*  > . 4401:4801 (400) ack 1 win 65535
*  > . 4801:5201 (400) ack 1 win 65535
*  > . 5201:5601 (400) ack 1 win 65535
*  > . 5601:6001 (400) ack 1 win 65535
*  > . 6001:6401 (400) ack 1 win 65535
*  > . 6401:6801 (400) ack 1 win 65535
*  > . 6801:7201 (400) ack 1 win 65535
*  > . 7201:7601 (400) ack 1 win 65535
*  > . 7601:8001 (400) ack 1 win 65535
*  > P. 8001:8401 (400) ack 1 win 65535
+0.0 <  . 1:1(0) ack 4401 win 32000 <nop, nop, sack 4801:5201>
+0.0 <  . 1:1(0) ack 4401 win 32000 <nop, nop, sack 5601:6001 4801:5201>
+0.0 <  . 1:1(0) ack 4401 win 32000 <nop, nop, sack 5601:6401 4801:5201>
+0.0 <  . 1:1(0) ack 4401 win 32000 <nop, nop, sack 5601:6801 4801:5201>
+0.0 <  . 1:1(0) ack 4401 win 32000 <nop, nop, sack 5601:7201 4801:5201>
+0.0 <  . 1:1(0) ack 4401 win 32000 <nop, nop, sack 5601:8001 4801:5201>
+0.0 <  . 1:1(0) ack 4401 win 32000 <nop, nop, sack 5601:8401 4801:5201>
*  > . 4401:4801 (400) ack 1 win 65535
+0.0 <  . 1:1(0) ack 4801 win 32000 <nop, nop, sack 5601:8401>
+0.0 <  . 1:1(0) ack 4801 win 32000 <nop, nop, sack 5601:8401>
*  > . 4801:5201 (400) ack 1 win 65535
+0.0 <  . 1:1(0) ack 4801 win 32000 <nop, nop, sack 5601:8401>
*  > . 5201:5601 (400) ack 1 win 65535
+0.0 <  . 1:1(0) ack 8401 win 32000 
+1.100 close(4) = 0
+0.00 > F. 8401:8401 (0) ack 1 win 65535
+0.00 < F. 1:1(0) ack 8402 win 40002
+0.00 > . 8402:8402 (0) ack 2 win 65535

Diff Detail

Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

rrs requested review of this revision.Apr 22 2024, 2:18 PM
rrs edited the test plan for this revision. (Show Details)

Updated the diff to fix some corner cases and to better handle compile and testing in user space.

One minor update my user space test did not show a new unused var that my kernel compile shows up. Fix that and
update the test information since I now validated the kernel version as well.

Add detailed comments so it is more clear what is going on inside sack_filter.h

One more update, need to use the tcp function call not t_maxseg - 12 to get the maxseg for filtering purposes

sys/netinet/tcp_stacks/sack_filter.c
162

Why do these checks every time? Wouldn't a "if (sf_bits) {if b->start.... if b->end}" prior to the loop suffice?

318–326

b->start should hopefully very infrequently be idential to b->end. But with the above suggested move of some sanity checks prior to the loop, this here becomes superfluous...

394

Shouldn't this never be the case? Above, blkboard is populated only by "in"coming SACKs when they were not in the filter?

406

Nit: why not do the post-increment in the previous line directly, in all these instances?

414

What benefit is there to copying out out a used slot into another slot, and putting the new block in the current slot? Most functions seem to traverse the full cachline - wouldn't it be more efficient to pick the next empty slot, and copy only the new block there? Or it the effect of this to maintain a sorted Last-Recently-Used ordering in the sf list?

sys/netinet/tcp_stacks/sack_filter.h
49
83
123

Need to define functions not accessible when compiling as stand-alone utility.

sys/netinet/tcp_stacks/sack_filter.c
162

Good point that would improve efficiency slightly :)

I will fix that too

318–326

I dont agree here. In theory the blocks in the sf may not be collapsed, if so then you could end up consuming the front of a block
in one sf element and then the rest of a block in another. You would not realize this condition until you exited the loop above.

394

I think you are correct but I want to mull on this one for a while :)

406

I never do post increment. It saves no CPU and it makes it far too easy
for a reader to miss an increment that happened. Its one of my Nits actually I
strongly dis-like post-increment for code clarity

414

The effect is to try to maintain the sorted order though I have found in playing in user space that this does not always work.

sys/netinet/tcp_stacks/sack_filter.h
123

Ahh yeah I just upgraded the statement to use that .. so it used to be
t->t_maxseg - 12

I forgot about the user space fun.. I will fix that :)

Address all of Richards comments including the one I had to think about it was correct :)

sys/netinet/tcp_stacks/sack_filter.c
106

Why do you need your own version of it? See comment below.

581

Why don't we use mss - 40 here? Using tcp_fixed_maxseg() does not accommodate if we sent a SACK or AccECN option.

tuexen edited the test plan for this revision. (Show Details)
sys/netinet/tcp_stacks/sack_filter.c
318–326

But the b block is never changed after the initial check, if either edge is below sf_acks; and you already return above the loop for this case now under all circumstances (including an empty sf). for the case these two end up being equal (e.g. incoming DSACK after the adjustment).

Just saying that b->start will always be not equal to b->end here, so the outer if clause is unnecessary; it was neccessary before, when sf is empty and the start==end check wasn't reached earlier...

414

Hmm... under "regular" circumstances you'd expect the new SACK blocks to arrive from left to right; But as soon as the ACKs get reordered (or the segments delivered out-of-order, you can end up in with a more or less random order in the SACK blocks.

However, I'd expect most recently seen SACK blocks will still more frequently repeated - so trying to keep the list ordered in most recently used to some extent may be valuable - if this is possible without constant sorting (e.g. trickelling down the least used slot logically - i'm sure there are fancy mechanisms doing something like this with per-slot counters, which could be as small as 2 bits;

For the "normal" use case, I would thing a simple cyclic buffer, overwriting the (cur+1) entry always, would work fine? And save the swapping of the entry. Incoming blocks should then be searched in decending order from cur, as that is the likliest place to find them...

Maybe if i understand the standalone checks better, the difference in the evolution can be checked quickly...

rrs marked an inline comment as done.May 4 2024, 12:59 PM
rrs added inline comments.
sys/netinet/tcp_stacks/sack_filter.c
106

Please note this is ifndef kernel.. so its when you are compiling the sack_filter.c into user space for testing. The value of mss is settable by command line option so it really does not matter about this function.

318–326

You are assuming that the sf blocks combined with the incoming block does not overlap and that may not be true. So we may have in the sack
filter:

[0] 100 - 200
[1] 210 - 310

You then get a sack of [190 - 220]

In such a case you will *not* catch it at the entry condition. And has you work through the filter you will
get the b->start/b>end modified to the point that b->start == b->end. Since we don't check it in the loop
anymore we must check it here. Of course you also want the protection that if the SACK is too small you
pretend its all consumed as well.

So this code must remain IMO. On top of that the 1 or 2 cache lines are hot here so having the checks cost
nothing.

581

Yes I thought about that and my initial take was to do tp->t_maxseg - 40.
However anything but the actual size we used (minus the normal options) is rare. Yes
you can send AccECN or SACK (especially DSACK) but that does not occur too often and
I thought it not worth doing after ruminating on it for a while.

We could go back to the -40 mechanism but I was just too concerned about the negotiated small MSS and
then shrinking as far as possible.

If you and Richard think a -40 is the way forward we could do that.. which in that case we can get rid of the
prototype above for user space.

rscheff added inline comments.
sys/netinet/tcp_stacks/sack_filter.c
318–326

Ah, now I see that. Agreed with all your points!

This revision is now accepted and ready to land.May 4 2024, 2:27 PM