
[tcp:] remove incorrect reset of sack variable in prr
ClosedPublic

Authored by rscheff on Mar 5 2021, 3:13 PM.

Details

Summary

Fix a sporadic panic (PR 253848) when using PRR, which can happen
when t_dupack again equals dupthresh while the old loss recovery
window has not yet been completely acknowledged.

Test Plan

A packetdrill script that exercises the necessary loss recovery
pattern to reproduce this issue is in progress.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 37606
Build 34495: arc lint + arc unit

Event Timeline

sys/netinet/tcp_sack.c:835 (On Diff #85203)

Is this change part of the fix, or just an unrelated cleanup (in which case it should be committed separately)?

I'm fine with the change now. Just would like to know if this fixes the issue observed by rrs@.

Yes. After this particular assignment was (effectively) commented out yesterday evening, to collect some detailed logs when this occurs, this change is the functional equivalent.

The initialization of sack_bytes_rexmit here is not really necessary. When there are two consecutive loss recovery windows, old SACK holes may carry over into the new recovery window (which is when the KASSERT would trigger).

While I still need to look at the side effects in detail, I expect the worst effect would be PRR being slightly off in the transmission timing of a few (new) data segments; from the instances known so far, this happens within 3 segments / ACKs.

This seems to fix the panics that we were seeing. I used to see a panic almost instantly after reboot, as soon as the server started taking traffic. At this point, it's been up for 30 minutes with no issues. Note that I'm testing the original patch.

This revision is now accepted and ready to land. Mar 5 2021, 4:30 PM

I just re-tested with the latest patch (the one line removal of tp->sackhint.sack_bytes_rexmit = 0) and it seems to also be working fine.