
Fix erroneous "DSACK" during loss recovery
ClosedPublic

Authored by rscheff on Mar 31 2020, 1:51 PM.

Details

Summary

On the receiver side, when new in-sequence data is
processed in tcp_reass(), a socket upcall will inform
the consumer of the newly arrived data. When more than
1/4 of the socket receive buffer is read by the
application, during that very same upcall, a window
update can be sent out.

If this window update happens during a SACK loss
recovery episode, and since the update to sackblks[] only
happens once tcp_reass() is finished, the window
update will contain old SACK information, while the ack
number is updated. Effectively, such a (partial) ACK looks
like a DSACK.
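
For context: RFC 2883 lets a sender treat a first SACK block that is
covered by the cumulative ACK as a DSACK. A rough sketch of that
sender-side interpretation (illustration only, not the actual FreeBSD
code; the helper name is made up):

#include <sys/types.h>
#include <netinet/tcp.h>	/* tcp_seq */
#include <netinet/tcp_seq.h>	/* SEQ_LT() / SEQ_LEQ() */

/*
 * Illustration only: a first SACK block lying entirely at or below the
 * cumulative ACK reports already-acknowledged data, i.e. a DSACK.  A
 * stale SACK block sent together with an advanced ACK number is thus
 * indistinguishable from a genuine DSACK to the peer.
 */
static int
looks_like_dsack(tcp_seq th_ack, tcp_seq sblk_start, tcp_seq sblk_end)
{

	return (SEQ_LT(sblk_start, sblk_end) && SEQ_LEQ(sblk_end, th_ack));
}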

Certain other stacks have begun to dynamically adjust their
dupthresh or RTO limits based on whether a receiver informs
the sender of unnecessary retransmissions. Such a TCP sender
consumes the DSACK information to dynamically adjust to an
environment with a high reordering extent.

Overall, the erroneous inclusion of stale SACK information
in a window update can lead to very delayed and inefficient
loss recovery for the remainder of an affected TCP session.

Thanks to Cheng for the significant effort to investigate
the root cause of this sporadic and hard to pinpoint
performance issue when sending data towards BSD.

Test Plan

Fill the tcp reassembly queue up to a utilization level of
more than 1/4, while maintaining one or two holes.

Issue a blocking read against the socket, before filling
the leftmost hole.

During the upcall and delivery of the newly arrived, in-
sequence data, a spurious window update with dated SACK
blocks can be elicited.

Diff Detail

Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 31048
Build 28751: arc lint + arc unit

Event Timeline

Questions:

  • Can this problem be observed with the normal socket API or do I need specific in-kernel consumers?
  • Can you provide the call chain which sends the SACK? Temporarily setting TF_DELACK around the tcp_reass() call looks like a hack. I would like to understand if it can't be made less hackish.
  • Didn't you miss BBR?

Questions:

  • Can this problem be observed with the normal socket API or do I need specific in-kernel consumers?

The call stack we have can be set up by registering a function with soupcall_set on SO_RCV which consumes the socket buffer directly; that will immediately trigger tcp_usr_rcvd(), where the window update happens. So a generic userspace application could run into this - internally, we have a kernel consumer hooked into the soupcall() consuming the available data in the socket buffer.

I was not able to get this recreated consistently with packetdrill though (the async read call doesn't seem to get triggered quite as early as a soupcall()-registered function).

  • Can you provide the call chain which sends the SACK? Temporarily setting TF_DELACK around the tcp_reass() call looks like a hack. I would like to understand if it can't be made less hackish.

o) register a socket upcall function on receive (e.g. SoRecvUpcall(), which consumes all available data from the socket buffer directly) with soupcall_set. Then, the call stack looks something like this:

tcp_input()
 tcp_do_segment()
  tcp_reass()
   sowakeup()
    SoRecvUpcall()  <- consuming socket buffer
     tcp_usr_rcvd()
      tcp_output()  <- sending out window update with old SACK blocks ("DSACK")
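
To make that trigger concrete, here is a minimal sketch of such an
in-kernel consumer, assuming the soupcall_set()/so_upcall_t interface
described in socket(9); the names my_rcv_upcall() and my_attach() are
hypothetical, and the actual draining of the receive buffer is omitted:

#include <sys/param.h>
#include <sys/socket.h>
#include <sys/socketvar.h>

/*
 * Hypothetical receive upcall (illustration only).  It runs from
 * sowakeup() while tcp_reass() is still on the stack; consuming the
 * receive buffer here ends up in tcp_usr_rcvd(), which may emit a
 * window update before sackblks[] has been refreshed.
 */
static int
my_rcv_upcall(struct socket *so, void *arg, int waitflag)
{

	/* Drain so->so_rcv directly (details omitted) ... */
	/* ... the pru_rcvd() path then leads into tcp_usr_rcvd(). */
	return (SU_OK);
}

static void
my_attach(struct socket *so)
{

	SOCKBUF_LOCK(&so->so_rcv);
	soupcall_set(so, SO_RCV, my_rcv_upcall, NULL);
	SOCKBUF_UNLOCK(&so->so_rcv);
}
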
  • Didn't you miss BBR?

I didn't put it there, since I'm not sure about the state of the BBR stack upstream - so, yes, deliberately. Shall I start to port fixes to BBR going forward?

The kernel socket API does not explicitly forbid a closed-loop read from the socket buffer (or an equivalent direct manipulation of the socket buffer metadata) while in the socket upcall.

socket(9) man page

jtl requested changes to this revision. Apr 9 2020, 3:03 PM
jtl added a subscriber: jtl.

As discussed on the transport call today, please change both the in-order and out-of-order data paths so we call sorwakeup_locked() (when necessary) after the SACK blocks are updated.

This revision now requires changes to proceed. Apr 9 2020, 3:03 PM
  • move all socket wakeup calls as the final step in tcp_input processing
  • patch up bbr and rack stacks minimally
In D24237#535638, @jtl wrote:

As discussed on the transport call today, please change both the in-order and out-of-order data paths so we call sorwakeup_locked() (when necessary) after the SACK blocks are updated.

Done. Moved both rcv and snd upcalls to the end of the tcp_input() processing, while doing a minimal change for the BBR and RACK stacks.

During yesterday's transport@ call, we discussed this further. The ask is to move these two flags into the TCPCB, so that BBR / RACK can also fully participate in this, and then to perform the socket upcall even later, after relinquishing all the INP locks. It was pointed out that the current in-kernel TCP consumers, where receive socket buffer processing triggers a TCP window update, most likely only work because the INP lock currently supports recursion, provided the same kernel thread acquires the lock multiple times. Therefore, a cleaner way of supporting similar functionality also when running TCP-over-UDP or SCTP-over-UDP is to perform the socket upcall only once all locks are released.
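
As a hedged sketch of the pattern under discussion (the flag and helper
names below are illustrative placeholders, not necessarily the
identifiers used in the diff): the receive/send paths merely record
that a wakeup is owed, and a single call site issues the wakeups once
segment processing is complete.

#include <sys/param.h>
#include <sys/socket.h>
#include <sys/socketvar.h>

#define	SKETCH_WAKESOR	0x1	/* receive wakeup owed (placeholder flag) */
#define	SKETCH_WAKESOW	0x2	/* send wakeup owed (placeholder flag) */

/*
 * Receive/send paths: record the wakeup instead of calling
 * sorwakeup_locked()/sowwakeup_locked() while tcp_reass() may still be
 * on the stack and sackblks[] is not yet up to date.
 */
static void
record_wakeup(uint32_t *pending, uint32_t which)
{

	*pending |= which;
}

/*
 * Single call site at the tail of input processing (and, per the
 * discussion above, eventually only after the INP locks are dropped).
 */
static void
flush_wakeups(uint32_t *pending, struct socket *so)
{

	if (*pending & SKETCH_WAKESOR)
		sorwakeup(so);
	if (*pending & SKETCH_WAKESOW)
		sowwakeup(so);
	*pending = 0;
}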

  • moving upcall flags to the tcpcb and the upcall to after the INP has been unlocked; also adding sowwakeup() support to BBR and RACK, and performing the potential upcalls in one single place. The timing of the SOW upcall may change some of the timings in RACK / BBR though, if new data becomes available while processing...
  • restore socket upcalls to BBR, to check for the panic
  • add check for connected socket and clean up diff
sys/netinet/tcp_stacks/bbr.c
11862

I am not sure this is right. When retval != 0 then it is possible (not always the case) that
the tp is no longer valid. It often is destroyed (or scheduled to be destroyed). Your check
in the wakeup function to make sure the socket is still connected may be good enough; however,
I need to think about that.

You are also missing changes in tcp_hpts.c

Let me find the lines for you.

Ok, you need to add the tcp_handle_wakeup() call in tcp_hpts.c after the calls to
tp->t_tfb->tfb_do_queued_segments().

In the function tcp_input_data(), this is after it finishes releasing the reference, I suspect.

For the other spot, it's in tcp_hptsi(); I think this should be near the tag skip_pacing,
though if you goto that tag, then likely the tcb is destroyed.. not sure how this affects
the function.

  • adding socket upcall to ctf_do_queued_segments()
sys/netinet/tcp_stacks/bbr.c
11862

If the tcpcb is destroyed, the socket would no longer be connected - which I check in the tcp_handle_wakeup() function using so [the socket], before trying to access tp [the tcpcb].

What speaks against adding the tcp_handle_wakeup() call in rack_bbr_common.c ctf_do_queued_segments(), outside the closed loop of handling the pending mbufs, rather than adding the socket handling in tcp_hpts.c?
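
A simplified sketch of the guard described above, assuming the check is
based on so_state (the actual tcp_handle_wakeup() body may differ):

#include <sys/param.h>
#include <sys/socket.h>
#include <sys/socketvar.h>

struct tcpcb;

/*
 * Simplified sketch: if the tcpcb got destroyed while queued segments
 * were being processed, the socket is no longer connected, so return
 * before dereferencing tp for the deferred wakeups.
 */
static void
tcp_handle_wakeup_sketch(struct tcpcb *tp, struct socket *so)
{

	if ((so->so_state & SS_ISCONNECTED) == 0)
		return;
	/* ... issue the deferred sorwakeup()/sowwakeup() as sketched above ... */
}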

@rrs - I believe you wanted to comment if the current layout of the relocated socket upcalls looks ok for the RACK stack.

rrs added inline comments.
sys/netinet/tcp_input.c
1479

How is it safe to reference the tp like this without a lock?

I think if you are wrong it just means you will do a wakeup (possibly) when
you should not.. so I guess that's ok.. note the same comment applies to
the so_state check since you don't hold a socket lock either.

This Diff has morphed over time quite dramatically; originally it was only intended to fix one socket upcall at an inopportune time (using a hack that simply prevented the transmission of a premature ACK), but it was then suggested to revamp all socket upcalls throughout the stack.

What side effects are to be expected when moving the lock release further down in the code path (past the final socket upcalls)?

  • INP lock should be retained until socket wakeup calls are completed.

In general, this looks good. I have a small nit in that it seems like it would be worth considering whether it would be better to add the flag to the socket itself somehow, so it could be synchronized by the socket lock. On the off chance someone did a socket operation which caused a wakeup while the TCP code was running, it seems possible that this might avoid a spurious wakeup. However, given the code in the src tree, I find it hard to reason through how this could occur.

This revision is now accepted and ready to land. Nov 5 2020, 2:12 PM

For some reason, the Diff did not get automatically updated with the commit...
Adding commit r367492 manually, and manually closing the Diff.