Page MenuHomeFreeBSD

Fix erroneous "DSACK" during loss recovery
ClosedPublic

Authored by rscheff on Mar 31 2020, 1:51 PM.

Details

Summary

On the receiver side, when new in-sequence data is
processed in tcp_reass(), a socket upcall will inform
the consumer of the newly arrived data. When more than
1/4 of the socket receive buffer is read by the
application, during that very same upcall, a window
update can be sent out.

If this window update happens during a SACK loss
recovery episode, and since the update to sackblks[] only
happens once tcp_reass() is finished, the window
update will contain old SACK information, while the ack
number is updated. Effective, such an (partial) ACK looks
like a DSACK.

Certain other stack have begun to dynamically adjust their
dupthresh or RTO limits based on information, if a receiver
informs the sender of unnecessary retransmissions. Such
a TCP sender is consuming the DSACK information to dynamically
adjust to an environment with high reordering extent.

Overall, the erroneous coupling of old SACK information
in a window update can lead to very delayed and inefficient
loss recovery for the remainder of an affected TCP session.

Thanks to Cheng for the significant effort to investigate
the root cause of this sporadic and hard to pinpoint
performance issue when sending data towards BSD.

Test Plan

Fill the tcp reassembly queue up to a utilization level of
more than 1/4, while maintaining one or two holes.

Issue a blocking read against the socket, before filling
the left most hole.

During the upcall and delivery of the newly arrived, in-
sequence data, a spurious window update with dated SACK
blocks can be elicited.

Diff Detail

Lint
Lint OK
Unit
No Unit Test Coverage
Build Status
Buildable 33518
Build 30786: arc lint + arc unit

Event Timeline

Questions:

  • Can this problem be observed with the normal socket API or do I need specific in kernel consumers?
  • Can you provide the call chain which sends the SACK? Temporary setting TF_DELACK around the tcp_reass() call looks like a hack. I would like to understand if it can't be made less hackish.
  • Didn't you mis BBR?

Questions:

  • Can this problem be observed with the normal socket API or do I need specific in kernel consumers?

The call stack we have can be set up with registering a function with soupcall_set on SO_RCV, which consumes the socket buffer directly; that will immediately trigger tcp_usr_rcvd(), where the window update happens. So, a generic userspace application could run into this - internally, we have a kernel consumer hooked into the soupcall() consuming the available data in the socket buffer.

I was not able to get this recreated consistently with packetdrill though (the async read call doesn't seem to get triggered quite as early as a soupcall() registered function.

  • Can you provide the call chain which sends the SACK? Temporary setting TF_DELACK around the tcp_reass() call looks like a hack. I would like to understand if it can't be made less hackish.

o) register a socket upcall function on receive (e.g. SoRecvUpcall(), that consumes all available data from the socket buffer directly) with soupcall_set. Then, the call stack looks something like this:

tcp_input()
tcp_do_segment()

tcp_reass()
 sowakeup()
 SoRecvUpcall() <- consuming socket buffer
  tcp_usr_rcvd()
   tcp_output()  <- sending out window update with old SACK blocks ("DSACK")
  • Didn't you mis BBR?

I didn't put it there, since I'm not sure about the state of the BBR stack's state upstream - so, yes, deliberately. Shall I start to port fixes to BBR going forward?

The kernel socket api does not explicitly forbid the use of a close-loop read from the socket buffer (or an equivalent direct manipulation of the socket buffer metadata), while in the socket upcall.

socket(9) man page

jtl requested changes to this revision.Apr 9 2020, 3:03 PM
jtl added a subscriber: jtl.

As discussed on the transport call today, please change both the in order and out of order data path so we call sorwakeup_locked() (when necessary) after the SACK blocks are updated.

This revision now requires changes to proceed.Apr 9 2020, 3:03 PM
  • move all socket wakeup calls as the final step in tcp_input processing
  • patch up bbr and rack stacks minimally
In D24237#535638, @jtl wrote:

As discussed on the transport call today, please change both the in order and out of order data path so we call sorwakeup_locked() (when necessary) after the SACK blocks are updated.

Done. Moved both rcv and snd upcalls to the end of the tcp_input() processing, while doing a minimal change for the BBR and RACK stacks.

During yesterdays transport@ call, we discussed this further. The ask is, to move these two flags into TCPCB, so that BBR / RACK can also fully participate in this, and then to have the socket upcall even further back, after relinquishing all the INP locks. It was pointed out, that the current in-kernel TCP consumers, where receive socket buffer processing is triggering a TCP window update, most likly only work, because the INP lock currently supports recursiveness, provided the same kernel thread aquires the lock multiple times. Therefore, a cleaner way of supporting similar functionality also when running TCP-over-UDP oder SCTP-over-UDP is to perform the socket upcall really only once all locks are released.

  • moving upcall flags to tcpcb and upcall to past INP has been unlocked. Also adding sowwakeup() support to BBR and RACK, and performing the potential upcalls only in one single place. Timing of SOW upcall may change some of the timings in RACK / BBR though, if new data becomes available while processing...
  • restore socket upcalls to BBR, to check for the panic
  • add check for connected socket and clean up diff
sys/netinet/tcp_stacks/bbr.c
11747

I am not sure this is right. When retval != 0 then it is possible (not always the case) that
the tp is no longer valid. It often is destroyed (or scheduled to be destroyed). Your check
in the wakeup function to make sure the socket is still connected may be good enough however
I need to think about that.

You are also missing changes in tcp_hpts.c

Let me find the lines for you.

Ok you need to add the tcp_handle_wakeup() in tcp_hpts.c after the calls to
tp->t_tfb->tfb_do_queued_segments()

in the function tcp_input_data() this is after it finishes releasing the reference I suspect.

For the other spot its in tcp_hptsi(), I think this should be near the tag skip_pacing
though if you goto that tag, then likely the tcb is destroyed.. not sure how this effects
the function.

  • adding socket upcall to ctf_do_queued_segments()
sys/netinet/tcp_stacks/bbr.c
11747

If tcpcb is destroyed, the socket would no longer be connected - which I check in the tcp_handle_wakeup() function using the so [socket], before trying to access tp [tcpcb].

What speaks against to adding the tcp_handle_wakeup() call in rack_bbr_common.c ctf_do_queued_segments() outside the close loop of handling the pending mbufs, rather than adding the socket handling in tcp_hpts.c?

@rrs - I believe you wanted to comment if the current layout of the relocated socket upcalls looks ok for the RACK stack.

rrs added inline comments.
sys/netinet/tcp_input.c
1480

How is it safe to reference the tp like this without a lock?

I think if you are wrong it just means you will do a wakeup (possibly) when
you should not.. so I guess thats ok.. note the same comment applies to
the so_state check since you don't hold a socket lock either.

This Diff has morphed over time quite dramatically; Originally only intended to fix one socket upcall at an inopportune time (using a hack that simply prevented the transmission of a premature ACK), it was suggested to revamp all socket upcalls throughout the stack.

What side effects are to expect when moving the lock release further down in the codepath (past the final socket upcalls)?

  • INP lock should be retained until socket wakeup calls are completed.

In general, this looks good. I have a small nit in that it seems like it would be worth considering whether it would be better to add the flag to the socket itself somehow so it could be synchronized by the socket lock. On the offchance someone did a socket operation which caused a wakeup while the TCP code was running, it seems possible that this might avoid a spurious wakeup. However, given the code in the src tree, I find it hard to reason through how this could occur.

This revision is now accepted and ready to land.Thu, Nov 5, 2:12 PM

For some reason, the Diff did not get automatically updated with the commit...
Adding commit r367492 manually, and manually closing Diff.