
[tcp] Keep socket buffer locked until upcall
ClosedPublic

Authored by rscheff on Apr 10 2021, 9:15 AM.

Details

Summary

r367492 would unlock the socket buffer before eventually calling the upcall.
This leads to a problematic interaction with the NFS kernel server/client
components (MP threads), which may access the socket buffer while its state
is not yet correctly updated.

Diff Detail

Repository
rS FreeBSD src repository - subversion

Event Timeline

Since the kernel RPC (and, therefore kernel NFS) only uses
receive socket upcalls and not send ones, I can't see how
this partial revert will have an effect on NFS.

  • keep socket buffer locked until upcall
rscheff retitled this revision from Back out central sowwakeup() upcall to [tcp] Keep socket buffer locked until upcall. Apr 10 2021, 6:27 PM
rscheff edited the summary of this revision.

I wonder (especially in RACK) if the so_snd upcall shouldn't actually happen while TCP processing is still ongoing, without holding the lock on so_snd until after all TCP processing is finished...

(Remember, the so_snd part only got added during discussions, more for symmetry than for technical reasons. The receive upcall should really only happen after all TCP input processing is done, to address the interaction with SACK that made it appear as if erroneous DSACKs were sent.)

While the so_rcv upcall has this known issue (a window update during it would lead to the erroneous DSACK), no similar issues are known for so_snd; rather, the inline upcall may provide additional data on which an ACK can be piggybacked (reducing the number of packets and also possibly making processing a little more efficient).

I have no knowledge of the uses of send upcall in the
kernel, so I cannot comment on that.

This patch *might* fix the problem I see (I can test that). We will not know for up to two weeks whether reversion of r367492 fixes the otis@ NFS client hangs, since it can take a week to happen.

Step 1 - Wait for otis@ to email whether or not the revert fixes the problem.

Step 2 - After this, the question will be "can he test this patch?".  He'll have to answer that.

I also have difficulty understanding why moving the
upcalls affects DSACKs?
Looking at the server side kernel RPC code, it appears
to be, at most, a few hundred machine instructions,
with no msleep()s in it.
The only delay might be when it acquires another
mutex, and that lock should not be heavily contended.
--> Does a microsecond delay in the upcall really cause any problems?

Finally, I think changing the upcalls needs to be
looked at by the maintainers of the other uses of
them in /usr/src. This may be no small task.

rscheff edited the summary of this revision.
  • keep so_rcv buffer locked until central upcall
  • remove the so_snd central upcall

The problem occurs with in-kernel TCP users that manipulate the contents of the socket buffer right away during the upcall (e.g. fetch new data from so_rcv), before the TCP processing is complete.

While this is rare in the HEAD in-kernel NFS server/client (although a race between the NFS worker thread and the TCP/networking thread could conceivably cause it too), there are other in-kernel TCP user applications in existence which trigger this consistently. These use soreceive() more or less directly in the registered upcall handler.

What can happen during SACK loss recovery is that, after a partial ACK (some new data becomes available in-sequence after TCP reassembly) and while SACK loss recovery is still ongoing, the consumption of that data by the application during the upcall changes the TCP receive window. This causes tcp_output() (triggered by soreceive(), all still in the same kernel thread) to send out a window update while the state of the tcpcb is not fully updated (yes, the same thread will hold two locks on the tcpcb during this).

In effect, the "window update" will send out stale SACK data, indicating to a Linux client that the just-successful retransmission was not necessary. Linux clients will thereby lengthen their reordering interval / dupack threshold; in extreme cases, the necessary retransmission for a loss may only happen many tens of seconds later, as each retransmission triggers this erroneous DSACK, further lengthening the recovery time.

The alternative to not doing an immediate socket upcall after TCP reassembly would be major surgery on the tcp_input() processing, restructuring it in such a way that ACK processing is done prior to inbound data processing; a reversal of how things have always worked.

sys/netinet/tcp_input.c
1895

This causes a double-lock of so_snd later in the processing. Need to protect it by checking the flag first. (381)

I have no knowledge of the uses of send upcall in the
kernel, so I cannot comment on that.

This patch *might* fix the problem I see (I can test
that). We will not know if reversion of r367492 fixes
the otis@ NFS client hangs for up to two weeks, since
it can take a week to happen.

Step 1 - Wait for otis@ to email whether or not the revert fixes the problem.

The revert seems to fix the problem. It's now 4 days without interruption. Before, the exact moment of interruption was unpredictable.

Step 2 - After this, the question will be "can he test this patch?".  He'll have to answer that.

Is it desirable to also test this diff? If yes, I can build another boot environment from stable/13 with this diff applied.

If I had known about this work before, we could have had it tested already :-)

In D29690#668141, @otis wrote:

Is it desirable to also test this diff? If yes, I can build another boot environment from stable/13 with this diff applied.

If I had known about this work before, we could have had it tested already :-)

Yes, please - since I do not have an environment where these NFS hangs can be replicated.

At the time Rick suggested the revert, he and I had some additional conversations. This diff (D29690) tries to accommodate the needs of the TCP stack (upcall only after all incoming-packet processing is finalized) but keeps the socket buffer locked until that time, so that no race condition can occur and confuse the RPC processing. The earlier patch released the socket buffer lock in between (to allow the consumer thread to do work), but that may be the reason for the unresponsiveness reported by the NFS client.

Well, this is an alternative to reverting r367492.
It would be nice to make sure that reverting
r367492 does resolve the problem before
switching to this patch.

I recall that it took a week for your hang to
recur on one occasion. This suggests at least
one more week of testing revert of r367492
is desirable.

Of course, if you have two servers that you
can test concurrently, then that would be nice.

I get a panic "_mtx_lock_sleep: recursed on non-recursive mutex"
when socantrcvmore() does SOCKBUF_LOCK() with this patch.

It was only on-screen, but here's the stack trace:
mtx_lock_sleep
mtx_lock_flags
socantrcvmore
tcp_do_segment
tcp_input
ip_input
netisr_dispatch_src
ether_demux
ether_nh_input
netisr_dispatch_src
ether_input
tcp_flush_out_le
tcp_lro_flush
tcp_lro_rx_done
tcp_lro_flush_all
iflib_rxeof
_task_fn_rx
gtaskqueue_run_locked
gtaskqueue_thread_loop
fork_exit
fork_trampoline

  • deal with pending upcall on FIN correctly

I get a panic "_mtx_lock_sleep: recursed on non-recursive mutex"
when socantrcvmore() does SOCKBUF_LOCK() with this patch.

It was only on-screen, but here's the stack trace:
mtx_lock_sleep
mtx_lock_flags
socantrcvmore
tcp_do_segment

Thanks - an oversight in the FIN handling, now that there may be a pending upcall. I need to call the correct socantrcvmore() or socantrcvmore_locked(), depending on the state of our pending upcall flag (line 3215). Sorry.

  • also do this for rack and bbr

This variant has the following LOR in it:
lock order reversal:
1st 0xfffff80005f378a8 so_rcv (so_rcv, sleep mutex) @ /usr/src/sys/netinet/tcp_input.c:3158
2nd 0xfffff80005f379c0 so_snd (so_snd, sleep mutex) @ /usr/src/sys/netinet/tcp_output.c:352
lock order so_snd -> so_rcv established at:
#0 0xffffffff80c7c08d at witness_checkorder+0x46d
#1 0xffffffff80be9e44 at __mtx_lock_flags+0x94
#2 0xffffffff80cb0eee at soreserve+0x5e
#3 0xffffffff80e3d09b at udp6_attach+0x7b
#4 0xffffffff80cb318d at socreate+0x1ad
#5 0xffffffff80cbc1d8 at kern_socket+0xb8
#6 0xffffffff810bd30e at amd64_syscall+0x12e
#7 0xffffffff810907de at fast_syscall_common+0xf8
lock order so_rcv -> so_snd attempted at:

#0 0xffffffff80c7c9ec at witness_checkorder+0xdcc
#1 0xffffffff80be9e44 at __mtx_lock_flags+0x94
#2 0xffffffff80de7336 at tcp_output+0x3c6
#3 0xffffffff80ddf4ca at tcp_do_segment+0x2baa
#4 0xffffffff80ddb9f3 at tcp_input+0xb83
#5 0xffffffff80dcd5b4 at ip_input+0x194
#6 0xffffffff80d53b41 at netisr_dispatch_src+0xb1
#7 0xffffffff80d365be at ether_demux+0x17e
#8 0xffffffff80d37c4c at ether_nh_input+0x40c
#9 0xffffffff80d53b41 at netisr_dispatch_src+0xb1
#10 0xffffffff80d36ab1 at ether_input+0xa1
#11 0xffffffff8091cbe2 at re_rxeof+0x612
#12 0xffffffff8091a3d9 at re_intr_msi+0xc9
#13 0xffffffff80bcae49 at ithread_loop+0x279
#14 0xffffffff80bc7900 at fork_exit+0x80
#15 0xffffffff81090f3e at fork_trampoline+0xe

I have no idea if it can result in deadlock.

I have just started testing it and will report
how it goes.

Got this crash while trying to "git", not when
actually testing NFS, although the nfsd was running
(but with no mounts against it):

panic: mutex so_rcv not owned at /usr/src/sys/netinet/tcp_input.c:1482
__mtx_assert()
tcp_handle_wakeup()
tcp_do_segment()
tcp_input()
ip_input()
netisr_dispatch_src()
ether_demux()
ether_nh_input()
netisr_dispatch_src()
ether_input()
re_rxeof()
re_intr_msi()
ithread_loop()
fork_exit()
fork_trampoline()

  • remove unlock from tcp_reass() and more logging
  • address lock order reversal by doing the upcall before tcp_output() calls

With this one I get:
tcp_do_segment#3292: WAKESOR left over
during mount on the client end.

Since my kernel is not right up to date,
the line numbers will be different. Here's
the code snippet for line 3292.

		/*
		 * In FIN_WAIT_2 state enter the TIME_WAIT state,
		 * starting the time-wait timer, turning off the other
		 * standard timers.
		 */
		case TCPS_FIN_WAIT_2:
			tcp_twstart(tp);
			if ((tp != NULL) && (tp->t_flags & TF_WAKESOR))
				log(2, "%s#%d: WAKESOR left over\n",
				    __func__, __LINE__);
			return;
		}

Thanks!

I did observe these leftovers locally as well, but in both instances there was a TCP state transition to CLOSING / FIN_WAIT. Do you see any odd timing now, e.g. the server or client taking longer for a reconnect when the L2 link to the client interface is severed and restored?

I still need to check the full codepath for where an upcall may be sensible without sending stale tcp data. These codepaths are all mainly in reorder/loss recovery, though...

There may be some instances of lock order reversal - if there is bidirectional traffic going on (not very likely, but one could do a 50/50 read/write IO load across an NFS mount, and approximate this reasonably well).

(And to answer an earlier question: I did see my ssh session get stuck on a lock order reversal instance one time, so getting these correct is necessary.)

I only see the "left over" message on the NFS client
during mounts and umounts and the timing/completion
of these do not seem to be affected.

The problem I do see is on the NFS server end:

  • If I network partition the NFS client (the FreeBSD one) from the NFS server for long enough that the client attempts a new connection (a couple of minutes, indicated by "netstat -a" showing SYN_SENT on the client end) and then heal the network partition...

The client does establish a working new connection.

However, if I do "netstat -a" on the NFS server...
--> If the server has this patch, the old TCP connection remains in ESTABLISHED state *forever*. (I waited almost 30 minutes after healing and it was still ESTABLISHED.)

But, if I do the same test with the NFS server running the same kernel except with r367492 reverted instead of this patch...
--> The old TCP connection goes to CLOSED almost immediately after the network partition is healed and disappears from "netstat -a" shortly thereafter.
--> My guess is that this patch fails to do the socket upcall that triggers the soshutdown(..SHUT_WR) call in the krpc.

  • reduce so_rcv lock time as much as possible
sys/netinet/tcp_input.c
2636

perhaps getting rid of these could reduce the diff?

This diff is currently a WIP, with extra logging scattered around and not all remnants of where tcp_handle_wakeup() was called removed. The reason is that I lack an environment in which the problem around the NFS server becoming unresponsive, or entering a RST fight, can be replicated.


As I was suffering some weird nfsd behavior recently (you know, the hung Linux client), I can arrange testing. I will discuss with my coworkers whether we can arrange access to the affected NFS server. My wild guess is that we can.

Thank you. Ideally, you should obviously no longer experience any hangs, but the dmesg should also be void of any entries like "WAKESOR left over" or "handling WAKESOR finally".

  • do not bail out of upcall for disconnected socket
  • track each lingering wakeup call

My testing went well this time.

Everything seems to work as I'd expect it and I do see the
"tcp_handle_wakeup: socket is no longer connected, so_rcv is locked. called from: 0xffffffff80da4c16S, set at 1"
log messages as you predicted.

No panic()s nor LORs this time.

Of course, as noted, I cannot reproduce the hangs that otis@ sees.

rick

  • remove unlock from tcp_reass() and more logging

A box with this diff applied now runs 7 days in a row without an interruption.

  • rebase to main
  • remove WAKESOW after major RACK update again
This revision is now accepted and ready to land. May 20 2021, 2:16 PM