
when CLSET_TIMEOUT is used, set timeouts for sosend() and soconnect() in the krpc
Closed · Public

Authored by rmacklem on Jul 16 2018, 8:54 PM.
Details

Summary

When testing the NFSv4.1 client for pNFS, it can take a long time for the client to fail an RPC
to a DS that has failed. The usual place where there is a long delay is in the code waiting for
a TCP connection to complete.
I also observed at least one case where there was a long delay in sosend() on the TCP socket.

This patch sets SO_SNDTIMEO when a timeout is set via CLSET_TIMEOUT, and the same value is also
used as a timeout on the msleep() that waits for a TCP connect to complete. (Using SO_SNDTIMEO avoids
a change to the krpc's KPI, so that this patch can be MFC'd easily.)

Since CLSET_TIMEOUT is not currently used by any consumer of the krpc in the source tree, this
patch should have no effect on extant uses (NFS and NLM) of the krpc.
After this patch is committed, it would be used by the NFS code specifically for connections to DSs.
(DS refers to a pNFS Data Server.)
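
For illustration, here is a minimal sketch of applying such a timeout as SO_SNDTIMEO through the
in-kernel sosetopt(9) interface. The rc_timeout field and the surrounding code are assumptions made
for this sketch, not the committed diff:

    /*
     * Illustrative sketch only: apply the timeout stored by CLSET_TIMEOUT
     * as the socket's send timeout.  "rc->rc_timeout" and "so" stand in
     * for the reconnect-layer client fields.
     */
    struct sockopt sopt;
    struct timeval tv;

    tv = rc->rc_timeout;
    bzero(&sopt, sizeof(sopt));
    sopt.sopt_dir = SOPT_SET;
    sopt.sopt_level = SOL_SOCKET;
    sopt.sopt_name = SO_SNDTIMEO;
    sopt.sopt_val = &tv;
    sopt.sopt_valsize = sizeof(tv);
    sopt.sopt_td = NULL;
    error = sosetopt(so, &sopt);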

Test Plan

I have been testing this in the pNFS code for some time and I can get pretty
consistent timeouts for RPCs attempted against failed DSs. The RPC will
fail in about twice the timeout value (because there is a failure plus an
additional retry).

My testing usually consists of disabling the network for the DS via ifconfig.
I have not been able to test a failure across a WAN interconnect, so I don't
know if that will fail as quickly, but at least the MDS->DS connections will
normally be LAN connections.

Diff Detail

Repository
rS FreeBSD src repository - subversion

Event Timeline

rpc/clnt_rc.c:178

So you disallow timeouts shorter than a second?

I added an inline comment which said, basically, that I didn't think a timeout of a fraction
of a second, or a timeout with finer than one-second resolution, would be useful.
(My assumption is that normal operation against a slow server can result
in delays of about one second.)

rpc/clnt_rc.c:178

I felt that timeouts of less than one second, or with finer than one-second
resolution, would not make much sense, since I thought there would be delays on
the order of one second during normal operation, when a server is slow.

The use I make of it is setting it to 15-30 seconds.

I also used a rather arbitrary 600 seconds as a sanity upper bound in the related
change in clnt_vc.c.

If anyone thinks this should be changed (either including the tv_usec field in the
timeout calculation or allowing a fraction-of-a-second timeout), I can easily change it.
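
For illustration, the check being discussed might look something like this hedged sketch, not the
committed diff; the "info" argument and rc_timeout field are assumed names, and the 600-second
upper bound lives in the related clnt_vc.c change:

    /*
     * Illustrative sketch: accept only whole-second timeouts of at
     * least one second; clnt_vc.c separately sanity-caps the value
     * at 600 seconds.
     */
    struct timeval *tvp = (struct timeval *)info;

    if (tvp->tv_sec < 1)
        return (FALSE);             /* reject sub-second timeouts */
    rc->rc_timeout.tv_sec = tvp->tv_sec;
    rc->rc_timeout.tv_usec = 0;     /* whole-second resolution only */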

rpc/clnt_rc.c:178

No, I think this is fine, just add an explanation in a comment, please.

rpc/clnt_vc.c:182 (On Diff #45392)

Why not use 1 second there for the timeout? You make the sleep timeout proportional to the socket timeout, but I do not see why that would be useful. We want to get out of the loop early; your method causes a pause of up to 2x the timeout.

Explained inline, but basically a timeout at either place counts as one retry,
and the RPC attempt fails when the limit on these retries is reached,
so it seems to me that they should be the same amount of time.

Also, 1 second doesn't "feel" long enough to wait for a TCP connect to
complete. With my limited knowledge of TCP, I don't know how long
to wait so that 99.9% of connection attempts will succeed, assuming
a functioning server and network interconnect.

rpc/clnt_vc.c:182 (On Diff #45392)

Well, the function called to do an RPC is clnt_reconnect_call(). In it there is a do/while loop:

  • If not connected, call clnt_reconnect_connect(), which ends up here.
    • If this times out, count it as a retry (the variable is "tries", compared against a limit of rc_retries; when it reaches rc_retries, exit the loop).
  • Call CLNT_CALL_MBUF(), which does the sosend().
  • If this returns an error due to timeout, count it as a retry, same as above.

As such, a timeout in either case counts as a retry, so it seems to me that they should
be the same amount of time (see the simplified sketch below). If this one is reduced to 1 second,
then rc_retries would need to be increased (it is the "retrans=N" mount option) to give the same
total time before the function gives up and fails the RPC, but then a timeout at sosend()
would only count as one of these retries.
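
A simplified sketch of that loop; the helpers do_connect() and do_call_mbuf() are hypothetical
stand-ins for the real connect path and the mbuf call macro, and error handling is reduced to the
timeout cases:

    /*
     * Illustrative sketch of the clnt_reconnect_call() retry loop
     * described above; names and error handling are simplified.
     */
    tries = 0;
    do {
        if (!connected) {
            /* Ends up in the clnt_vc.c wait-for-connect loop. */
            error = do_connect(rc);
            if (error != 0) {
                if (++tries >= rc_retries)
                    break;          /* give up and fail the RPC */
                continue;
            }
            connected = true;
        }
        error = do_call_mbuf(rc);   /* does the sosend() */
        if (error == ETIMEDOUT && ++tries >= rc_retries)
            break;                  /* give up and fail the RPC */
    } while (error != 0);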

rpc/clnt_vc.c:182 (On Diff #45392)

Well, ideally you would check the real time spent in the wait. There could be spurious wakeups, which would trigger an undeserved error. Also, if the timeout has a non-zero usec part, you are ignoring it.

IMO, doing the sleeps in 1-second (hz) steps would eliminate all those problems; of course, EWOULDBLOCK then needs to be ignored and a real-time check used as the timeout indication instead.
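
A minimal sketch of that stepped wait, assuming the clnt_vc.c connect-wait context ("so" and "tv"
as in the surrounding discussion) and an illustrative "deadline" variable:

    /*
     * Illustrative sketch: sleep at most hz ticks per iteration,
     * treat EWOULDBLOCK as a one-second tick rather than a failure,
     * and use time_uptime as the real timeout indication.
     */
    time_t deadline = time_uptime + tv.tv_sec;

    SOCK_LOCK(so);
    while ((so->so_state & SS_ISCONNECTING) && so->so_error == 0) {
        error = msleep(&so->so_timeo, SOCK_MTX(so), PSOCK | PCATCH,
            "connec", hz);
        if (error == EWOULDBLOCK) {
            if (time_uptime >= deadline)
                break;      /* real timeout reached */
            error = 0;      /* just a one-second tick */
        }
        if (error != 0)
            break;          /* interrupted */
    }
    SOCK_UNLOCK(so);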

I commented inline. I'll admit I don't see how spurious wakeup()s would be a problem.
I can (and in fact already have, in my current patch) include the tv_usec field in
the timeout calculation.
(I'll admit I get nervous fooling with things like time_uptime, which can go
negative in 2038, and assorted things related to overflow.)

rpc/clnt_vc.c:182 (On Diff #45392)

I can easily add code to include the tv_usec component (actually, I have already done that
in the code I have, by using tvtohz(), which is what the code uses in other places).

Do you really think spurious wakeup()s are a problem? I grep'd for wakeup()s on so_timeo
and found four of them in uipc_socket.c. Three of them definitely clear SS_ISCONNECTING
before the wakeup(), so those cases will exit the loop. The only case that might not is
in soshutdown(), which I don't think would be used, and I can't see why you'd want to
keep looping in that case.

Added comments as kib@ suggested and used tvtohz(), so that the tv_usec component is included in the timeout.
(I have emailed kib@ a version that uses the 1-second stepped timeout and will put that up here if he prefers it.)
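
For illustration, the tvtohz() conversion presumably looks something like this sketch (not the exact diff):

    /* tvtohz(9) converts the timeval, including tv_usec, to sleep ticks. */
    error = msleep(&so->so_timeo, SOCK_MTX(so), PSOCK | PCATCH,
        "connec", tvtohz(&tv));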


The one-second value should be not a timeout for the connection, but an interval for re-checking the condition. The check times out when the real-clock interval has passed.

Also, is the timeout in the msleep(9) call needed? Wouldn't the socket code wake up the waiter?

The problem is that, without the "timeout", it takes too long for the msleep() to wake up.
The "timeout" (or whatever you prefer to call it) is meant to bound the time this loop
waits for a TCP connection to complete to approximately
tv.tv_sec (the same timeout as set for SO_SNDTIMEO).

Without this, it can take several minutes for the NFS code to see that the RPC has failed.
With this patch, it takes roughly twice tv_sec (twice because it does a retry).
The reason I want the time to be limited is that the pNFS service is "hung" until it sees
that the RPC has failed and then disables the mirrored DS (so the other mirror continues
to work).

I suppose this patch isn't critical. It just seems that 5-7 minutes is too long to wait for
the failover to the mirrored DS.

I just ran a test without the timeout in the loop waiting for a connection
attempt, and the time it took for a failover was only about one minute more than
with that part of the patch.

I think this time to fail over is good enough, so this patch drops that
part and only includes setting SO_SNDTIMEO.

Kostik, are you ok with this version, since it has the comment you suggested?

This revision is now accepted and ready to land. (Jul 20 2018, 8:07 AM)
This revision was automatically updated to reflect the committed changes.