
Handle SIGPIPE in gssd, and limit kgssapi RPC retries
Closed, Public

Authored by sef on Feb 11 2019, 11:52 PM.
Details

Summary

We (iXsystems) have a case where the AF_LOCAL socket gssd uses gets closed, resulting in a subsequent I/O failing and generating SIGPIPE. When this happens, gssd doesn't clean things up properly; a subsequent attempt to start nfsd results in it looping uninterruptibly for INT_MAX iterations (at, as I recall, 30 seconds each). This does not result in happiness.

While I have not yet identified the root cause, the two mitigations here are fairly obvious.

Handling SIGPIPE is a very obvious one; I won't object to a request that it also handle most of the other default-fatal signals.

The limiting of retries for kgssapi is more questionable; however, I see no reason why the retry loop should be both infinite and non-interruptible, and limiting the retry count seems fairly mundane.

Diff Detail

Repository
rS FreeBSD src repository - subversion

Event Timeline

First off, I should note that I am not the author of this code. If dfr@ could review this, that would be better.

The SIGPIPE change looks fine to me.

The change to a relatively small number of retries also makes sense to me, since it is a couple of reconnect
attempts on the AF_LOCAL socket, and reconnects shouldn't normally be needed, I think?
However, since this setting applies to all the RPCs, doing it in sys_gssd_syscall() when the "cl" is created
would make more sense to me than doing it in gss_accept_sec_context().

Took Rick's advice and moved the retry setting to a more specific place.

This one looks fine to me, rick.

This revision is now accepted and ready to land. Feb 12 2019, 2:36 PM
This revision was automatically updated to reflect the committed changes.

Somewhat belated question, would it make sense to mark the client created in sys_gssd_syscall() as interruptible?

We recently have got a problem at work when nfsd is stuck here on shutdown:

sched_switch+0x804
mi_switch+0xf5
sleepq_timedwait+0x42
_sleep+0x224
clnt_reconnect_call+0x396
clnt_call_private+0xb5
gssd_acquire_cred_1+0x47
gss_acquire_cred+0xa4
rpc_gss_acquire_svc_cred+0x8c
rpc_gss_set_svc_name+0x95
nfsrvd_nfsd+0x11e
nfssvc_nfsd+0x1c4
sys_nfssvc+0xa1
amd64_syscall+0x28e
fast_syscall_common+0x101

Apparently, it could not be killed because the sleep did not catch signals.

That was before this change went in, so nfsd was stuck there "forever" (well, until a watchdog kicked in).
This change should help, of course, but it seems that catching signals would be even better.
That is, unless it can break some existing expectations about the behavior.

In D19153#415697, @avg wrote:

Somewhat belated question, would it make sense to mark the client created in sys_gssd_syscall() as interruptible?

I did that initially, but Rick was very wary of turning non-interruptible calls into interruptible ones. Although I think I ended up doing it at a different spot, before moving the retry count, so it may be worth doing it that way.

My concern is that all sorts of signals get posted to processes and you don't want any old
signal to interrupt an attempt at an upcall. The failure of a gssd should be a rare occurrence
and so long as the upcall doesn't try "forever", I think that will be sufficient.

Also, for NFSv4, you don't want an RPC attempt on a server to be interrupted, since that
leaves the open/byte-range lock state indeterminate, and that is not good.