Page MenuHomeFreeBSD

fix two races in kernel handling of the nfsuserd daemon
ClosedPublic

Authored by rmacklem on Nov 15 2019, 1:09 AM.
Tags
None
Referenced Files
Unknown Object (File)
Sat, Oct 25, 12:52 AM
Unknown Object (File)
Thu, Oct 23, 2:40 PM
Unknown Object (File)
Sat, Oct 18, 11:10 PM
Unknown Object (File)
Fri, Oct 17, 5:33 PM
Unknown Object (File)
Fri, Oct 17, 2:37 AM
Unknown Object (File)
Wed, Oct 15, 2:10 AM
Unknown Object (File)
Mon, Oct 13, 1:14 PM
Unknown Object (File)
Sat, Sep 27, 6:28 AM
Subscribers

Details

Summary

avg@ reported a crash where the nr_client (krpc client structure ptr) was NULL during an
upcall to the nfsuserd daemon. nr_client is only NULL when the nfsuserd is shut down or
shutting down.

Upon inspection, two races were identified. The first is seems likely to be the cause of the crash.

  • At the beginning of nfsrv_getuser(), the nfsrv_nfsuserd variable was checked to see if the daemon was running, but then the upcall RPC was done sometime later via newnfs_request(), leaving a window during which nfsuserd shutdown could have occurred. --> This patch adds code to acquire a reference count on the structure pointed at by nr_client and holds that reference until after newnfs_request(), so that the structure will not be prematurely destroyed.

The second race was between a startup and shutdown of the daemon, since the variable nfsrv_nfsuserd was only
a binary running/not_running value that did not handle the periods of time during startup and shutdown.
--> This patch makes nfsrv_nfsuserd a tri-state variable (an enum), so that it can hold an intermediate state

during startup/shutdown, to prevent a race between a nfsuserd daemon shutdown and the startup of another
one.
Test Plan

I have done basic testing of the code, although I do not have a way to reproduce the crash at this time.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Skipped
Unit
Tests Skipped
Build Status
Buildable 27524

Event Timeline

fs/nfs/nfs_commonsubs.c
3569

Maybe it's better to move this line into the locked section?
Or it does not matter much?

3577

Maybe it's better to move this to after setting NOTRUNNING?
Not sure if it matters.

Replied to inline comments.

fs/nfs/nfs_commonsubs.c
3569

Actually, I don't think I needed to move the NFSUNLOCKNAMEID() at all.
It protects the nfsrv_nfsuserd variable. (The lock is a mutex, so it can't be held when
a malloc(M_WAITOK) is done.)

  • The only reason I moved it down was to ensure that nr_client was set NULL for all processes/threads on all cores. This probably doesn't matter, but might help w.r.t. debugging?

I'll admit that I've never been clear on exactly what has to be done to ensure a core
doesn't use a stale cached memory value, but I've been told a couple of things:
1 - jhb@ once said that holding a mutex while doing an unconditional assignment

isn't usually necessary, but he always did it, to be safe.

2 - alc@ described it in some detail and my recollection is that an unlock/lock of

a mutex (any mutex) is sufficient.
3577

Well, when nfsrv_nfsuserd is set NOTRUNNING, that allows a new nfsuserd
to start up, which could stomp on nr_nam. I think it is correct to do the free()
before that can happen and not after.

Updated my inline comment reply.

fs/nfs/nfs_commonsubs.c
3569

I think I'll move the

rp->nr_client = NULL;

line to above the rest of the field assignments and then unlock the mutex,
so it is clear that nr_client is the only field I care about w.r.t. the lock.
I might also add a comment clarifying that protecting nfsrv_nfsuserd by the lock
is what actually matters.

Move the NFSUNLOCKNAMEID() up so that it only updates the nr_client field
before unlocking and add a comment related to this.

This should not have any semantic effect, but will clarify why I moved the lock
down in the patch. If the code is correct, holding the lock while setting nr_client NULL
is harmless, but unnecessary.

Oops, I've realized there are other fields of nfsrv_nfsuserdsock that are
used by newnfs_request(), so holding a reference count on *nr_client isn't
sufficient.

This version of the patch keeps a count of upcalls in progress and then
nfsrv_nfsuserddelport() msleep()s until the count drops to zero before
shutting down the upcall socket. Since nfsrv_nfsuserd is set to STARTSTOP
while it is waiting, no new upcalls will be started.

Add a check for nfsrv_nfsuserd == STARTSTOP to the wakeup(), to avoid extraneous
wakeup() calls.
Also add a KASSERT() for nfsrv_userdupcalls.

I think this patch is now ready for testing/review.

This has been tested by Panzura QA.
They could not reproduce the original problem and no regressions were found.

This revision is now accepted and ready to land.Nov 28 2019, 6:38 AM