
mark which jail did exports via a reference to the credentials
Closed, Public

Authored by rmacklem on Feb 3 2023, 3:03 PM.

Details

Summary

Here is an alternative to D38144 using a reference to the
credentials that did the exports. Although reviewers
suggested this, I resisted because I was thinking I had
to get rid of exports when a jail is removed and that
was too difficult.

I realized that all I needed to do was release the credentials
and set mnt_exjail = NULL. Then, deletion/replacement of
the exports can be done "lazily" when vfs_export() is called.
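The lazy-deletion idea can be illustrated with a small userspace sketch. All names here (cred, mount_like, exjail_delete) are hypothetical stand-ins, not the actual kernel code, which uses struct ucred, crfree(), and the mnt_exjail field of struct mount: jail teardown just drops the credential reference and clears the pointer, and a cleared pointer marks the exports as stale until the next export update replaces them.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for struct ucred and struct mount. */
struct cred {
	atomic_int refs;
};

struct mount_like {
	struct cred *exjail;	/* credentials of the exporting jail, or NULL */
};

static struct cred *
cred_hold(struct cred *cr)
{
	atomic_fetch_add(&cr->refs, 1);
	return (cr);
}

static void
cred_free(struct cred *cr)
{
	/* Free when the last reference is dropped, like crfree(). */
	if (atomic_fetch_sub(&cr->refs, 1) == 1)
		free(cr);
}

/* Jail teardown: drop the reference and mark the exports stale. */
static void
exjail_delete(struct mount_like *mp)
{
	if (mp->exjail != NULL) {
		cred_free(mp->exjail);
		mp->exjail = NULL;
	}
}

/* A NULL exjail means the old exports are stale; actual removal or
 * replacement then happens "lazily" on the next export update. */
static int
exports_are_stale(const struct mount_like *mp)
{
	return (mp->exjail == NULL);
}
```

The essential property is that teardown never has to walk or free the export structures themselves; it only invalidates them.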

I defined a function called vfs_exjail_delete(), which I currently
call from the nfsd's OSD PR_METHOD_REMOVE method. It
releases the credentials and sets mnt_exjail to NULL for all cases
matching the prison argument.
Maybe this function should be called from within kern_jail.c
via prison_cleanup()?

I think this is a better alternative to D38144 and, if this
looks good, I will revert the commit of D38144 and commit
this instead. Thanks to all who hinted that this was the way to go.

As they say, "if at first you don't succeed, try, try again...".

Test Plan

Testing is in progress using a system patched so
that mountd/nfsd can be run in vnet prisons.

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

I would expect killing the jail to *fail* if there any exports active. You can keep the counter of them in struct prison to avoid the need to scan anything.

With the assumption that nfsd in jail is *disabled* by default, this would not violate POLA.

In D38371#872242, @mjg wrote:

I would expect killing the jail to *fail* if there any exports active. You can keep the counter of them in struct prison to avoid the need to scan anything.

With the assumption that nfsd in jail is *disabled* by default, this would not violate POLA.

I don't want killing a jail to fail. At present, the only failures for jail_remove(2) are permission and nonexistent jail. Considering the most common situation for jail removal is system shutdown, its failure could be a problem best avoided.

I defined a function called vfs_exjail_delete(), which I currently
call from the nfsd's OSD PR_METHOD_REMOVE method. It
releases the credentials and sets mnt_exjail to NULL for all cases
matching the prison argument.
Maybe this function should be called from within kern_jail.c
via prison_cleanup()?

Yes, I think so. The OSD avenue worked, but since we can't have VNET nfsd in module form anyway (which is what the OSD is best suited for), you might as well avoid its extra locking layer and go with prison_cleanup.

In D38371#872242, @mjg wrote:

I would expect killing the jail to *fail* if there any exports active. You can keep the counter of them in struct prison to avoid the need to scan anything.

With the assumption that nfsd in jail is *disabled* by default, this would not violate POLA.

I don't want killing a jail to fail. At present, the only failures for jail_remove(2) are permission and nonexistent jail. Considering the most common situation for jail removal is system shutdown, its failure could be a problem best avoided.

I think that's questionable. To illustrate, consider 'rmdir', which fails if the directory is not empty -- I think that is a feature. Similarly, if appropriate (tm) cleanup was not performed, killing the jail could fail too.

But if you insist on it, shutting the jail down should probably unexport everything and it does not with the proposed patch. It only cleans some creds.

Ok, I will try and address the various comments.

mountd(8) does not delete exports when it is terminated.
That was by design, so that, if mountd(8) died accidentally
(I did that with a typing error back when I was a sysadmin),
nfsd would continue to run and all a sysadmin had to do
was restart mountd. (It's even less of an issue for NFSv4, since
mountd(8) is not needed for new mounts to be done.)

In general, nothing gets rid of exports unless the file system
is dismounted (it actually leaks the structures, but that's a
bug for another day). mountd(8) will delete them when it
first starts up and will delete ones taken out of the exports(5)
file when it is reloaded (via a SIGHUP).

So, having exports at jail shutdown is normal and expected,
if mountd(8) has been running in the prison.

To get rid of the actual exports during shutdown is very messy.
You have to acquire a lockmgr() lock and you can't do that while
looping holding the mountlist lock.
So, the patch clears mnt_exjail, which indicates that the exports
are now stale; they then go away when mountd(8) updates them
or the file system gets dismounted. This makes them unusable --
only the structures have not been freed yet -- so it does "get rid of them".
If you don't get rid of the cred references, then the jail will get
stuck in the dying state, as I understand it.
Note that the root fs for a jail will have been mounted outside of
the jail, so it doesn't get dismounted during "jail -r".

nfsd.ko does now load dynamically for the VNET_NFSD case.
It was fixed once I malloc'd the arrays and structures instead of
putting them in the vnet.
(kern_jail.c does need to be updated for this, but I was waiting
until everything else goes in main and then I was going to take
the defined(VNET_NFSD) and defined(NFSD) out of kern_jail.c
to enable it all.)

The only advantage of moving the vfs_exjail_delete() call to
prison_cleanup() is that the nfs cleanup function could then
be called from VNET_SYSUNINIT() instead of OSD once it
is fixed so that it works. (Recall that "jail -r" on vnet prisons
does not currently work and leaves them in the dying state.)

And, yes, adding a counter of exports to struct prison would
allow us to avoid the scan (or break out of it early), so I agree
that is an optimization worth coding for the next version of the patch.

So, having exports at jail shutdown is normal and expected,
if mountd(8) has been running in the prison.

This is a non-starter. Shutting down a jail has to be conceptually seen as shutting down the system, except with effects only limited to the jail at hand.

To get rid of the actual exports during shutdown is very messy.
You have to acquire a lockmgr() lock and you can't do that while
looping holding the moutnlist lock.

This bit is not a problem, see kern_getfsstat for an example.

Preferably there would be no linear scans to begin with, let alone over the global mount list.

Unfortunately there are several spots which open-code the traversal which makes it harder to introduce basic sanity in the area, but I'll look into it.

That said, I think the pragmatic course of action right now is to get another loan from the Technical Bank and copy the aforementioned traversal from kern_getfsstat, while damage-controlling it by counting per-jail exports (the counter should probably be modified in spots which ref/unref the mnt_exjail cred).
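The kern_getfsstat-style traversal referred to above can be sketched in userspace C. All names here are hypothetical stand-ins: the list mutex plays the role of mountlist_mtx, and the busy counter plays the role of vfs_busy()/vfs_unbusy(), which keep an entry from disappearing while the list lock is dropped for sleepable work such as taking a lockmgr lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct mnt {
	struct mnt *next;
	int busy;	/* stand-in for a vfs_busy() reference */
	int exported;	/* stand-in for mnt_exjail being set */
};

static pthread_mutex_t list_mtx = PTHREAD_MUTEX_INITIALIZER;

/* Walk the list, "busying" each entry so the list lock can be dropped
 * while sleepable work is done on it, then re-taking the list lock to
 * step to the next entry (still valid because this one was busied). */
static int
scan_and_clear(struct mnt *head)
{
	int cleared = 0;

	pthread_mutex_lock(&list_mtx);
	for (struct mnt *mp = head; mp != NULL;) {
		mp->busy++;			/* vfs_busy() stand-in */
		pthread_mutex_unlock(&list_mtx);

		if (mp->exported) {		/* sleepable work goes here */
			mp->exported = 0;
			cleared++;
		}

		pthread_mutex_lock(&list_mtx);
		struct mnt *next = mp->next;	/* safe: mp is still busied */
		mp->busy--;			/* vfs_unbusy() stand-in */
		mp = next;
	}
	pthread_mutex_unlock(&list_mtx);
	return (cleared);
}
```

This is the shape of the traversal that avoids holding the mountlist lock across a lockmgr() acquisition.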

btw while grepping for mountlist I found this bit:

int
nfsrv_deldsserver(int op, char *dspathp, NFSPROC_T *p)
[..]
        mtx_lock(&mountlist_mtx);
        TAILQ_FOREACH(mp, &mountlist, mnt_list) {
                if (strcmp(mp->mnt_stat.f_mntonname, dspathp) == 0 &&
                    strcmp(mp->mnt_stat.f_fstypename, "nfs") == 0 &&
                    mp->mnt_data != NULL) {
                        nmp = VFSTONFS(mp);
                        NFSLOCKMNT(nmp);
                        if ((nmp->nm_privflag & (NFSMNTP_FORCEDISM |
                             NFSMNTP_CANCELRPCS)) == 0) {
                                nmp->nm_privflag |= NFSMNTP_CANCELRPCS;
                                NFSUNLOCKMNT(nmp);
                        } else {
                                NFSUNLOCKMNT(nmp);
                                nmp = NULL;
                        }
                        break;
                }
        }
        mtx_unlock(&mountlist_mtx);

The caller, nfssvc_nfsd, grabs the path from userspace and passes it through:

error = copyin(uap->argp, &pnfsdarg, sizeof(pnfsdarg));
if (error == 0 && (pnfsdarg.op == PNFSDOP_DELDSSERVER ||
    pnfsdarg.op == PNFSDOP_FORCEDELDS)) {
        cp = malloc(PATH_MAX + 1, M_TEMP, M_WAITOK);
        error = copyinstr(pnfsdarg.dspath, cp, PATH_MAX + 1,
            NULL);
        if (error == 0)
                error = nfsrv_deldsserver(pnfsdarg.op, cp, td);
        free(cp, M_TEMP);

this can't be correct in the face of different jails running it?

This version has two changes, suggested by jamie@
and mjg@:
1 - Moved the vfs_exjail_delete() call into prison_cleanup().
2 - Added a counter for how many mounts have been
exported by a prison, and used that count in vfs_exjail_delete().
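The counter-assisted scan can be sketched in userspace C. All names here (prison_like, mount_like, export_from_jail, exjail_delete) are hypothetical stand-ins for the proposed pr_exportcnt and the real vfs_exjail_delete(): each export from within a jail bumps a per-prison count, and teardown can stop scanning the mount table as soon as that count of exports has been found and cleared.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct prison and struct mount. */
struct prison_like {
	atomic_int exportcnt;	/* like the proposed pr_exportcnt */
};

struct mount_like {
	struct prison_like *exjail;	/* jail that did the exports, or NULL */
};

/* Done whenever a mount is exported from within a jail. */
static void
export_from_jail(struct mount_like *mp, struct prison_like *pr)
{
	mp->exjail = pr;
	atomic_fetch_add(&pr->exportcnt, 1);
}

/* Scan the mount table, clearing exports for this prison; the counter
 * lets the scan terminate early once every export has been found, and
 * lets it skip the scan entirely when there are no exports at all. */
static void
exjail_delete(struct prison_like *pr, struct mount_like *mounts, size_t n)
{
	int remaining = atomic_load(&pr->exportcnt);

	for (size_t i = 0; i < n && remaining > 0; i++) {
		if (mounts[i].exjail == pr) {
			mounts[i].exjail = NULL;
			remaining = atomic_fetch_sub(&pr->exportcnt, 1) - 1;
		}
	}
}
```

The common case of a jail that never ran mountd(8) costs only one atomic load.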

vfs_exjail_delete() gets rid of the exports done within
the jail, just like shutting down the system gets rid of
them because the system goes away. (All that is left
are some malloc'd structures that will go away if/when
the file system is re-exported. Code that is not in this
commit checks that an export is for the correct jail
using mnt_exjail.)

I don't see any problem with this.

W.r.t. the pNFS case. Yes, a pNFS server is not supported
in a jail. It is not allowed, as coded in D37519.

This version has vfs_exjail_delete() modified so
that it does the song and dance to free the mnt_export
stuff. Personally, I think it just adds unnecessary
complexity, but if mjg@ feels it must be done, I think
this code does it. (When vfs_busy() fails, it falls back to
just getting rid of the credential reference and setting
mnt_exjail to NULL, so I am not worried about it.)

There are now three variants:

  • what is currently in main
  • the previous version
  • this one with the extra complexity

Others can choose which they think should be in main.

Oops, I forgot to recheck mnt_exjail after vfs_busy()
and reacquiring the locks in vfs_exjail_delete().
I think vfs_exjail_delete() is correct now, although it
is pretty complicated.

sys/sys/jail.h
178

Could you use one of the pr_spare[] for this?

Used one of the pr_spares for pr_exportcnt, as
requested by jamie@.

sys/kern/vfs_export.c
302

The patch is missing the corresponding tweak to nfsrv_v4rootexport().

471

I think it'd be useful to have a comment explaining why we want to release credential references even when we can't busy the mountpoint.

sys/kern/vfs_mount.c
765

Or use atomic_sub_int.

Made the changes suggested by markj@.

Since the patch now includes the call to vfs_export()
in nfs_nfsdport.c, this version also includes the checks
of mnt_exjail when the exports are to be used, as well.

Spotted a bug in vfs_exjail_delete(), where it released
and reacquired the MNT_ILOCK() mutex after testing
mnt_exjail.

This version fixes that.

I've also added "*credp = NULL;" at the beginning of
nfsvno_checkexp() to ensure it is returned NULL when
a mnt_exjail error is detected.

markj added inline comments.
sys/kern/vfs_export.c
516

Can we assert that i == 0 after the loop terminates?

This revision is now accepted and ready to land. Feb 20 2023, 3:43 PM
sys/kern/vfs_export.c
516

Good question. I think that jamie@ might be able
to answer this. Since this is called from prison_cleanup(),
I believe that the value of pr_exportcnt cannot increase.
As such, "i" is set to the upper bound of how many entries
there are in the mountlist.

I believe the algorithm is safe in that, if the value
of pr_exportcnt is decreased by something else during the
execution of this function, all that will happen is an extra
pass through the mountlist, looking for another entry that
no longer exists.

Can pr_exportcnt be decreased by something else during
the execution of this function?
I can think of two possible cases:
1 - If the file system is dismounted from outside of the
prison while the prison is being removed.
I don't know when the file systems used by a prison
are "unbusied" so that they can be dismounted.
If it is when all the processes in the prison are terminated,
then I believe a dismount could happen before this
function completes, since prison_cleanup() is called
after all processes within the prison are gone.

2 - If there are multiple concurrent executions of this
function for the same prison. (I see several calls to
prison_cleanup(), but I have no idea if that could
result in concurrent calls?)

In summary, I do not think it can be KASSERT()'d that
"i == 0", but jamie@ might know for sure?

sys/kern/vfs_export.c
516

I do think a KASSERT() for pr_exportcnt == 0 would
be ok here. I'll add that to the patch.

sys/kern/vfs_export.c
516

1 - If the file system is dismounted from outside of the
prison while the prison is being removed.

I have no idea about this one.

2 - If there are multiple concurrent executions of this
function for the same prison. (I see several calls to
prison_cleanup(), but I have no idea if that could
result in concurrent calls?)

This won't happen, because allprison_lock is held exclusively.

516

Is there a reason pr_exportcnt is copied to a local variable first? It seems we could atomic_subtract_int(pr_exportcnt) directly - or better yet, use refcount(9), which I discovered when making pr_ref and pr_uref atomic.

All pr_exportcnt manipulation is done under the auspices of MNT_ILOCK (I think), so there shouldn't be a worry of a double-decrement. And as mentioned, the fact that this is after all user processes are gone should protect against an unexpected increase.

sys/kern/vfs_export.c
516

I just thought copying it was easier than
using a bunch of atomic_load_int() calls.

I used atomics instead of just depending on
the MNT_ILOCK() mutex because I thought
it looked "weird" that a field in "struct prison"
was protected by a mutex in "struct mount".

refcount(9) could be used as well. I didn't use it
because it doesn't let you break out of the loops
where you decrement the refcount atomically,
if you choose to do so.

I don't care which way it is coded, since they
are all safe and the worst-case failure is one
additional pass through the mountlist.
pr_exportcnt is just an optimization to avoid
the extra pass through the mountlist. Since this
happens infrequently, I am also happy to just
get rid of pr_exportcnt.

Let me know what you prefer.

sys/kern/vfs_export.c
516

I used atomics instead of just depending on
the MNT_ILOCK() mutex because I thought
it looked "weird" that a field in "struct prison"
was protected by a mutex in "struct mount".

Now that I think a little more, MNT_ILOCK isn't enough to protect it, because pr_exportcnt isn't limited to a single mount point. So it still needs to be atomic.
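The point that a per-mount mutex cannot serialize updates to a counter shared across mounts can be demonstrated with a userspace sketch (hypothetical names; pthread mutexes stand in for MNT_ILOCK()): two threads decrement the same counter while each holds a *different* mutex, so the locks provide no mutual exclusion for the counter, and correctness rests entirely on the decrement itself being atomic.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* One shared per-prison counter, two per-mount locks. */
static atomic_int exportcnt;
static pthread_mutex_t mnt_lock_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t mnt_lock_b = PTHREAD_MUTEX_INITIALIZER;

static void *
drop_exports(void *arg)
{
	pthread_mutex_t *mnt_lock = arg;

	for (int i = 0; i < 100000; i++) {
		pthread_mutex_lock(mnt_lock);	/* MNT_ILOCK() stand-in */
		/* Holding different mutexes, the two threads still race
		 * on exportcnt; the atomic op is what keeps it exact. */
		atomic_fetch_sub(&exportcnt, 1);
		pthread_mutex_unlock(mnt_lock);
	}
	return (NULL);
}
```

With a plain non-atomic decrement here, lost updates would leave the counter nonzero.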

I don't care which way it is coded, since they
are all safe and the worst-case failure is one
additional pass through the mountlist.
pr_exportcnt is just an optimization to avoid
the extra pass through the mountlist. Since this
happens infrequently, I am also happy to just
get rid of pr_exportcnt.

Let me know what you prefer.

I like pr_exportcnt existing because it bypasses the usual case of no exports at all. And since we know it won't be incrementing, I guess it's OK in a local variable here (perhaps with a comment). But I don't think we can assert that i == 0.