
mark which jail did exports via a reference to the credentials
Closed, Public

Authored by rmacklem on Feb 3 2023, 3:03 PM.

Details

Summary

Here is an alternative to D38144 using a reference to the
credentials that did the exports. Although reviewers
suggested this, I resisted because I was thinking I had
to get rid of exports when a jail is removed and that
was too difficult.

I realized that all I needed to do was release the credentials
and set mnt_exjail = NULL. Then, deletion/replacement of
the exports can be done "lazily" when vfs_export() is called.
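The lazy-deletion idea can be illustrated with a small userspace sketch. All names here (cred, mount_like, exjail_delete) are hypothetical stand-ins, not the actual kernel code, which uses struct ucred, crfree(), and the mnt_exjail field of struct mount: jail teardown just drops the credential reference and clears the pointer, and a cleared pointer marks the exports as stale until the next export update replaces them.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for struct ucred and struct mount. */
struct cred {
	atomic_int refs;
};

struct mount_like {
	struct cred *exjail;	/* credentials of the exporting jail, or NULL */
};

static struct cred *
cred_hold(struct cred *cr)
{
	atomic_fetch_add(&cr->refs, 1);
	return (cr);
}

static void
cred_free(struct cred *cr)
{
	/* Free when the last reference is dropped, like crfree(). */
	if (atomic_fetch_sub(&cr->refs, 1) == 1)
		free(cr);
}

/* Jail teardown: drop the reference and mark the exports stale. */
static void
exjail_delete(struct mount_like *mp)
{
	if (mp->exjail != NULL) {
		cred_free(mp->exjail);
		mp->exjail = NULL;
	}
}

/* A NULL exjail means the old exports are stale; actual removal or
 * replacement then happens "lazily" on the next export update. */
static int
exports_are_stale(const struct mount_like *mp)
{
	return (mp->exjail == NULL);
}
```

The essential property is that teardown never has to walk or free the export structures themselves; it only invalidates them.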

I defined a function called vfs_exjail_delete(), which I currently
call from the nfsd's OSD PR_METHOD_REMOVE method. It
releases the credentials and sets mnt_exjail to NULL for all cases
matching the prison argument.
Maybe this function should be called from within kern_jail.c
via prison_cleanup()?

I think this is a better alternative to D38144 and, if this
looks good, I will revert the commit of D38144 and commit
this instead. Thanks to all who hinted that this was the way to go.

As they say, "if at first you don't succeed, try, try again...".

Test Plan

Testing is in progress using a system patched so
that mountd/nfsd can be run in vnet prisons.

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

I would expect killing the jail to *fail* if there any exports active. You can keep the counter of them in struct prison to avoid the need to scan anything.

With the assumption that nfsd in jail is *disabled* by default, this would not violate POLA.

In D38371#872242, @mjg wrote:

I would expect killing the jail to *fail* if there any exports active. You can keep the counter of them in struct prison to avoid the need to scan anything.

With the assumption that nfsd in jail is *disabled* by default, this would not violate POLA.

I don't want killing a jail to fail. At present, the only failures for jail_remove(2) are permission and nonexistent jail. Considering the most common situation for jail removal is system shutdown, its failure could be a problem best avoided.

I defined a function called vfs_exjail_delete(), which I currently
call from the nfsd's OSD PR_METHOD_REMOVE method. It
releases the credentials and sets mnt_exjail to NULL for all cases
matching the prison argument.
Maybe this function should be called from within kern_jail.c
via prison_cleanup()?

Yes, I think so. The OSD avenue worked, but since we can't have VNET nfsd in module form anyway (which is what the OSD is best suited for), you might as well avoid its extra locking layer and go with prison_cleanup.

In D38371#872242, @mjg wrote:

I would expect killing the jail to *fail* if there any exports active. You can keep the counter of them in struct prison to avoid the need to scan anything.

With the assumption that nfsd in jail is *disabled* by default, this would not violate POLA.

I don't want killing a jail to fail. At present, the only failures for jail_remove(2) are permission and nonexistent jail. Considering the most common situation for jail removal is system shutdown, its failure could be a problem best avoided.

I think that's questionable. To illustrate, consider 'rmdir', which fails if the directory is not empty -- I think that is a feature. Similarly, if appropriate (tm) cleanup was not performed, killing the jail could fail too.

But if you insist on it, shutting the jail down should probably unexport everything and it does not with the proposed patch. It only cleans some creds.

Ok, I will try and address the various comments.

mountd(8) does not delete exports when it is terminated.
That was by design, so that, if mountd(8) died accidentally
(I did that with a typing error back when I was a sysadmin),
nfsd would continue to run and all a sysadmin had to do
was restart mountd. (It's even less of an issue for NFSv4, since
mountd(8) is not needed for new mounts to be done.)

In general, nothing gets rid of exports unless the file system
is dismounted (it actually leaks the structures, but that's a
bug for another day). mountd(8) will delete them when it
first starts up and will delete ones taken out of the exports(5)
file when it is reloaded (via a SIGHUP).

So, having exports at jail shutdown is normal and expected,
if mountd(8) has been running in the prison.

To get rid of the actual exports during shutdown is very messy.
You have to acquire a lockmgr() lock and you can't do that while
looping holding the mountlist lock.
So, the patch clears mnt_exjail, which indicates that the exports
are now stale; they then go away when mountd(8) updates them
or the file system gets dismounted. This makes them unusable --
only the structures have not been freed yet -- so it does "get rid of them".
If you don't get rid of the cred references, then the jail will get
stuck in the dying state, as I understand it.
Note that the root fs for a jail will have been mounted outside of
the jail, so it doesn't get dismounted during "jail -r".

nfsd.ko does now load dynamically for the VNET_NFSD case.
It was fixed once I malloc'd the arrays and structures instead of
putting them in the vnet.
(kern_jail.c does need to be updated for this, but I was waiting
until everything else goes in main and then I was going to take
the defined(VNET_NFSD) and defined(NFSD) out of kern_jail.c
to enable it all.)

The only advantage of moving the vfs_exjail_delete() call to
prison_cleanup() is that the nfs cleanup function could then
be called from VNET_SYSUNINIT() instead of OSD once it
is fixed so that it works. (Recall that "jail -r" on vnet prisons
does not currently work and leaves them in the dying state.)

And, yes, adding a counter of exports to struct prison would
allow us to avoid the scan (or break out of it early), so I agree
that is an optimization worth coding for the next version of the patch.

So, having exports at jail shutdown is normal and expected,
if mountd(8) has been running in the prison.

This is a non-starter. Shutting down a jail has to be conceptually seen as shutting down the system, except with effects only limited to the jail at hand.

To get rid of the actual exports during shutdown is very messy.
You have to acquire a lockmgr() lock and you can't do that while
looping holding the moutnlist lock.

This bit is not a problem, see kern_getfsstat for an example.

Preferably there would be no linear scans to begin with, let alone over the global mount list.

Unfortunately there are several spots which open-code the traversal which makes it harder to introduce basic sanity in the area, but I'll look into it.

That said, I think the pragmatic course of action right now is to get another loan from the Technical Bank and copy the aforementioned traversal from kern_getfsstat, while damage-controlling it by counting per-jail exports (the counter should probably be modified in spots which ref/unref the mnt_exjail cred).
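The kern_getfsstat-style traversal referred to above can be sketched in userspace C. All names here are hypothetical stand-ins: the list mutex plays the role of mountlist_mtx, and the busy counter plays the role of vfs_busy()/vfs_unbusy(), which keep an entry from disappearing while the list lock is dropped for sleepable work such as taking a lockmgr lock.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct mnt {
	struct mnt *next;
	int busy;	/* stand-in for a vfs_busy() reference */
	int exported;	/* stand-in for mnt_exjail being set */
};

static pthread_mutex_t list_mtx = PTHREAD_MUTEX_INITIALIZER;

/* Walk the list, "busying" each entry so the list lock can be dropped
 * while sleepable work is done on it, then re-taking the list lock to
 * step to the next entry (still valid because this one was busied). */
static int
scan_and_clear(struct mnt *head)
{
	int cleared = 0;

	pthread_mutex_lock(&list_mtx);
	for (struct mnt *mp = head; mp != NULL;) {
		mp->busy++;			/* vfs_busy() stand-in */
		pthread_mutex_unlock(&list_mtx);

		if (mp->exported) {		/* sleepable work goes here */
			mp->exported = 0;
			cleared++;
		}

		pthread_mutex_lock(&list_mtx);
		struct mnt *next = mp->next;	/* safe: mp is still busied */
		mp->busy--;			/* vfs_unbusy() stand-in */
		mp = next;
	}
	pthread_mutex_unlock(&list_mtx);
	return (cleared);
}
```

This is the shape of the traversal that avoids holding the mountlist lock across a lockmgr() acquisition.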

btw while grepping for mountlist I found this bit:

int
nfsrv_deldsserver(int op, char *dspathp, NFSPROC_T *p)
[..]
        mtx_lock(&mountlist_mtx);
        TAILQ_FOREACH(mp, &mountlist, mnt_list) {
                if (strcmp(mp->mnt_stat.f_mntonname, dspathp) == 0 &&
                    strcmp(mp->mnt_stat.f_fstypename, "nfs") == 0 &&
                    mp->mnt_data != NULL) {
                        nmp = VFSTONFS(mp);
                        NFSLOCKMNT(nmp);
                        if ((nmp->nm_privflag & (NFSMNTP_FORCEDISM |
                             NFSMNTP_CANCELRPCS)) == 0) {
                                nmp->nm_privflag |= NFSMNTP_CANCELRPCS;
                                NFSUNLOCKMNT(nmp);
                        } else {
                                NFSUNLOCKMNT(nmp);
                                nmp = NULL;
                        }
                        break;
                }
        }
        mtx_unlock(&mountlist_mtx);

The caller, nfssvc_nfsd, grabs the path from userspace and passes it through:

error = copyin(uap->argp, &pnfsdarg, sizeof(pnfsdarg));
if (error == 0 && (pnfsdarg.op == PNFSDOP_DELDSSERVER ||
    pnfsdarg.op == PNFSDOP_FORCEDELDS)) {
        cp = malloc(PATH_MAX + 1, M_TEMP, M_WAITOK);
        error = copyinstr(pnfsdarg.dspath, cp, PATH_MAX + 1,
            NULL);
        if (error == 0)
                error = nfsrv_deldsserver(pnfsdarg.op, cp, td);
        free(cp, M_TEMP);

this can't be correct in the face of different jails running it?

This version has two changes, suggested by jamie@
and mjg@:
1 - Moved the vfs_exjail_delete() call into prison_cleanup().
2 - Added a counter for how many mounts have been
exported by a prison, and used that count in vfs_exjail_delete().
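The counter-assisted scan can be sketched in userspace C. All names here (prison_like, mount_like, export_from_jail, exjail_delete) are hypothetical stand-ins for the proposed pr_exportcnt and the real vfs_exjail_delete(): each export from within a jail bumps a per-prison count, and teardown can stop scanning the mount table as soon as that count of exports has been found and cleared.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct prison and struct mount. */
struct prison_like {
	atomic_int exportcnt;	/* like the proposed pr_exportcnt */
};

struct mount_like {
	struct prison_like *exjail;	/* jail that did the exports, or NULL */
};

/* Done whenever a mount is exported from within a jail. */
static void
export_from_jail(struct mount_like *mp, struct prison_like *pr)
{
	mp->exjail = pr;
	atomic_fetch_add(&pr->exportcnt, 1);
}

/* Scan the mount table, clearing exports for this prison; the counter
 * lets the scan terminate early once every export has been found, and
 * lets it skip the scan entirely when there are no exports at all. */
static void
exjail_delete(struct prison_like *pr, struct mount_like *mounts, size_t n)
{
	int remaining = atomic_load(&pr->exportcnt);

	for (size_t i = 0; i < n && remaining > 0; i++) {
		if (mounts[i].exjail == pr) {
			mounts[i].exjail = NULL;
			remaining = atomic_fetch_sub(&pr->exportcnt, 1) - 1;
		}
	}
}
```

The common case of a jail that never ran mountd(8) costs only one atomic load.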

vfs_exjail_delete() gets rid of the exports done within
the jail, just like shutting down the system gets rid of
them because the system goes away. (All that is left
are some malloc'd structures that will go away if/when
the file system is re-exported. Code that is not in this
commit checks that an export is for the correct jail
using mnt_exjail.)

I don't see any problem with this.

W.r.t. the pNFS case. Yes, a pNFS server is not supported
in a jail. It is not allowed, as coded in D37519.

This version has vfs_exjail_delete() modified so
that it does the song and dance to free the mnt_export
stuff. Personally, I think it just adds unnecessary
complexity, but if mjg@ feels it must be done, I think
this code does it. (When vfs_busy() fails, it falls back to
just getting rid of the credential reference and setting
mnt_exjail to NULL, so I am not worried about it.)

There are now three variants:

  • what is currently in main
  • the previous version
  • this one with the extra complexity

Others can choose which they think should be in main.

Oops, I forgot to recheck mnt_exjail after vfs_busy()
and reacquiring the locks in vfs_exjail_delete().
I think vfs_exjail_delete() is correct now, although it
is pretty complicated.

sys/sys/jail.h
178

Could you use one of the pr_spare[] for this?

Used one of the pr_spares for pr_exportcnt, as
requested by jamie@.

sys/kern/vfs_export.c
302

The patch is missing the corresponding tweak to nfsrv_v4rootexport().

471

I think it'd be useful to have a comment explaining why we want to release credential references even when we can't busy the mountpoint.

sys/kern/vfs_mount.c
765

Or use atomic_sub_int.

Made the changes suggested by markj@.

Since the patch now includes the call to vfs_export()
in nfs_nfsdport.c, this version also includes the checks
of mnt_exjail when the exports are to be used, as well.

Spotted a bug in vfs_exjail_delete(), where it released
and reacquired the MNT_ILOCK() mutex after testing
mnt_exjail.

This version fixes that.

I've also added "*credp = NULL;" at the beginning of
nfsvno_checkexp() to ensure it is returned NULL when
a mnt_exjail error is detected.

markj added inline comments.
sys/kern/vfs_export.c
516

Can we assert that i == 0 after the loop terminates?

This revision is now accepted and ready to land. Feb 20 2023, 3:43 PM
sys/kern/vfs_export.c
516

Good question. I think that jamie@ might be able
to answer this. Since this is called from prison_cleanup(),
I believe that the value of pr_exportcnt cannot increase.
As such, "i" is set to the upper bound of how many entries
there are in the mountlist.

I believe the algorithm is safe in that, if the value
of pr_exportcnt is decreased by something else during the
execution of this function, all that will happen is an extra
pass through the mountlist, looking for another entry that
no longer exists.

Can pr_exportcnt be decreased by something else during
the execution of this function?
I can think of two possible cases:
1 - If the file system is dismounted from outside of the
prison while the prison is being removed.
I don't know when the file systems used by a prison
are "unbusied" so that they can be dismounted.
If it is when all the processes in the prison are terminated,
then I believe a dismount could happen before this
function completes, since prison_cleanup() is called
after all processes within the prison are gone.

2 - If there are multiple concurrent executions of this
function for the same prison. (I see several calls to
prison_cleanup(), but I have no idea if that could
result in concurrent calls?)

In summary, I do not think it can be KASSERT()'d that
"i == 0", but jamie@ might know for sure?

sys/kern/vfs_export.c
516

I do think a KASSERT() for pr_exportcnt == 0 would
be ok here. I'll add that to the patch.

sys/kern/vfs_export.c
516

1 - If the file system is dismounted from outside of the
prison while the prison is being removed.

I have no idea about this one.

2 - If there are multiple concurrent executions of this
function for the same prison. (I see several calls to
prison_cleanup(), but I have no idea if that could
result in concurrent calls?)

This won't happen, because allprison_lock is held exclusively.

516

Is there a reason pr_exportcnt is copied to a local variable first? It seems we could atomic_subtract_int(pr_exportcnt) directly - or better yet, use refcount(9), which I discovered when making pr_ref and pr_uref atomic.

All pr_exportcnt manipulation is done under the auspices of MNT_ILOCK (I think), so there shouldn't be a worry of a double-decrement. And as mentioned, the fact that this is after all user processes are gone should protect against an unexpected increase.

sys/kern/vfs_export.c
516

I just thought copying it was easier than
using a bunch of atomic_load_int() calls.

I used atomics instead of just depending on
the MNT_ILOCK() mutex because I thought
it looked "weird" that a field in "struct prison"
was protected by a mutex in "struct mount".

refcount(9) could be used as well. I didn't use it
because it doesn't let you break out of the loops
where you decrement the refcount atomically,
if you choose to do so.

I don't care which way it is coded, since they
are all safe and the worst-case failure is one
additional pass through the mountlist.
pr_exportcnt is just an optimization to avoid
the extra pass through the mountlist. Since this
happens infrequently, I am also happy to just
get rid of pr_exportcnt.

Let me know what you prefer.

sys/kern/vfs_export.c
516

I used atomics instead of just depending on
the MNT_ILOCK() mutex because I thought
it looked "weird" that a field in "struct prison"
was protected by a mutex in "struct mount".

Now that I think a little more, MNT_ILOCK isn't enough to protect it, because pr_exportcnt isn't limited to a single mount point. So it still needs to be atomic.
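The point that a per-mount mutex cannot serialize updates to a counter shared across mounts can be demonstrated with a userspace sketch (hypothetical names; pthread mutexes stand in for MNT_ILOCK()): two threads decrement the same counter while each holds a *different* mutex, so the locks provide no mutual exclusion for the counter, and correctness rests entirely on the decrement itself being atomic.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

/* One shared per-prison counter, two per-mount locks. */
static atomic_int exportcnt;
static pthread_mutex_t mnt_lock_a = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t mnt_lock_b = PTHREAD_MUTEX_INITIALIZER;

static void *
drop_exports(void *arg)
{
	pthread_mutex_t *mnt_lock = arg;

	for (int i = 0; i < 100000; i++) {
		pthread_mutex_lock(mnt_lock);	/* MNT_ILOCK() stand-in */
		/* Holding different mutexes, the two threads still race
		 * on exportcnt; the atomic op is what keeps it exact. */
		atomic_fetch_sub(&exportcnt, 1);
		pthread_mutex_unlock(mnt_lock);
	}
	return (NULL);
}
```

With a plain non-atomic decrement here, lost updates would leave the counter nonzero.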

I don't care which way it is coded, since they
are all safe and the worst-case failure is one
additional pass through the mountlist.
pr_exportcnt is just an optimization to avoid
the extra pass through the mountlist. Since this
happens infrequently, I am also happy to just
get rid of pr_exportcnt.

Let me know what you prefer.

I like pr_exportcnt existing because it bypasses the usual case of no exports at all. And since we know it won't be incrementing, I guess it's OK in a local variable here (perhaps with a comment). But I don't think we can assert that i == 0.