Page MenuHomeFreeBSD

Fix ZFS so that it sets mnt_exjail for snapshot automounts
ClosedPublic

Authored by rmacklem on Nov 19 2023, 3:14 AM.
Tags
None
Referenced Files
Unknown Object (File)
Fri, Jan 17, 10:15 PM
Unknown Object (File)
Fri, Jan 17, 10:11 PM
Unknown Object (File)
Fri, Jan 17, 12:38 PM
Unknown Object (File)
Sun, Jan 12, 7:22 PM
Unknown Object (File)
Sat, Dec 28, 7:13 AM
Unknown Object (File)
Wed, Dec 25, 9:13 PM
Unknown Object (File)
Tue, Dec 24, 10:14 PM
Unknown Object (File)
Tue, Dec 24, 10:04 PM
Subscribers

Details

Summary

When a process attempts to access a snapshot under
/<dataset>/.zfs/snapshot, the snapshot is automounted.
However, without this patch, the automount does not
set mnt_exjail, which results in the snapshot not being
accessible over NFS.

This patch defines a new function called vfs_exjail_clone()
which sets mnt_exjail from another mount point and
then uses that function to set mnt_exjail in the snapshot
automount.

Test Plan

Created a snapshot in /<dataset>/.zfs/snapshot on an
NFS exported /dataset and then accessed it via NFS from
an NFS client.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

This problem is now in FreeBSD's bugzilla as
PR#275200 and OpenZFS's github as #15546.

Once reviewed, I can commit the addition of
vfs_exjail_clone(). Others will need to do the
ZFS commit stuff, since I have no idea what
that requires.

Ok, so am I supposed to get the ZFS part
reviewed over on OpenZFS's git via a pull
request?

Should I just have the non-ZFS part here?

@rmacklem Yes, you should open PR in OpenZFS git. We are trying to keep FreeBSD ZFS code in sync with upstream, allowing out of the base builds. But considering you also extend FreeBSD KPI, I guess the process should include first extending the KPI and bumping FreeBSD version, and then opening upstream PR ifdef'ing that version, since the same upstream ZFS sources are expected to work on any FreeBSD version starting from 12.2 and up.

And just to confirm (as Mike Karels suggested in an email),
by PR you mean a git pull request and not a Problem Report?

This patch is the vfs_exjail_clone() subset of the
previous one, since this is all I can commit to FreeBSD.

When I commit this, I will bump __FreeBSD_version for
main plus stable/14 and stable/13 when MFC'd.

After that, I can try and do a pull request of the ZFS
part on github. (I need all the __FreeBSD_version values
bumped as above, before I can create the ZFS patch.

This looks fine to me, but I'm not a vfs expert.

Do we need to interlock somehow with vfs_exjail_delete()? In other words, what happens if ZFS tries to clone the reference to a mountpoint as vfs_exjail_delete() is purging references to the prison, and the target mountpoint has already been visited?

Patch has been updated to check for "same prison
as the thread is in before doing the cloning".
I think this is sufficient to deal with interaction
with vfs_exjail_delete().

Since vfs_exjail_delete() is called when there are
no processes running in the jail, a caller of
vfs_exjail_clone() cannot be running in that jail.
(Probably running in prison0, since the jails should
be restricting other processes.)

For an NFS mount accessed by the nfsd, I beleive
this should never happen, since
the exports force the nfsd theads to be running in
the jail that did the exports.

However, a process on the server running outside
of the jails (or in prison0 if you prefer) could trigger
the "automount", even if the exports were done within
a jail.
--> For this case, the mnt_exjail should not be cloned.

Do we need to interlock somehow with vfs_exjail_delete()? In other words, what happens if ZFS tries to clone the reference to a mountpoint as vfs_exjail_delete() is purging references to the prison, and the target mountpoint has already been visited?

Good Catch!!

As noted in the comment for the updated patch, a process
that is not in the jail that exported the file system could
trigger the "automount". For this case, the cloning should
not be done.

I think this is sufficient to deal with interaction with
vfs_exjail_delete(), since it is called when there are
no processes running in the jail.

Since vfs_exjail_delete() is called when there are no processes running in the jail, a caller of vfs_exjail_clone() cannot be running in that jail.

It's not clear to me that the first part is true. In particular, prison_deref() first calls prison_deref_kill(), which calls prison_cleanup() and thus vfs_exjail_delete(); later it kills processes within the jail.

Hmm, unless I'm missing something, this is at odds with the comment in vfs_exjail_delete(). Perhaps we need to check the prison state when setting mnt_exjail? i.e., refuse to set it if the jail is dying.

rmacklem added a reviewer: jamie.

I realized that the previous version on the patch
would result in the weird case where:
A process running locally in the NFS server that
is not in the jail that has exported the fs could
automount the snapshot, but it would not be
exported.

Then the NFS client would not be able to access
the snapshot.

This version of the patch does what I think it a
correct interaction with the jail, to ensure it is
alive when cloned.

I have added jamie to the reviewers list, since
I think he understands the jail stuff.

Since vfs_exjail_delete() is called when there are no processes running in the jail, a caller of vfs_exjail_clone() cannot be running in that jail.

It's not clear to me that the first part is true. In particular, prison_deref() first calls prison_deref_kill(), which calls prison_cleanup() and thus vfs_exjail_delete(); later it kills processes within the jail.

Hmm, unless I'm missing something, this is at odds with the comment in vfs_exjail_delete(). Perhaps we need to check the prison state when setting mnt_exjail? i.e., refuse to set it if the jail is dying.

Well, the comment on vfs_exjail_delete() states that no processes are in the
prison, but maybe the comment is bogus?
Anyhow, I think this version might be ok, but hopefully you or jamie@ can
confirm this?

My understanding (which could be wrong) is that a prison cannot go from
active to dying when there is a p_uref held on it. If that is the case, then
prison_proc_hold()/prison_proc_free() should ensure that it remains alive
until after the cloning is done, I hope?

Yuck. It looks like the only way to ensure that a jail
does not transition from alive to dying is to hold
the allprison_lock or pr_mtx.

I'll look to see if I can hold the pr_mtx while
acquiring MNT_ILOCK(). If I cannot safely do that,
I will acquire the allprison_lock as well.
And I'll use prison_isalive() instead of checking the
pr_state directly.

Don't bother looking at this patch version.
I put up a new version later.

Since vfs_exjail_delete() is called when there are no processes running in the jail, a caller of vfs_exjail_clone() cannot be running in that jail.

It's not clear to me that the first part is true. In particular, prison_deref() first calls prison_deref_kill(), which calls prison_cleanup() and thus vfs_exjail_delete(); later it kills processes within the jail.

Hmm, unless I'm missing something, this is at odds with the comment in vfs_exjail_delete(). Perhaps we need to check the prison state when setting mnt_exjail? i.e., refuse to set it if the jail is dying.

Well, the comment on vfs_exjail_delete() states that no processes are in the
prison, but maybe the comment is bogus?
Anyhow, I think this version might be ok, but hopefully you or jamie@ can
confirm this?

My understanding (which could be wrong) is that a prison cannot go from
active to dying when there is a p_uref held on it. If that is the case, then
prison_proc_hold()/prison_proc_free() should ensure that it remains alive
until after the cloning is done, I hope?

That's no longer the case. prison_remove(2) will now mark a prison as dying, call prison_cleanup(), and then kill its processes. So as it is, holding pr_uref is enough to keep a prison from passively dying, but it can still be forced into dying state regardless. This is likely not the only case where I might have violated some assumptions with that change.

This version of the patch acquires a shared lock on
allprison_lock (which looks sufficient to ensure the
jail does not go from alive to dying.
It also acquires pr_mtx to check for prison_isalive(),
although I am not 100% sure it is needed?

This version of the patch acquires a shared lock on
allprison_lock (which looks sufficient to ensure the
jail does not go from alive to dying.
It also acquires pr_mtx to check for prison_isalive(),
although I am not 100% sure it is needed?

Because you need to lock both allprison_lock (exclusive) and pr_mtx to change the status, you only need one of those (like the shared allprison_lock) to check it.

This version of the patch only slocks allprison_lock,
as suggested by jamie@.

This revision is now accepted and ready to land.Nov 23 2023, 2:50 PM