Page MenuHomeFreeBSD

Add support for mounting single files in nullfs
ClosedPublic

Authored by dfr on Nov 23 2022, 4:35 PM.
Tags
None
Referenced Files
Unknown Object (File)
Sun, Jan 5, 11:27 AM
Unknown Object (File)
Tue, Dec 31, 6:24 AM
Unknown Object (File)
Mon, Dec 9, 8:16 PM
Unknown Object (File)
Dec 4 2024, 12:08 AM
Unknown Object (File)
Dec 3 2024, 11:07 PM
Unknown Object (File)
Dec 3 2024, 2:33 AM
Unknown Object (File)
Dec 2 2024, 9:05 AM
Unknown Object (File)
Dec 1 2024, 2:45 PM

Details

Summary

My main use-case for this is to support mounting config files and secrets
into OCI containers. My current workaround copies the files into the
container which is messy and risks secrets leaking into container images
if the cleanup fails.

Allow mounting files as well as directories

This adds a VFCF flag to indicate whether the filesystem supports file
mounts and allows fspath to be either a directory or a file if the flag
is set.

Test Plan

$ sudo mkdir -p /mnt
$ sudo touch /mnt/foo
$ sudo mount -t nullfs /COPYRIGHT /mnt/foo

  1. head -1 /mnt/foo
  2. @(#)COPYRIGHT 8.2 (Berkeley) 3/21/94

The compilation of software known as FreeBSD is distributed under the
following terms:

Copyright (c) 1992-2021 The FreeBSD Project.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Updated to include requested changes. I'm still working on a regular-file equivalent to vn_path_to_global_path.

In D37478#852943, @mjg wrote:

The invocation suggested is definitely a problem: sudo mount -t nullfs /COPYRIGHT /mnt/foo

should someone try to mount a regular file by accident on stock system their attempt is going to fail, patched one will succeed when it should not have.

I think there should be an explicit option (say -o regfile or similar) which would *require* the target to be VREG when used, VDIR otherwise.

I'm not sure what the problem is here. Assuming that /mnt/foo is a regular file, why should the example mount fail? Surely the checks to enforce that the two objects are either both file or both directory is sufficient.

sbin/mount/getmntopts.c
155

I am mostly attempting to follow existing here but there is clearly some redundancy. This kind of checking is happening in mount, mount_nullfs and kernel. It would be nice to have only kernel checking but it seems like that is best left until we can get better error messages from kernel.

sys/kern/vfs_mount.c
1564 ↗(On Diff #113551)

I will work on adding a vn_path_to_global_path_hardlink with similar semantics, ideally with a simpler interface to vn_fullpath_hardlink.

Add vn_path_to_global_path_hardlink. This works for trivial tests and makes
the code in vfs_domount more readable. I'll update my podman test VM and
stress test it a little.

dfr marked an inline comment as done.Nov 28 2022, 4:07 PM
In D37478#852960, @dfr wrote:
In D37478#852943, @mjg wrote:

The invocation suggested is definitely a problem: sudo mount -t nullfs /COPYRIGHT /mnt/foo

should someone try to mount a regular file by accident on stock system their attempt is going to fail, patched one will succeed when it should not have.

I think there should be an explicit option (say -o regfile or similar) which would *require* the target to be VREG when used, VDIR otherwise.

I'm not sure what the problem is here. Assuming that /mnt/foo is a regular file, why should the example mount fail? Surely the checks to enforce that the two objects are either both file or both directory is sufficient.

I mean they made a mistake and with your patch it will happen to "work" even though it is not what was meant.

sys/kern/vfs_mount.c
1558 ↗(On Diff #113555)

this needs to test on VREG? a switch statement is probably best here

In D37478#853071, @mjg wrote:
In D37478#852960, @dfr wrote:
In D37478#852943, @mjg wrote:

The invocation suggested is definitely a problem: sudo mount -t nullfs /COPYRIGHT /mnt/foo

should someone try to mount a regular file by accident on stock system their attempt is going to fail, patched one will succeed when it should not have.

I think there should be an explicit option (say -o regfile or similar) which would *require* the target to be VREG when used, VDIR otherwise.

I'm not sure what the problem is here. Assuming that /mnt/foo is a regular file, why should the example mount fail? Surely the checks to enforce that the two objects are either both file or both directory is sufficient.

I mean they made a mistake and with your patch it will happen to "work" even though it is not what was meant.

That kind of makes sense - the closest equivalent elsewhere is on Linux where bind mounts ignore the mount type but require the MS_BIND flag. I'm not sure gating this with an option is the best interface but don't feel that strongly. A possible alternative would be a sysctl to globally enable/disable this new feature?

sys/kern/vfs_mount.c
1558 ↗(On Diff #113555)

I don't see the need for a switch - the only reason I needed something other than vn_path_to_global_path here was that non-directories need a little help to resolve, e.g. parent directory plus leaf component. This would be the same for types other than VREG. Perhaps this can be reduced to just vn_path_to_global_path_hardlink since that should worok just as well for directories.

I can't think of a scenario where something other than VDIR or VREG is useful here but I would prefer to leave that kind of policy enforcement to vfs_domount_first which is where all the other sanity checks are applied.

Added a couple of comments above

you should ask pho@ to test this, I would not be surprised if an assert tripped over

addition of vn_path_to_global_path_hardlink has to be a standalone commit

sys/kern/vfs_mount.c
1558 ↗(On Diff #113555)

my point is that that you may find yourself with neither VDIR nor VREG here and pass it down to vfs_domount_first. bare minimum this requires a comment that the type is checked in that routine.

Added a comment to note that vnode type validation happens in vfs_domount_first. Also fixed a couple of line-too-long formatting issues.

I will also reach out to pho@ for advice on stress testing.

I'm going to have to take a closer look. v_mountedhere is in a union which can be used for other purposes for non-VDIR vnodes, chances are decent this trips over.

Split out vn_path_to_global_path_hardlink into its own commit.

Added an ugly workaround for a panic caused by running realpath on a file mount.

The proposed hack in kern___realpathat is definitely not the way to do it -- by the time the second lookup is performed the name may be different.

I figured this corner case should have an obvious solution: the target vnode has v_mountedhere, from there you should be able to reliably find the vnode from the mounted on filesystem, and that should be unique and consequently easily sorted out with reverse path resolution.

Except I realized there is another issue: hardlinks to the mount point losing the uniqueness which sounds like a huge can of worms. You can definitely check there are no hardlinks at mount time and deny creating them when mounted.

Another potential can of worms is stacked mounts, one after another. Should probably also get disallowed.

That said, I'm going to have to chew on this, there are more corner cases to consider that I initially thought.

In D37478#854930, @mjg wrote:

The proposed hack in kern___realpathat is definitely not the way to do it -- by the time the second lookup is performed the name may be different.

I figured this corner case should have an obvious solution: the target vnode has v_mountedhere, from there you should be able to reliably find the vnode from the mounted on filesystem, and that should be unique and consequently easily sorted out with reverse path resolution.

Except I realized there is another issue: hardlinks to the mount point losing the uniqueness which sounds like a huge can of worms. You can definitely check there are no hardlinks at mount time and deny creating them when mounted.

Another potential can of worms is stacked mounts, one after another. Should probably also get disallowed.

That said, I'm going to have to churn on this, there are more corner cases to consider that I initially thought.

I realised my change to kern___realpathat was broken shortly after I updated the diff. For one thing, it returns the wrong path if the input path is relative. At the moment, I'm conditionally disabling the vp_crossmp replacement for lookups where the mount point is the last path element but that probably opens up the same LOR from commit 7f92c4e which crossmp was introduced to fix.

Thanks for the hint about v_mountedhere - I will try to make that work. Also, restricting these mounts to files without hardlinks etc. makes sense to me so I will add a check for that.

I'm not confident the LOR is really there and even if it is, crossmp is a hack solution. unmount can instead be patched around to not suffer the problem, but I don't have the time right now to look into it.

In D37478#854931, @dfr wrote:

Thanks for the hint about v_mountedhere - I will try to make that work. Also, restricting these mounts to files without hardlinks etc. makes sense to me so I will add a check for that.

Erm, I meant v_mount. The idea being you can test this is a nullfs file mount, and if so, v_mount can be used to walk up as it contains a pointer to the covered vnode.

Second attempt at fixing realpath for file mounts - use the covered vnode
with vn_fullpath. This assumes that the covered vnode is in the name cache
which I'm not convinced is a good assumption.

This also prevents mounting over files with more than one hardlink.
Attempting to create a hardlink to a file mount after its mounted is already
forbidden since it would be a cross-device mount. We should also disable
stacking file mounts but I'm not sure the right way to detect this.

Slightly better logic when testing for file mounts in realpath

In D37478#855162, @pho wrote:

This one looks like an attempt to mount a file onto an existing file mount. I'm going to disable that.

Don't allow stacking of file mounts

Second attempt - we need to check for file mount stacking before the
call to vn_path_to_global_path_hardlink to avoid a panic

Third attempt at stopping stacks of file mounts. With this one, pho@'s
n30.sh test runs without panicing the first time it attempts to stack
files.

I have not observed any problems while testing with D37478.113884.patch

dfr marked 2 inline comments as done.Dec 8 2022, 12:10 PM

Thanks for all your work Peter!

Assuming that we complete the review process and get this change into -current, does anyone have concerns merging this into 13-stable? It would be really nice to have this feature available in 13.2-RELEASE.

kib added inline comments.
sbin/mount_nullfs/mount_nullfs.8
73
sys/kern/vfs_cache.c
3162 ↗(On Diff #113884)

Should be a blank line above this one.

That said, what does prevent reclaiming of nd.ni_vp? It is not locked, and this allows both unmount and reclaim, which causes v_mount to become NULL.

3165 ↗(On Diff #113884)
3856 ↗(On Diff #113884)

Since you unlock the vnode right after the assert, should you also assert that the vnode has use-ref on it?

BTW, I remember Solaris had the ability to mount a regular file over a regular file, without stacking helpers.

I mean, why requiring nullfs there? Just keep a reference to the lower vnode as v_mountedthere and make lookup() handle it. This is probably as close to Linux' bind mount as it is possible.

In D37478#856005, @kib wrote:

BTW, I remember Solaris had the ability to mount a regular file over a regular file, without stacking helpers.

I mean, why requiring nullfs there? Just keep a reference to the lower vnode as v_mountedthere and make lookup() handle it. This is probably as close to Linux' bind mount as it is possible.

I think using nullfs here simplifies the change quite a bit - doing things the bind way would involve inventing a way of expressing a non-filesystem mount to nmount, plumbing that through in /sbin/mount etc.

One thing we do get from nullfs is that the mounted file gets a new st_dev which prevents hardlinks if thats useful.

sys/kern/vfs_cache.c
3162 ↗(On Diff #113884)

I can lock nd.ni_vp before dereferencing v_mount. Would that be enough and do I still need to call vfs_busy? I'm not sure about the ordering of these locks.

sys/kern/vfs_cache.c
3162 ↗(On Diff #113884)

vfs_busy() is before any vnode lock, see a long explanation right above it in vfs_subr.c.

I think that locking ni_vp and checking for doomed state should be enough to guarantee the path to covered vnode reference.

Whitespace fix, add locking to kern___realpathat

dfr marked 2 inline comments as done.

Style fix 'if (error != 0)'

dfr marked an inline comment as done.Dec 10 2022, 2:10 PM
dfr added inline comments.
sys/kern/vfs_cache.c
3856 ↗(On Diff #113884)

I don't quite understand what a 'use-ref' is in this context. The assert here is mostly copying the logic in vn_path_to_global_path.

sys/kern/vfs_cache.c
3166 ↗(On Diff #114034)

LK_RETRY implies the check for VN_IS_DOOMED(), this block is not needed

3856 ↗(On Diff #113884)

vref() increments v_usecount, which I sort of mentioned by using the 'use-ref' abbreviation.

sys/kern/vfs_cache.c
3166 ↗(On Diff #114034)

Sorry, I mean "remove LK_RETRY", then vn_lock() does the check for VN_IS_DOOMED()

dfr marked 2 inline comments as done.

Simplify locking in kern___realpathat

sys/kern/vfs_cache.c
3856 ↗(On Diff #113884)

I could add KASSERT(vrefcnt(vp) > 0) but it doesn't match the logic in vn_path_to_global_path and both functions document their pre-conditions fairly clearly.

This revision is now accepted and ready to land.Dec 12 2022, 1:31 PM
dfr marked an inline comment as done.Dec 12 2022, 2:22 PM

Thanks Konstantin, Mateusz for all your help in the review process. Are there any other areas which need work here? I remember that Mateusz was interested in ensuring that existing erroneous attempts to mount files would succeed with this change. How serious is this potential problem? Perhaps we can mitigate by gating this with a sysctl? I plan to merge this to stable/13 so perhaps have this feature opt in for 13 and opt out for 14 and later?

In D37478#856652, @dfr wrote:

I remember that Mateusz was interested in ensuring that existing erroneous attempts to mount files would succeed with this change. How serious is this potential problem? Perhaps we can mitigate by gating this with a sysctl? I plan to merge this to stable/13 so perhaps have this feature opt in for 13 and opt out for 14 and later?

I personally do not think that this is a serious issue. But if you want to provide a gatekeeping, I would suggest adding a mount option (like mount -o allowfile -t nullfs /from /to). This option is not even needs to be passed to kernel, check for it in mount_null(8). IMO it is more explicit than a hidden sysctl.

I do not see why this patch, in any form, cannot be merged to 13, as is.

Personally, I don't think its necessary to be bug-for-bug compatible in most cases. If Mateusz feels strongly about it I will add a check in mount_nullfs as suggested but if not I'm happy with to move forwards as-is.

What "bug to bug compatiblity"?

I noted mount -t nullfs /some-file /some-other-file would fail on stock kernel and succeed on the patched one. If someone messes up and specifies regular files by accident, this is going to work out when it should not have from their perspective and this is why I suggested a switch to denote what they want, but I'm not going to insist on it.

code-wise, I'm deeply unsatisfied with this crossmp vnode business. I'll look around, perhaps it is fixable, in which case the above patch will be able to drop the realpath kludge.

Past that I don't see problems with the code and I'll ack in a day or two if the crossmp endeavor wont show anything promising.

In D37478#856863, @mjg wrote:

What "bug to bug compatiblity"?

I noted mount -t nullfs /some-file /some-other-file would fail on stock kernel and succeed on the patched one. If someone messes up and specifies regular files by accident, this is going to work out when it should not have from their perspective and this is why I suggested a switch to denote what they want, but I'm not going to insist on it.

I used the wrong term. I was just trying to point out that if an application has a bug (mistakenly mounting a file) or is trying to verify that FreeBSD does not have the ability to mount files, it should not stop us from implementating that feature.

code-wise, I'm deeply unsatisfied with this crossmp vnode business. I'll look around, perhaps it is fixable, in which case the above patch will be able to drop the realpath kludge.

Past that I don't see problems with the code and I'll ack in a day or two if the crossmp endeavor wont show anything promising.

I am also unhappy with the crossmp parts but did not want to attempt to change that in this diff - working around it seemed like the lower-risk option. Looking forward to seeing if you can find a cleaner way.

I ran a full stress2 test with D37478.114064.patch. No problems seen.

@markj is there an easy way to convince syzkaller to play with file mounts?

In D37478#857428, @mjg wrote:

@markj is there an easy way to convince syzkaller to play with file mounts?

I'm not sure about "easy", but it should be pretty straightforward. You'd add a pseudo-syscall probably modeled after syz_mount_image() in executor/common_linux.h. That'd be a C function which wraps nmount(2), and from syzkaller's perspective it's just another system call. Then add some syzlang definitions which bind specific parameter values for syz_mount_image, e.g., by specifying "nullfs" as the filesystem type. These would go in sys/freebsd/mount.txt or so. Again, Linux has some examples here in sys/linux/filesystem.h.

If anyone would like to take this on, let me know and I can provide more details.

well I don't have time to do the syzkaller bit, but someone(tm) should. i'm still not convinced the crossmp business is necessary, but this should not be holding up the patch either.

Thanks everyone. I'm going to rebase and land this today.

This revision was automatically updated to reflect the committed changes.