Details

Reviewers

markj
mjg
jah
olce
rmacklem

Commits

rG326836a1ebc6: nullfs: do not allow bypass on copy_file_range()
rGc5405d1c8507: vn_copy_file_range(): provide ENOSYS fallback to vn_generic_copy_file_range()
rGa9bc8637690c: vn_copy_file_range(): find write vnodes on which to call the VOP
rG4cbe4c48a7e5: VFS: add VOP_GETLOWVNODE()

Summary

VFS: add VOP_GETWRITEVNODE()

It is similar to VOP_GETWRITEMOUNT(), and for given vnode vp should
return the lower vnode which would actually handle write to vp.

vn_copy_file_range(): find write vnodes on which to call the VOP

vn_copy_file_range(): provide ENOSYS fallback to vn_generic_copy_file_range()

nullfs: do not allow bypass on copy_file_range()

There must be no callers of VOP_COPY_FILE_RANGE() except
vn_copy_file_range(), which does enough to find the write-vnodes where
to call the VOP.

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

kib created this revision.Nov 14 2023, 8:11 PM

Herald added a subscriber: imp. · View Herald TranscriptNov 14 2023, 8:11 PM

kib requested review of this revision.Nov 14 2023, 8:11 PM

cy added a subscriber: cy.Nov 14 2023, 9:35 PM

chris_cretaforce.gr added a subscriber: chris_cretaforce.gr.Nov 14 2023, 9:48 PM

kib added reviewers: rmacklem, markj, mjg, jah.Nov 14 2023, 10:09 PM

kib edited the summary of this revision. (Show Details)

kib added a subscriber: pho.

vn_copy_file_range passes down vnodes from different mount points and all filesystems implementing the vop (apart from zfs) have an explicit check that mount points patch. iow the check that this is an instance of the same filesystem type is redundant for their case.

So imo the thing to do is to *remove* the mnt_vfc comparison from vn_copy_file_range and instead make it in zfs.

Then nullfs code wont have to deal with this bit either, it can just unwrap vnodes and pass them down to VOP_COPY_FILE_RANGE.

In D42603#971994, @mjg wrote:

vn_copy_file_range passes down vnodes from different mount points and all filesystems implementing the vop (apart from zfs) have an explicit check that mount points patch. iow the check that this is an instance of the same filesystem type is redundant for their case.

So imo the thing to do is to *remove* the mnt_vfc comparison from vn_copy_file_range and instead make it in zfs.

Then nullfs code wont have to deal with this bit either, it can just unwrap vnodes and pass them down to VOP_COPY_FILE_RANGE.

I think it should be reverse. VFS must ensure that either fs accepts non-equal mp's, or check for inmp == outmp. I will do the pass. Right now (or with your proposal) all filesystems have to implement fallback as well.

The VFS layer trying to babysit all filesystems is a long standing design flaw, which adds overhead to everyone and only makes optimisations clunky. For example for almost all filesystems VOP_CLOSE has next to nothing to do and most definitely does not need write suspension nor the vnode lock (and for zfs the routine is a nop) -- if there was no attempt to decide for the filesystem what it needs, there would be no problem.

In case of copy_file_range it was originally pre-decided that filesystems can't do cross mount copies, but then it turned out zfs very much can do them and the check was augmented for fs type. Perhaps further special optimizations will be possible for copies across different fs types.

Should one add the usual flags to special case everything in the layer, that would require even more flags to add later.

Meanwhile, in the current code everyone (modulo zfs) already has a mount point check, which already guarantees they only operate on their "own" vnodes without adding cost elsewhere.

I plan to do something different after this patch goes in.
Most complications for fs come from the need to call vn_generic_copy_file_range() as fallback. I intend to make a special return code (or just repurpose ENOSYS) to ask VFS to do that after VOP call. It seems that this even eliminates the need of the flag to allow cross-mount copies, not sure completely. That would fit the idea that VOPs should implement just fs-specific code.

Anyway, this patch is an immediate fix and should go in.

zlei added a subscriber: zlei.Nov 16 2023, 3:29 AM

markj added inline comments.Nov 16 2023, 3:02 PM

sys/fs/nullfs/null_vnops.c
1157	How can this situation happen? vn_lock() ensures that the vnodes are not doomed.

rmacklem added inline comments.Nov 16 2023, 4:28 PM

sys/fs/nullfs/null_vnops.c
1157	I think it's because the upper vnodes are now unlocked. Does the fact that the upper mounts are busied before the call guarantee that the lower mounts will remain valid here and not be free'd? (If that were to happen, inlmp or outlmp might point to other data.) It might be safer to capture inlvp->v_mount->mnt_vfc and outlvp->v_mount->mnt_vfc above when the vnodes are locked. i.e. invfc = inlvp->v_mount->mnt_vfc; and outvfc = outlvp->v_mount->mnt_vfc done above when the vnode is locked. You could then use those values for the check below. (As mjg@ noted, since *mnt_vfc is static, it is always safe to use and the strcmp() can be replaced with invfc == outvfc.) i.e. Replace line 1166,1167 with if (invfc == outvfc)

In D42603#972555, @kib wrote:

I plan to do something different after this patch goes in.
Most complications for fs come from the need to call vn_generic_copy_file_range() as fallback. I intend to make a special return code (or just repurpose ENOSYS) to ask VFS to do that after VOP call. It seems that this even eliminates the need of the flag to allow cross-mount copies, not sure completely. That would fit the idea that VOPs should implement just fs-specific code.

There are 0 complications which would be eliminated with flags or ENOSYS.

In particular introduction of ENOSYS will mean the same things have to be tested in vops, except now the vop wrapper has to check the error. If (possibly renamed) vn_generic_copy_file_range is a documented "if you don't know what to do, call this" routine things just work without the extra check.

I massaged what I mean into a patch, with your nullfs change as basis:

https://people.freebsd.org/~mjg/copy_file_range_nullfs.diff

edit: I think there is a more general problem here with nullfs vs unionfs and other stacks. I think what's needed is a vop to unwrap a vnode recursively, to always get to the bottom.

kib added inline comments.Nov 16 2023, 7:57 PM

sys/fs/nullfs/null_vnops.c
1157	Yes, upper vnodes are unlocked, and then lower vnodes as well, allowing them to be reclaimed.

In D42603#971994, @mjg wrote:

Then nullfs code wont have to deal with this bit either, it can just unwrap vnodes and pass them down to VOP_COPY_FILE_RANGE.

As you realized later, this is far less simple than that.

In D42603#971995, @kib wrote:

I think it should be reverse. VFS must ensure that either fs accepts non-equal mp's, or check for inmp == outmp. I will do the pass. Right now (or with your proposal) all filesystems have to implement fallback as well.

The generic VFS indeed most probably should be in charge here, see below.

In D42603#972173, @mjg wrote:

The VFS layer trying to babysit all filesystems is a long standing design flaw, which adds overhead to everyone and only makes optimisations clunky. For example for almost all filesystems VOP_CLOSE has next to nothing to do and most definitely does not need write suspension nor the vnode lock (and for zfs the routine is a nop) -- if there was no attempt to decide for the filesystem what it needs, there would be no problem.

Not sure what you are meaning by "babysit". Deciding for the filesystems what is good for them can be a problem, but not only this can be generally solved by having them declare what they can cope with and adapt the generic code, it can even be a better design when the different axes of expectations have a clear meaning for all, or almost all, filesystems (I'm saying axes, not directions on the axes).

As a simple example, VOP_READDIR() implementations should check (on INVARIANTS) that the passed vnode is a directory. Today, some of the filesystems check this unconditionally, unnecessary duplicating code, and others don't at all whereas they should. This check should be placed in the VOP_READDIR() pre function and we should remove all ad-hoc filesystem code. This wouldn't preclude adding filesystems that would allow VOP_READDIR() on non-directories (although the generic layer IIRC is not prepared to handle that), it is just a matter of adding a flag to use in struct vfs_conf and a trivial modification in the generic code to forego the automatic check.

In D42603#972173, @mjg wrote:

In case of copy_file_range it was originally pre-decided that filesystems can't do cross mount copies, but then it turned out zfs very much can do them and the check was augmented for fs type. Perhaps further special optimizations will be possible for copies across different fs types.

From a pure technical standpoint, it may seem that this initial decision was shortsighted given that ZFS has been around for a long time now and manages space by pools and not filesystems, so it should come as no surprise that cross-mount copies can (now) share blocks (I think btrfs has been doing this for a while). But I guess it was rather made simply to ease developing an initial version of the functionality in the first place, because there's no semantical problem in not supporting optimized copies even when they could and the most common case after all is optimizing them within a single filesystem. Additionally, evolving the initial situation is not harder than dealing with it from the start.

On the other hand, optimized support for copies across different FS types has no practical chance to ever become a reality. This would defeat the purpose of having filesystems formats/protocols in the first place and would be extremely fragile. The idea works, e.g., on ZFS precisely because it can operate at the very low level of blocks. So there's no need to worry about a subsequent evolution in this direction because most probably there won't be any. If you have concrete ideas that oppose this reasoning, I'm all ears.

In D42603#972173, @mjg wrote:

Should one add the usual flags to special case everything in the layer, that would require even more flags to add later.

Not in the case at hand, see below.

In D42603#972850, @mjg wrote:

In particular introduction of ENOSYS will mean the same things have to be tested in vops, except now the vop wrapper has to check the error. If (possibly renamed) vn_generic_copy_file_range is a documented "if you don't know what to do, call this" routine things just work without the extra check.

I guess @kib's idea here was that ENOSYS would just percolate up to the generic code. The wrappers never need to check for ENOSYS, they just pass it along.

In D42603#972851, @mjg wrote:

edit: I think there is a more general problem here with nullfs vs unionfs and other stacks. I think what's needed is a vop to unwrap a vnode recursively, to always get to the bottom.

Indeed, there can be arbitrary stacked layers. But the main difficulty here, the one that has the most influence on the architecture to use, is that, contrary to anything that has existed so far in the VFS, you have to "get to the bottom" of (IOW, unwrap recursively) *both* passed vnodes, which the VFS model of single dispatch can't support automatically. And this is the reason why it should most probably be managed by vn_copy_file_range() (or some other location in the generic VFS): If it is not, then each stacked filesystem will have to add the (same) code to try to unwrap the destination vnode.

For the unwrapping, two new VOPs are necessary (or a single with flags, it doesn't matter much), which are to unwrap a vnode with the intent to read for the first and to read/write for the second.

This is what I can foresee given the problem and the discussion so far. After having seen some code, we'll know whether there is some devil in the details.

Assuming that busying the upper mount points
ensures that the lower mount points do not
disappear, I am fine with this patch.
I am also fine with leaving it to the file
system's vop call to check for same fs or fs type.

When I did the code, I did not expect other file
system's VOP_COPY_FILE_RANGE() to need to decide
if/when it needed to be done via vn_generic_copy_file_range().
As such, I coded it with the NFS client calling vn_generic_copy_file_range().
I have no problem with this being changed to the VOP_COPY_FILE_RANGE()
returning an error to indicate "do it with vn_generic_copy_file_range()".

Complete rewrite:

provide ENOSYS fallback, removing calls to vn_generic_copy_file_range() from all VOP implementations
handle bypass for nullfs with VOP_GETWRITEVNODE()

This becomes as clear as it could be, IMO.

Herald added subscribers: asomers, delphij. · View Herald TranscriptNov 18 2023, 9:42 AM

The overall result looks great (and also confirms my previous comment)!

Given my inline comment suggesting to implement something like VOP_GETREADVNODE(), you may think of an alternative (which, as said above already, I don't advocate, but mention here for completeness) consisting of not having this new call but then having implementations of VOP_COPY_FILE_RANGE() call VOP_GETWRITEVNODE() themselves. But I'd prefer to keep it the way it's now plus a new VOP_GETREADVNODE().

Thanks.

sys/fs/nullfs/null_vnops.c
1142	Handling of several layers of stacked filesystems here requires calling `VOP_GETWRITEVNODE()` on the lower vnode.
sys/kern/vfs_vnops.c
3081 ↗	(On Diff #130296)	It's strange to see `VOP_GETWRITEVNODE()` used on a vnode that is not going to be written. Moreover, in the case of unionfs, retrieving a vnode for write may not give the same result as retrieving a vnode for read, because in the first case (write) the lower vnode may have to be copied up, which is unnecessary in the second (read). So I think this should instead be a call to a new `VOP_GETREADVNODE()` or something.

emaste added a subscriber: emaste.Nov 21 2023, 3:28 PM

I am not familiar with the stacking of vfs's, so I
can't really help with this.

markj added inline comments.Nov 23 2023, 3:01 PM

sys/fs/nullfs/null_vnops.c
1142	Indeed, I'd expect this to behave like VOP_GETWRITEMOUNT and recursively fetch the write vnode from the lower vnode.

Make nullfs bypass for VOP_GETWRITEVNODE() recursive.
Add flags to specify intent (for unionfs).

olce added inline comments.Nov 23 2023, 8:00 PM

sys/kern/vfs_vnops.c
3081 ↗	(On Diff #130296)	You've collapsed `VOP_GETWRITEVNODE()` and the proposed `VOP_GETREADVNODE()` in a single VOP call, that's great. Could you then consider renaming it, e.g., into something like `VOP_GETLOWERVNODE()` or `VOP_GETINNERVNODE()`, now that it is not only about writing?

kib added inline comments.Nov 24 2023, 5:12 AM

sys/kern/vfs_vnops.c
3081 ↗	(On Diff #130296)	I considered that. I like the symmetry between VOP_GETWRITEMOUNT() and VOP_GETWRITEVNODE(), which is why I selected this slightly inconsistent name.

olce added inline comments.Nov 24 2023, 9:22 AM

sys/kern/vfs_vnops.c
3081 ↗	(On Diff #130296)	I can understand that, especially when you created it for the initial version. It is certainly less cognitive burden short-term. But now that the operation takes a flag that can be `FREAD`, the symmetry is much less clear (although what's still common is the "unwrap" part) and the name misleading. IMHO, it makes it harder to "discover" that the operation to get a vnode to read from actually exists, so I don't find it very wise in a longer-term perspective. Could you please reconsider?

s/VOP_GETWRITEVNODE/VOP_GETLOWVNODE

olce accepted this revision.Nov 24 2023, 5:06 PM

olce added inline comments.

sys/kern/vfs_vnops.c
3081 ↗	(On Diff #130296)	Thank you!

This revision is now accepted and ready to land.Nov 24 2023, 5:06 PM

markj accepted this revision.Nov 27 2023, 5:28 PM

So again what's the benefit of bubbling up ENOSYS? I assumed it would at least get handled in a post-vop hook instead of going all the way up to the caller.

For comparison here is a patch which incorporates the unwrapping routine and still makes everyone all the fallback instead: https://people.freebsd.org/~mjg/copy_file_range_nullfs2.diff

In D42603#975872, @mjg wrote:

So again what's the benefit of bubbling up ENOSYS? I assumed it would at least get handled in a post-vop hook instead of going all the way up to the caller.

For comparison here is a patch which incorporates the unwrapping routine and still makes everyone all the fallback instead: https://people.freebsd.org/~mjg/copy_file_range_nullfs2.diff

I thought that the benefits are clear from the code:

the generic handler is called in only one place, instead of requiring all VOP implementations to care about it. This follows the principle that VOP should implement the method for its fs and not generic service.
pre/post hooks are too abusive in this case, where all VOP callers can (and shall) use vn_copy_file_range() and not VOP directly. Not using post hook makes the code more explicit.

In D42603#975872, @mjg wrote:

So again what's the benefit of bubbling up ENOSYS? I assumed it would at least get handled in a post-vop hook instead of going all the way up to the caller.

Now that the unwrapping happens at start, the call to VOP_COPY_FILE_RANGE() is no more recursive, so the "way up" is exactly one return.

Closed by commit rG4cbe4c48a7e5: VFS: add VOP_GETLOWVNODE() (authored by kib). · Explain WhyNov 28 2023, 5:34 PM

This revision was automatically updated to reflect the committed changes.

kib added a commit: rG4cbe4c48a7e5: VFS: add VOP_GETLOWVNODE().

kib added a commit: rGa9bc8637690c: vn_copy_file_range(): find write vnodes on which to call the VOP.

kib added a commit: rGc5405d1c8507: vn_copy_file_range(): provide ENOSYS fallback to vn_generic_copy_file_range().

kib added a commit: rG326836a1ebc6: nullfs: do not allow bypass on copy_file_range().