Paths

Table of Contentst

vfs: Add support for file cloning to VOP_COPY_FILE_RANGE
ClosedPublic
Actions

Authored by rmacklem on Aug 7 2025, 8:41 PM.

Details

Reviewers

kib
asomers
mav

Commits

rG37b2cb5ecb0f: vfs: Add support for file cloning to VOP_COPY_FILE_RANGE

Summary

NFSv4 has a separate CLONE operation from COPY with
a couple of semantics differences. Unlike COPY, CLONE
must complete the "copy on write" and cannot return
partially copied. It also is required to use offsets (and
the length if not to EOF) that are aligned to a buffer
boundary.

Since VOP_COPY_FILE_RANGE() can already do "copy on write"
for file systems that support it, such as ZFS with block
cloning enabled, all this patch does is add a flag called
COPY_FILE_RANGE_CLONE so that it will conform to the
rule that it must do a "copy on write" to completion.

The patch also adds a new pathconf(2) name _PC_CLONE_BLKSIZE,
which acquires the blocksize requirement for cloning and
returns 0 for file systems that do not support the
"copy on write" feature. (This is needed for the NFSv4.2
clone_blksize attribute.)

This patch will allow the implementation of CLONE
for NFSv4.2.

Man page patches will be done as separate reviews.

Test Plan

Ran a simple copy_file_range() on both a ZFS file
system with block cloning enabled and a UFS file
system that does not support cloning.
(The kernel had a ZFS patch that set _PC_CLONE_BLKSIZE
correctly.)

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

rmacklem created this revision.Aug 7 2025, 8:41 PM

Herald added subscribers: olce, imp. · View Herald TranscriptAug 7 2025, 8:41 PM

rmacklem requested review of this revision.Aug 7 2025, 8:41 PM

Shouldn't there also be some ZFS code that returns ENOSYS if the cloning requirement cannot be satisfied? And fuse_vnop_copy_file_range cannot ever satisfy that requirement, so it should return ENOSYS just like in vn_generic_copy_file_range

This revision now requires changes to proceed.Aug 7 2025, 9:16 PM

In D51808#1183453, @asomers wrote:

Shouldn't there also be some ZFS code that returns ENOSYS if the cloning requirement cannot be satisfied? And fuse_vnop_copy_file_range cannot ever satisfy that requirement, so it should return ENOSYS just like in vn_generic_copy_file_range

If fuse has its own VOP_COPY_FILE_RANGE() then,
yes, it should. I'll go look for it.
(zfs_copy_file_range() is ok, since it just falls back on
vn_generic_copy_file_range(), which returns the error.)

The NFS client needs to handle this as well, but that's
a separate commit.

Here's a stupid question. Is it possible for nfsd to reexport an NFS mountpoint? If so, then nfs_copy_file_range will need to check the flag, too.

Added check to fuse_copy_file_range() as pointed
out by asomers@.

In D51808#1183460, @asomers wrote:

Here's a stupid question. Is it possible for nfsd to reexport an NFS mountpoint? If so, then nfs_copy_file_range will need to check the flag, too.

Not a stupid question. Linux allows this and they have endless bugs to deal
with because of it. FreeBSD does not and I hope to keep it that way.

In D51808#1183471, @rmacklem wrote:

In D51808#1183460, @asomers wrote:

Here's a stupid question. Is it possible for nfsd to reexport an NFS mountpoint? If so, then nfs_copy_file_range will need to check the flag, too.

Not a stupid question. Linux allows this and they have endless bugs to deal
with because of it. FreeBSD does not and I hope to keep it that way.

Should we add the check anyway, just in case we ever add the FIONBCLONE (or whatever it's called) ioctl for userspace processes to use?

In D51808#1183472, @asomers wrote:

In D51808#1183471, @rmacklem wrote:

In D51808#1183460, @asomers wrote:

Here's a stupid question. Is it possible for nfsd to reexport an NFS mountpoint? If so, then nfs_copy_file_range will need to check the flag, too.

Not a stupid question. Linux allows this and they have endless bugs to deal
with because of it. FreeBSD does not and I hope to keep it that way.

Should we add the check anyway, just in case we ever add the FIONBCLONE (or whatever it's called) ioctl for userspace processes to use?

My intent for the NFS client is to support COPY_FILE_RANGE_CLONE
for NFSv4.2 servers that support the CLONE operation.
But give me a day or two to come up with the NFS client/server
patches. They are a little work.

Ok. I'm sure that you'll get the NFS client part done. The rest of it looks fine to me.

This revision is now accepted and ready to land.Aug 7 2025, 10:03 PM

Closed by commit rG37b2cb5ecb0f: vfs: Add support for file cloning to VOP_COPY_FILE_RANGE (authored by rmacklem). · Explain WhyAug 8 2025, 12:59 AM

This revision was automatically updated to reflect the committed changes.

rmacklem added a commit: rG37b2cb5ecb0f: vfs: Add support for file cloning to VOP_COPY_FILE_RANGE.

The kernel had a ZFS patch that set _PC_CLONE_BLKSIZE correctly.

I am not sure what you mean with "correctly", but for a note, in ZFS different files might have different block sizes if created (grown beyond one block) during recordsize property was set to different values, or if the file is smaller than recordsize. Also the last block is predictably more flexible, but also with some limitations.

In D51808#1183580, @mav wrote:

The kernel had a ZFS patch that set _PC_CLONE_BLKSIZE correctly.

I am not sure what you mean with "correctly", but for a note, in ZFS different files might have different block sizes if created (grown beyond one block) during recordsize property was set to different values, or if the file is smaller than recordsize. Also the last block is predictably more flexible, but also with some limitations.

I just used vp->v_mount->mnt_stat.f_iosize which is 128K for
my test system. To be honest, since block cloning still works
no matter what the alignment, I could probably use any non-zero
value. I do have to send a fixed non-zero value to the NFSv4.2
clients, so they know what to use. Actually, RFC7862 doesn't seem
to say if the attribute is per-file or per-file-system, but I'd guess
it is meant to be the latter.
The RFC says the offsets must be block aligned, but it will be the
NFSv4.2 server code that will check that.

So, unless you think there is a better answer than
vp->v_mount->mnt_stat.f_iosize, I think that will work.
(I will take a look at the block cloning code called from
zfs_copy_file_range() to see what it uses?)

In D51808#1183592, @rmacklem wrote:

So, unless you think there is a better answer than
vp->v_mount->mnt_stat.f_iosize,

If it is per-file-system, then this is probably the best you can get.

I think that will work.

I think it may matter only if client decide to clone file in several parts/requests. It may fail if the file appear having its block size bigger than f_iosize. It may be not very often, but possible.

(I will take a look at the block cloning code called from
zfs_copy_file_range() to see what it uses?)

Take a look at the beginning of zfs_clone_range(). It has a number of different constraints.

In D51808#1183889, @mav wrote:

In D51808#1183592, @rmacklem wrote:

So, unless you think there is a better answer than
vp->v_mount->mnt_stat.f_iosize,

If it is per-file-system, then this is probably the best you can get.

I think that will work.

I think it may matter only if client decide to clone file in several parts/requests. It may fail if the file appear having its block size bigger than f_iosize. It may be not very often, but possible.

(I will take a look at the block cloning code called from
zfs_copy_file_range() to see what it uses?)

Take a look at the beginning of zfs_clone_range(). It has a number of different constraints.

I did and from what I can see, they are exactly the constraints
that RFC7862 specifies for the NFSv4.2 CLONE operation.
(I suspect this all came from the way Linux does it?)

The one thing I have found is that zfs_clone_range() uses
z_blksz, which varies depending on the file's size.
(It is 512 for small files and 128K for bigger ones, for my
setup. mnt_stat.f_iosize is 128K.)

I am trying to find out if the NFSv4.2 clone_blksize
attribute is considered per-file or per-file-system.

Oh, and I might have found a bug in zfs_copy_file_range().
(It doesn't appear to advance the offsets when presented
as pointers and is not using f_offsets. I'll investigate further.)

This is getting interesting, rick

In D51808#1184315, @rmacklem wrote:

In D51808#1183889, @mav wrote:

In D51808#1183592, @rmacklem wrote:

So, unless you think there is a better answer than
vp->v_mount->mnt_stat.f_iosize,

If it is per-file-system, then this is probably the best you can get.

I think that will work.

I think it may matter only if client decide to clone file in several parts/requests. It may fail if the file appear having its block size bigger than f_iosize. It may be not very often, but possible.

(I will take a look at the block cloning code called from
zfs_copy_file_range() to see what it uses?)

Take a look at the beginning of zfs_clone_range(). It has a number of different constraints.

I did and from what I can see, they are exactly the constraints
that RFC7862 specifies for the NFSv4.2 CLONE operation.
(I suspect this all came from the way Linux does it?)

The one thing I have found is that zfs_clone_range() uses
z_blksz, which varies depending on the file's size.
(It is 512 for small files and 128K for bigger ones, for my
setup. mnt_stat.f_iosize is 128K.)

I am trying to find out if the NFSv4.2 clone_blksize
attribute is considered per-file or per-file-system.

Oh, and I might have found a bug in zfs_copy_file_range().
(It doesn't appear to advance the offsets when presented
as pointers and is not using f_offsets. I'll investigate further.)

The bug is in vn_copy_file_range() and not in the ZFS function.
For the case where the offset arguments are being passed in,
it doesn't update them properly.
I'll work on it, but it is not a bug in the ZFS code.

rick

This is getting interesting, rick

In D51808#1184330, @rmacklem wrote:

In D51808#1184315, @rmacklem wrote:

In D51808#1183889, @mav wrote:

In D51808#1183592, @rmacklem wrote:

So, unless you think there is a better answer than
vp->v_mount->mnt_stat.f_iosize,

If it is per-file-system, then this is probably the best you can get.

I think that will work.

I think it may matter only if client decide to clone file in several parts/requests. It may fail if the file appear having its block size bigger than f_iosize. It may be not very often, but possible.

(I will take a look at the block cloning code called from
zfs_copy_file_range() to see what it uses?)

Take a look at the beginning of zfs_clone_range(). It has a number of different constraints.

I did and from what I can see, they are exactly the constraints
that RFC7862 specifies for the NFSv4.2 CLONE operation.
(I suspect this all came from the way Linux does it?)

The one thing I have found is that zfs_clone_range() uses
z_blksz, which varies depending on the file's size.
(It is 512 for small files and 128K for bigger ones, for my
setup. mnt_stat.f_iosize is 128K.)

I am trying to find out if the NFSv4.2 clone_blksize
attribute is considered per-file or per-file-system.

Oh, and I might have found a bug in zfs_copy_file_range().
(It doesn't appear to advance the offsets when presented
as pointers and is not using f_offsets. I'll investigate further.)

The bug is in vn_copy_file_range() and not in the ZFS function.
For the case where the offset arguments are being passed in,
it doesn't update them properly.
I'll work on it, but it is not a bug in the ZFS code.

rick

This is getting interesting, rick

Ok, so I fixed the bug, which was not in the ZFS code.
(It was in kern_copy_file_range() starting in May.)

As for what ZFS should return for _PC_CLONE_BLKSIZE,
it looks like there are three choices:
1 - z_blksz, which applies to that file and can range from

512bytes->16Mbytes, if I understand things.
Note that any files < recordsize in size will be a
single block and only block clonable if the entire
file is being cloned.

2 - f_iosize, which is the value of "recordsize" when the

mount was done. It can be changed, although I'm not
sure if the change to "recordsize" takes effect before
a remount?

3 - zfs_max_recordsize, which is currently 16Mbytes.

I think #3 is the safe answer for the NFS server, although
I can also see an argument for #2.
(Although RFC7862 is not clear on it, the Linux NFSv4.2
client seems to assume that the value applies to all files
in a file system.)

rick