Page MenuHomeFreeBSD

add a linux compatible copy_file_range(2) syscall
ClosedPublic

Authored by rmacklem on Jun 10 2019, 1:20 AM.
Tags
None
Referenced Files
Unknown Object (File)
Sun, Mar 10, 12:31 PM
Unknown Object (File)
Wed, Mar 6, 9:03 AM
Unknown Object (File)
Wed, Mar 6, 5:25 AM
Unknown Object (File)
Wed, Feb 28, 6:46 PM
Unknown Object (File)
Feb 25 2024, 11:28 AM
Unknown Object (File)
Feb 22 2024, 8:05 AM
Unknown Object (File)
Feb 22 2024, 8:05 AM
Unknown Object (File)
Feb 22 2024, 8:05 AM
Subscribers

Details

Summary

NFSv4.2 supports server side copying of byte ranges within a file. To support this, a new system call and VOP_xxx()
is required. (For file systems that don't support VOP_COPY_FILE_RANGE(), it simply does the copy in a loop using
the input file's preferred I/O size. This saves system calls compared with a copy done in userland, but I have no
idea if there is a measurable performance difference.)
This patch implements the copy_file_range(2) syscall and VOP_COPY_FILE_RANGE() to do this.
It is intended to be compatible with the Linux syscall of the same name.

It was written from the Linux man page. There are a couple of oddities in the Linux syscall, such as no way to
specify "copy to EOF" and requires the syscall to fail with EINVAL if the file offset plus len exceeds the file size
of the file being copied from.

Since VOP_IOCTL() must be called with the vnode unlocked, the code simply looks for "holes" by scanning the
entire read block to see if it all 0s, instead of using VOP_IOCTL() to try and find the holes.
(I should probably note what the Linux man page mentions w.r.t. using lseek(SEEK_HOLE) and lseek(SEEK_DATA)
for copying sparse files.)

One thing I am not sure of is if it will work for stacked file systems.

I strongly suspect there will be several iterations of this patch, but this is a first draft.

Test Plan

I have been testing it for normal and sparse files on UFS.
Hopefully someone else will do testing for ZFS.

At some point I will add/test use of it for the NFSv4.2 client/server in projects/nfsv42.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Skipped
Unit
Tests Skipped
Build Status
Buildable 24949

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

This variant of the patch has the rangelock code moved from vn_copy_file_range() to
kern_copy_file_range(). The reason I did this is that the NFSv4.2 server needs to lock
the file ranges slightly differently (for the "to EOF" case) and I wanted it to be able
to call vn_copy_file_range() to do most of the work.

The semantics for the syscall should not have changed from the previous version of the patch.

share/man/man9/VOP_COPY_FILE_RANGE.9
38 ↗(On Diff #59232)

Use .Fo to avoid such long line.

sys/kern/vfs_subr.c
5675 ↗(On Diff #59232)

The cast is excessive.

More fixes. When testing using a file with a large (> 4Gbyte) hole in it, a couple
of bugs needed to be fixed.

  • If the "len" argument is > SSIZE_MAX, it can't be returned --> clip "len" at SSIZE_MAX
  • If the hole size is greater than what can be stored in size_tâ€Ĥ --> change xfer, xfer2 to off_t
  • If the hole extended past the end of the range being copied, xfer was set too large and would cause the len variable to wrap around. --> clip xfer at len after being calculated as the hole size.

I finally think the patch is ready for another review.

This version of the patch incorporates the changes recommended by kib@.
I have also added a couple of sentences related to copying sparse files to copy_file_range.2.

rmacklem marked 3 inline comments as done.

Made a minor change to VOP_COPY_FILE_RANGE.9 as recommended by asomers@.

Marked inline comments as done.

One more bug found while proofreading the code.

  • It erroneously used inoffp instead of outoffp when calculating the size of hole to write. My main test program copies byte ranges of a file and then compares the file, so inoff and outoff are the same. I did run a small test program copying from/to different offsets, but the file had data so I could easily check if it worked.

This version of the patch has this fixed. It also has a minor optimization, where it uses vn_truncate_locked() to grow the
output file when a hole is at the end of the range and the hole is entirely beyond the original EOF for the output file
instead of writing 0 bytes to grow the file.

I think this optimization actually makes the code less confusing.

jilles added inline comments.
lib/libc/sys/copy_file_range.2
100

The Linux man page (from http://man7.org/linux/man-pages/man2/copy_file_range.2.html ) says that a non-zero flags argument will cause the call to return an [EINVAL] error. I think that is better than ignoring the argument completely since it allows adding flags more safely (since there will not be existing applications that pass in, for example, uninitialized data as flags).

Taken comment w.r.t. handling of flags argument over to freebsd-current@, since more people will see it there.

lib/libc/sys/copy_file_range.2
100

I'll take this over to FreeBSD-current@, since it is a "big picture" discussion.
(If I knew how to put FreeBSD-current@ or FreeBSD-fs@ in the list(s) that
get these comments, life would be simpler;-)

This version of the patch checks for flags == 0 and returns EINVAL when not 0.
It also documents "must be 0" in the man page. This change seemed to be the
consensus from a discussion on FreeBSD-current@.

The discussion w.r.t. "if/when should signals interrupt the syscall" is ongoing.
However, it does seem that when the syscall returns EINTR/ERESTART, the file
offsets should be reset to the original values for the call, so that the syscall can
be redone.
This patch saves/restores the offsets for this case.
The only time this version of the patch will return EINTR/ERESTART is when a VOP_xxx()
call returns one of these. (Probably an NFS mount with "intr" option.)

Marked "check for flags == 0" as done on the inline comment.

One minor additional fix. For the case where the "len" argument is 0, the Linux
syscall returns 0. The previous patch did that, but it did 0 length (ie. start and end equal)
range locks, which the range lock code doesn't handle correctly.

The change is to just skip over range locking and the copy code when "len == 0".
This also makes it clear what the behaviour is for this case.

vn_copy_file_range() erroneously returned the error with *lenp set to the requested
length instead of 0 (the number of bytes copied).
This version of the patch has this fixed and a couple of extraneous brackets removed.

Found while testing the NFSv4.2 code that uses vn_copy_file_range() in the NFS server.

lib/libc/sys/copy_file_range.2
29

Don't forget to bump the man page date whenever you actually commit this.

107

SEEK_HOLE and SEEK_DATA should be marked up with .Dv.

share/man/man9/VOP_COPY_FILE_RANGE.9
28 ↗(On Diff #59881)

Also update this date when you commit.

74 ↗(On Diff #59881)

NULL should use .Dv.

75 ↗(On Diff #59881)

curthread should also use .Dv

94 ↗(On Diff #59881)

"is done" is ambiguous. Does that refer to before or after? Better to use the terminology used by other VOP man pages: "on entry" and "on return".

sys/kern/vfs_subr.c
5664 ↗(On Diff #59881)

s/addr/dat/

5691 ↗(On Diff #59881)

Should this be "a minimum of blksize in length"?

5719 ↗(On Diff #59881)

This would be more readable as xfer2 = MIN(xfer, blksize).

5765 ↗(On Diff #59881)

According to the man page, holein should be an int, not a long. But that looks like a bug in the man page.

5797 ↗(On Diff #59881)

I'm not following this logic. This says that if the current EOF lies in the range of bytes to be written, then first truncate the file down to the beginning of the range. Why? It seems that it would be useful, OTOH, to truncate the file up to the beginning of the range, if the current EOF lies below the start of the range.

5802 ↗(On Diff #59881)

The #endif can be moved below the error == 0 check.

5823 ↗(On Diff #59881)

This would be more readable as

blksize = MAX(holein, holeout);
if (blksize == 0)
        blksize = MAX(invp->v_mount->mnt_stat.f_iosize, outvp->v_mount->mnt_stat.f_iosize)
5835 ↗(On Diff #59881)

What's the significance of 1048576?

5875 ↗(On Diff #59881)

Again, this could be xfer = MIN(startoff - *inoffp, len). Or is there a reason you don't use the MIN and MAX macros?

5880 ↗(On Diff #59881)

Another good use for MIN

5902 ↗(On Diff #59881)

Use MIN here.

sys/kern/vfs_vnops.c
2532

The invp == outvp error case should return EBADF, not EINVAL.

sys/kern/vfs_syscalls.c
4846

You need a bwillwrite() call somewhere.
Due to the nature of the operation, I suspect that you need to insert the calls to bwillwrite() somewhere inside the copy loop, at the moment where no resources like vnode locks or started writes are held.

The previous version of the patch had a single bwillwrite() call before the copy
done by the VOP call or vn_generic_copy_file_range().
kib@ pointed out that bwillwrite() calls are needed in the copy loop, which makes
sense, since the copy loop could be writing a large amount of data and constipate
the buffer cache for file systems using it.

This version of the patch replaces the single bwillwrite() call with one done in
vn_write_outvp() for each blksize block written. The loop on blksize writes in
vn_write_outvp() was reorganized so that locking/unlocking was done inside the
loop so that bwillwrite() could be done in the loop. I did this because vn_write_outvp()
could be writing a very large "xfer" size if a large hole is being punched in the output file.

This version incorporates changes suggested by asomers@.

share/man/man9/VOP_COPY_FILE_RANGE.9
82 ↗(On Diff #60007)

This line should probably read "on return", too.

Replied and checked done to inline comments.

sys/kern/vfs_subr.c
5691 ↗(On Diff #59881)

Nope. The call is often done with xfer < blksize. blksize is simply the size of
the "dat" argument.

5765 ↗(On Diff #59881)

Never trust a man page. Always look at the code for the real documentation.;-)

5797 ↗(On Diff #59881)

For a simple example, lets say the output file is 15bytes in size and the input
file has a hole from bytes 10-20.
The code must "punch" a hole for bytes 10->14, since they might be non-zero.
This can be done via a write (allocating data to "fake" the hole) or
truncate the output file to 10bytes and then the copy can skip bytes 10-20 and
write the data after 20, creating a hole on the output file.
(Yes, unless the sizes are block sizes, there won't really be a hole, but I just picked
simple numbers for the example.) It is done here, so that copying past the output
file's new EOF can be done by skipping holes instead of writing 0 bytes to them.

The file is vn_truncate_locked()'d to grow it, if needed. (This only happens if there
is a hole at the end of the range being copied and the output file needs to be grown.)
It happens later in the copy loop when it gets to the end.

5835 ↗(On Diff #59881)

Not much. I don't know of a "this is the biggest file system block size for all file systems"
constant, so I chose 1Mbyte as a reasonable upper bound.
It's meant as a sanity check in case some file system sets hole size or f_iosize
to an absurdly large value.

blksize is how much is malloc()d. I didn't want it too large.

I would be comfortable with anything that is small enough to safely malloc() and
I suppose making it larger would handle the day when some file system uses > 1Mbyte
block sizes.

10Mbytes? 100Mbytes?

5875 ↗(On Diff #59881)

Well, there are a couple of reasons I didn't use them, but as you can see, I have changed
the code to use them. The reasons (not that significant) are:
1 - I fear that someday someone might replace MAX() with max(), because they don't like

macros (or type cast the arguments in MAX()). This might break this code, since some
variables are off_t (always int64_t I think) and some are size_t (uint32_t or uint64_t).

2 - When I write:

   x = y - z;
   if (x > X)
      x = X;
I read that as "I want x to be y - z", but then I want to clip it at X.
It's just the way I think when I code these things.
For x = MIN(y - z, X), I read this as "take the lesser of y - z and X and then need
to figure out why that is the case. But that's just me.

I agree that for the "blksize" case, the MAX() macros make the code more readable for me,
too.

One minor nit here...you never say "in my opinion" for any of these comments and,
obviously "more readable" is in the eye of the beholder.

sys/kern/vfs_syscalls.c
4846

This variant of the patch does a bwillwrite() before every "blksize" write, so hopefully
this will avoid a large copy from flooding the buffer cache.

sys/kern/vfs_vnops.c
2532

The syscall actually got this right and I don't even recall what the correct NFSv4.2
error is for the NFS server to return.
But it should be and now is fixed.

asomers added inline comments.
sys/kern/vfs_subr.c
5797 ↗(On Diff #59881)

Ok, I get it. Simply writing zeros would create a dense output file. To create a sparse output file, it's necessary to truncate down first.

This revision is now accepted and ready to land.Jul 22 2019, 4:05 AM

I think that compat32 wrappers are required. I promise to implement that myself after you commit the implementation.

sys/kern/vfs_subr.c
5748 ↗(On Diff #60007)

Proper place for vn_ functions is vfs_vnops.c.

5835 ↗(On Diff #60007)

I would write this as 1024 * 1024.

5940 ↗(On Diff #60007)

Excess () around cantseek.

sys/kern/vnode_if.src
732

Why do you need separate credentials for in/out ?

rmacklem marked 3 inline comments as done.

This updates incorporates changes suggested by kib@.
Replace 1048576 with 1024 * 1024 for readability and move
vn_generic_copy_file_range() and helper functions from
vfs_subr.c to vfs_vnops.c.

This revision now requires review to proceed.Jul 22 2019, 3:19 PM

This version incorporates a change to VOP_COPY_FILE_RANGE.9 suggested by asomers@.

Marked done and replied to inline comments.

sys/kern/vnode_if.src
732

Well, I think a process could...

infd = open();
setuid(1010);   /* or setgid() or... */
outfd = open();
ret = copy_file_range(infd, NULL, outfd, NULL, 1000, 0);

such that the in and out credentials are different.

Now, it happens that NFS can only use one credential, but that's its problem.
My current code for NFS tries one and then, if it fails with EACCES, it tries the other.
If both attempts fail, it drops back to vn_generic_copy_file_range(), which will use
the different credentials for the reads vs writes.

This revision is now accepted and ready to land.Jul 22 2019, 4:04 PM
This revision was automatically updated to reflect the committed changes.
rmacklem marked an inline comment as done.