Details

Reviewers

kib
jhb
asomers
brooks

Group Reviewers

manpages

Commits

rS350319: Create a man page for VOP_COPY_FILE_RANGE(9).
rS350318: Add an entry for copy_file_range(2) to unistd.h.
rS350317: Add libc support for the copy_file_range(2) syscall added by r350315.
rS350315: Add kernel support for a Linux compatible copy_file_range(2) syscall.

Summary

NFSv4.2 supports server side copying of byte ranges within a file. To support this, a new system call and VOP_xxx()
is required. (For file systems that don't support VOP_COPY_FILE_RANGE(), it simply does the copy in a loop using
the input file's preferred I/O size. This saves system calls compared with a copy done in userland, but I have no
idea if there is a measurable performance difference.)
This patch implements the copy_file_range(2) syscall and VOP_COPY_FILE_RANGE() to do this.
It is intended to be compatible with the Linux syscall of the same name.

It was written from the Linux man page. There are a couple of oddities in the Linux syscall, such as no way to
specify "copy to EOF" and requires the syscall to fail with EINVAL if the file offset plus len exceeds the file size
of the file being copied from.

Since VOP_IOCTL() must be called with the vnode unlocked, the code simply looks for "holes" by scanning the
entire read block to see if it all 0s, instead of using VOP_IOCTL() to try and find the holes.
(I should probably note what the Linux man page mentions w.r.t. using lseek(SEEK_HOLE) and lseek(SEEK_DATA)
for copying sparse files.)

One thing I am not sure of is if it will work for stacked file systems.

I strongly suspect there will be several iterations of this patch, but this is a first draft.

Test Plan

I have been testing it for normal and sparse files on UFS.
Hopefully someone else will do testing for ZFS.

At some point I will add/test use of it for the NFSv4.2 client/server in projects/nfsv42.

Diff Detail

Repository

rS FreeBSD src repository - subversion

Lint

Lint Skipped

Unit

Tests Skipped

Build Status

Buildable 24949

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

This variant of the patch has the rangelock code moved from vn_copy_file_range() to
kern_copy_file_range(). The reason I did this is that the NFSv4.2 server needs to lock
the file ranges slightly differently (for the "to EOF" case) and I wanted it to be able
to call vn_copy_file_range() to do most of the work.

The semantics for the syscall should not have changed from the previous version of the patch.

Harbormaster completed remote builds in B25136: Diff 59232.Jul 1 2019, 2:00 AM

kib added inline comments.Jul 1 2019, 9:19 PM

share/man/man9/VOP_COPY_FILE_RANGE.9
38 ↗	(On Diff #59232)	Use .Fo to avoid such long line.
sys/kern/vfs_subr.c
5675 ↗	(On Diff #59232)	The cast is excessive.

More fixes. When testing using a file with a large (> 4Gbyte) hole in it, a couple
of bugs needed to be fixed.

If the "len" argument is > SSIZE_MAX, it can't be returned --> clip "len" at SSIZE_MAX
If the hole size is greater than what can be stored in size_t… --> change xfer, xfer2 to off_t
If the hole extended past the end of the range being copied, xfer was set too large and would cause the len variable to wrap around. --> clip xfer at len after being calculated as the hole size.

I finally think the patch is ready for another review.

Harbormaster completed remote builds in B25170: Diff 59297.Jul 2 2019, 1:44 AM

This version of the patch incorporates the changes recommended by kib@.
I have also added a couple of sentences related to copying sparse files to copy_file_range.2.

Harbormaster completed remote builds in B25172: Diff 59299.Jul 2 2019, 2:41 AM

Made a minor change to VOP_COPY_FILE_RANGE.9 as recommended by asomers@.

Harbormaster completed remote builds in B25173: Diff 59300.Jul 2 2019, 2:53 AM

Marked inline comments as done.

One more bug found while proofreading the code.

It erroneously used inoffp instead of outoffp when calculating the size of hole to write. My main test program copies byte ranges of a file and then compares the file, so inoff and outoff are the same. I did run a small test program copying from/to different offsets, but the file had data so I could easily check if it worked.

This version of the patch has this fixed. It also has a minor optimization, where it uses vn_truncate_locked() to grow the
output file when a hole is at the end of the range and the hole is entirely beyond the original EOF for the output file
instead of writing 0 bytes to grow the file.

I think this optimization actually makes the code less confusing.

Harbormaster completed remote builds in B25195: Diff 59371.Jul 3 2019, 9:14 PM

jilles added a subscriber: jilles.Jul 5 2019, 1:45 PM

jilles added inline comments.

lib/libc/sys/copy_file_range.2
100	The Linux man page (from http://man7.org/linux/man-pages/man2/copy_file_range.2.html ) says that a non-zero flags argument will cause the call to return an [EINVAL] error. I think that is better than ignoring the argument completely since it allows adding flags more safely (since there will not be existing applications that pass in, for example, uninitialized data as flags).

Taken comment w.r.t. handling of flags argument over to freebsd-current@, since more people will see it there.

lib/libc/sys/copy_file_range.2
100	I'll take this over to FreeBSD-current@, since it is a "big picture" discussion. (If I knew how to put FreeBSD-current@ or FreeBSD-fs@ in the list(s) that get these comments, life would be simpler;-)

This version of the patch checks for flags == 0 and returns EINVAL when not 0.
It also documents "must be 0" in the man page. This change seemed to be the
consensus from a discussion on FreeBSD-current@.

The discussion w.r.t. "if/when should signals interrupt the syscall" is ongoing.
However, it does seem that when the syscall returns EINTR/ERESTART, the file
offsets should be reset to the original values for the call, so that the syscall can
be redone.
This patch saves/restores the offsets for this case.
The only time this version of the patch will return EINTR/ERESTART is when a VOP_xxx()
call returns one of these. (Probably an NFS mount with "intr" option.)

Harbormaster completed remote builds in B25216: Diff 59456.Jul 6 2019, 12:26 AM

Marked "check for flags == 0" as done on the inline comment.

One minor additional fix. For the case where the "len" argument is 0, the Linux
syscall returns 0. The previous patch did that, but it did 0 length (ie. start and end equal)
range locks, which the range lock code doesn't handle correctly.

The change is to just skip over range locking and the copy code when "len == 0".
This also makes it clear what the behaviour is for this case.

Harbormaster completed remote builds in B25335: Diff 59733.Jul 13 2019, 11:40 PM

vn_copy_file_range() erroneously returned the error with *lenp set to the requested
length instead of 0 (the number of bytes copied).
This version of the patch has this fixed and a couple of extraneous brackets removed.

Found while testing the NFSv4.2 code that uses vn_copy_file_range() in the NFS server.

Harbormaster completed remote builds in B25395: Diff 59881.Jul 18 2019, 4:05 PM

asomers added inline comments.Jul 21 2019, 7:41 PM

lib/libc/sys/copy_file_range.2
29	Don't forget to bump the man page date whenever you actually commit this.
107	SEEK_HOLE and SEEK_DATA should be marked up with `.Dv`.
share/man/man9/VOP_COPY_FILE_RANGE.9
28 ↗	(On Diff #59881)	Also update this date when you commit.
74 ↗	(On Diff #59881)	NULL should use `.Dv`.
75 ↗	(On Diff #59881)	`curthread` should also use `.Dv`
94 ↗	(On Diff #59881)	"is done" is ambiguous. Does that refer to before or after? Better to use the terminology used by other VOP man pages: "on entry" and "on return".
sys/kern/vfs_subr.c
5664 ↗	(On Diff #59881)	s/addr/dat/
5691 ↗	(On Diff #59881)	Should this be "a minimum of blksize in length"?
5719 ↗	(On Diff #59881)	This would be more readable as `xfer2 = MIN(xfer, blksize)`.
5765 ↗	(On Diff #59881)	According to the man page, `holein` should be an `int`, not a `long`. But that looks like a bug in the man page.
5797 ↗	(On Diff #59881)	I'm not following this logic. This says that if the current EOF lies in the range of bytes to be written, then first truncate the file down to the beginning of the range. Why? It seems that it would be useful, OTOH, to truncate the file up to the beginning of the range, if the current EOF lies below the start of the range.
5802 ↗	(On Diff #59881)	The `#endif` can be moved below the `error == 0` check.
5823 ↗	(On Diff #59881)	This would be more readable as blksize = MAX(holein, holeout); if (blksize == 0) blksize = MAX(invp->v_mount->mnt_stat.f_iosize, outvp->v_mount->mnt_stat.f_iosize)
5835 ↗	(On Diff #59881)	What's the significance of 1048576?
5875 ↗	(On Diff #59881)	Again, this could be `xfer = MIN(startoff - *inoffp, len)`. Or is there a reason you don't use the `MIN` and `MAX` macros?
5880 ↗	(On Diff #59881)	Another good use for `MIN`
5902 ↗	(On Diff #59881)	Use `MIN` here.
sys/kern/vfs_vnops.c
2532	The `invp == outvp` error case should return `EBADF`, not `EINVAL`.

kib added inline comments.Jul 21 2019, 10:55 PM

sys/kern/vfs_syscalls.c
4846	You need a bwillwrite() call somewhere. Due to the nature of the operation, I suspect that you need to insert the calls to bwillwrite() somewhere inside the copy loop, at the moment where no resources like vnode locks or started writes are held.

The previous version of the patch had a single bwillwrite() call before the copy
done by the VOP call or vn_generic_copy_file_range().
kib@ pointed out that bwillwrite() calls are needed in the copy loop, which makes
sense, since the copy loop could be writing a large amount of data and constipate
the buffer cache for file systems using it.

This version of the patch replaces the single bwillwrite() call with one done in
vn_write_outvp() for each blksize block written. The loop on blksize writes in
vn_write_outvp() was reorganized so that locking/unlocking was done inside the
loop so that bwillwrite() could be done in the loop. I did this because vn_write_outvp()
could be writing a very large "xfer" size if a large hole is being punched in the output file.

Harbormaster completed remote builds in B25450: Diff 60004.Jul 22 2019, 1:58 AM

This version incorporates changes suggested by asomers@.

Harbormaster completed remote builds in B25451: Diff 60007.Jul 22 2019, 3:27 AM

asomers added inline comments.Jul 22 2019, 3:43 AM

share/man/man9/VOP_COPY_FILE_RANGE.9
82 ↗	(On Diff #60007)	This line should probably read "on return", too.

Replied and checked done to inline comments.

sys/kern/vfs_subr.c
5691 ↗	(On Diff #59881)	Nope. The call is often done with xfer < blksize. blksize is simply the size of the "dat" argument.
5765 ↗	(On Diff #59881)	Never trust a man page. Always look at the code for the real documentation.;-)
5797 ↗	(On Diff #59881)	For a simple example, lets say the output file is 15bytes in size and the input file has a hole from bytes 10-20. The code must "punch" a hole for bytes 10->14, since they might be non-zero. This can be done via a write (allocating data to "fake" the hole) or truncate the output file to 10bytes and then the copy can skip bytes 10-20 and write the data after 20, creating a hole on the output file. (Yes, unless the sizes are block sizes, there won't really be a hole, but I just picked simple numbers for the example.) It is done here, so that copying past the output file's new EOF can be done by skipping holes instead of writing 0 bytes to them. The file is vn_truncate_locked()'d to grow it, if needed. (This only happens if there is a hole at the end of the range being copied and the output file needs to be grown.) It happens later in the copy loop when it gets to the end.
5835 ↗	(On Diff #59881)	Not much. I don't know of a "this is the biggest file system block size for all file systems" constant, so I chose 1Mbyte as a reasonable upper bound. It's meant as a sanity check in case some file system sets hole size or f_iosize to an absurdly large value. blksize is how much is malloc()d. I didn't want it too large. I would be comfortable with anything that is small enough to safely malloc() and I suppose making it larger would handle the day when some file system uses > 1Mbyte block sizes. 10Mbytes? 100Mbytes?
5875 ↗	(On Diff #59881)	Well, there are a couple of reasons I didn't use them, but as you can see, I have changed the code to use them. The reasons (not that significant) are: 1 - I fear that someday someone might replace MAX() with max(), because they don't like macros (or type cast the arguments in MAX()). This might break this code, since some variables are off_t (always int64_t I think) and some are size_t (uint32_t or uint64_t). 2 - When I write: x = y - z; if (x > X) x = X; I read that as "I want x to be y - z", but then I want to clip it at X. It's just the way I think when I code these things. For x = MIN(y - z, X), I read this as "take the lesser of y - z and X and then need to figure out why that is the case. But that's just me. I agree that for the "blksize" case, the MAX() macros make the code more readable for me, too. One minor nit here...you never say "in my opinion" for any of these comments and, obviously "more readable" is in the eye of the beholder.
sys/kern/vfs_syscalls.c
4846	This variant of the patch does a bwillwrite() before every "blksize" write, so hopefully this will avoid a large copy from flooding the buffer cache.
sys/kern/vfs_vnops.c
2532	The syscall actually got this right and I don't even recall what the correct NFSv4.2 error is for the NFS server to return. But it should be and now is fixed.

asomers accepted this revision.Jul 22 2019, 4:05 AM

asomers added inline comments.

sys/kern/vfs_subr.c
5797 ↗	(On Diff #59881)	Ok, I get it. Simply writing zeros would create a dense output file. To create a sparse output file, it's necessary to truncate down first.

This revision is now accepted and ready to land.Jul 22 2019, 4:05 AM

I think that compat32 wrappers are required. I promise to implement that myself after you commit the implementation.

sys/kern/vfs_subr.c
5748 ↗	(On Diff #60007)	Proper place for vn_ functions is vfs_vnops.c.
5835 ↗	(On Diff #60007)	I would write this as 1024 * 1024.
5940 ↗	(On Diff #60007)	Excess () around cantseek.
sys/kern/vnode_if.src
732	Why do you need separate credentials for in/out ?

This updates incorporates changes suggested by kib@.
Replace 1048576 with 1024 * 1024 for readability and move
vn_generic_copy_file_range() and helper functions from
vfs_subr.c to vfs_vnops.c.

This revision now requires review to proceed.Jul 22 2019, 3:19 PM

Harbormaster completed remote builds in B25454: Diff 60021.Jul 22 2019, 3:19 PM

This version incorporates a change to VOP_COPY_FILE_RANGE.9 suggested by asomers@.

Harbormaster completed remote builds in B25455: Diff 60022.Jul 22 2019, 3:50 PM

Marked done and replied to inline comments.

sys/kern/vnode_if.src
732	Well, I think a process could... infd = open(); setuid(1010); /* or setgid() or... */ outfd = open(); ret = copy_file_range(infd, NULL, outfd, NULL, 1000, 0); such that the in and out credentials are different. Now, it happens that NFS can only use one credential, but that's its problem. My current code for NFS tries one and then, if it fails with EACCES, it tries the other. If both attempts fail, it drops back to vn_generic_copy_file_range(), which will use the different credentials for the reads vs writes.

kib accepted this revision.Jul 22 2019, 4:04 PM

This revision is now accepted and ready to land.Jul 22 2019, 4:04 PM

Closed by commit rS350315: Add kernel support for a Linux compatible copy_file_range(2) syscall. (authored by rmacklem). · Explain WhyJul 25 2019, 5:46 AM

This revision was automatically updated to reflect the committed changes.

rmacklem marked an inline comment as done.

rmacklem added a commit: rS350315: Add kernel support for a Linux compatible copy_file_range(2) syscall..

rmacklem added a commit: rS350317: Add libc support for the copy_file_range(2) syscall added by r350315..Jul 25 2019, 6:06 AM

rmacklem added a commit: rS350318: Add an entry for copy_file_range(2) to unistd.h..Jul 25 2019, 6:11 AM

rmacklem added a commit: rS350319: Create a man page for VOP_COPY_FILE_RANGE(9)..Jul 25 2019, 6:20 AM

add a linux compatible copy_file_range(2) syscall
ClosedPublic
Actions

Details

Diff Detail

Event Timeline

Revision Contents
Changeset List

Diff 58815

include/unistd.h

lib/libc/sys/Makefile.inc

lib/libc/sys/Symbol.map

lib/libc/sys/copy_file_range.2

sys/kern/syscalls.master

sys/kern/vfs_default.c

sys/kern/vfs_syscalls.c

sys/kern/vfs_vnops.c

sys/kern/vnode_if.src

sys/sys/syscallsubr.h

sys/sys/vnode.h

add a linux compatible copy_file_range(2) syscallClosedPublicActions

Details

Diff Detail

Event Timeline

Revision ContentsChangeset List

Diff 58815

include/unistd.h

lib/libc/sys/Makefile.inc

lib/libc/sys/Symbol.map

lib/libc/sys/copy_file_range.2

sys/kern/syscalls.master

sys/kern/vfs_default.c

sys/kern/vfs_syscalls.c

sys/kern/vfs_vnops.c

sys/kern/vnode_if.src

sys/sys/syscallsubr.h

sys/sys/vnode.h

add a linux compatible copy_file_range(2) syscall
ClosedPublic
Actions

Revision Contents
Changeset List