Page MenuHomeFreeBSD

fix copy_file_range(2) so that it doesn't allocate unneeded data blocks of all 0s in the output file
ClosedPublic

Authored by rmacklem on Aug 4 2019, 1:07 AM.

Details

Summary

When the byte range for copy_file_range(2) doesn't go to EOF on the
output file and there is a hole in the input file, a hole must be
"punched" in the output file. This is done by writing a block of bytes
all set to 0.
Without this patch, the write is done unconditionally which meant that,
if the output file already had a hole in that byte range, a unneeded data block
of all 0 bytes would be allocated.
This patch adds code to check for a hole in the output file, so it can
skip doing the write if there is already a hole in that byte range of
the output file. This avoids unnecessary allocation of blocks to the
output file.

Test Plan

Running the same tests I've already developed for copy_file_range(2) and
checking block allocations for the input and output files via "ls -ls".

Diff Detail

Repository
rS FreeBSD src repository
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

rmacklem created this revision.Aug 4 2019, 1:07 AM
kib added inline comments.Aug 4 2019, 12:28 PM
kern/vfs_vnops.c
2799 ↗(On Diff #60435)

Am I right that the rangelock guarantees that the hole is not instantiated into real content by other thread ? What about calls from NFS server, where range lock is not taken ?

2809 ↗(On Diff #60435)

Should this be KASSERT instead ?

rmacklem marked 2 inline comments as done.Aug 4 2019, 3:16 PM

Replied to kib@'s comments inline.

kern/vfs_vnops.c
2799 ↗(On Diff #60435)

Yes. And since the NFSv4.2 server takes the rangelocks as well, the NFSv4.2
Copy will be ok as well.

Slightly off topic, but I have been thinking that the NFS server should take
rangelocks for read/write as well,, but haven't coded it yet.
Do you think it should?

2809 ↗(On Diff #60435)

Good point. I was thinking that it didn't need to panic(), since copy_file_range(2)
still works for this failure case.
However, since I put it in INVARIANTS, it won't be used for production kernels
and a panic() will get people's attention.

Do you think a KASSERT()/panic() is better?

rmacklem updated this revision to Diff 60448.Aug 4 2019, 4:32 PM
rmacklem marked 2 inline comments as done.

This version of the patch has the INVARIANTS printf() replaced by a KASSERT().
It also has KASSERT()s added for the results of FIOSEEKDATA/FIOSEEKHOLE.

The main change is that it now does a sanity check for dataoff == holeoff,
which I believe could happen if another thread were to do a truncate()
followed by a write() between the two VOP_IOCTL() calls.
(As noted in the comment for it.)

kib added inline comments.Aug 4 2019, 8:19 PM
kern/vfs_vnops.c
2799 ↗(On Diff #60435)

Rangelocks only make sense when the locked ranges correspond to the user-initiated reads and writes. I have no idea about NFSv4.x, but for plain old v3 server I am sure that NFS client' initiated RPCs for reads and writes typically do not match user operations. Instead, they are driven by the buffer caching and clustering. In other words, for NFS server serving the same file to several clients, rangelocking is probably useless.

Rangelocking on NFS server might be somewhat useful when the remotely accessed file is also accessed by the local userspace. In this case, I think, local operations would see more consistent results.

Replied to inline comment.

kern/vfs_vnops.c
2799 ↗(On Diff #60435)

Yes, agreed with all of that.

The reason I was thinking of adding it is that the NFSv4.2 Copy operation will loop on
read/write locking/unlocking the vnodes, so a rangelock ensures that the entire
range gets copied without corruption from another Copy or local write.
However, a concurrent NFS write would corrupt the result unless the NFS write
does a rangelock as well. (Doing them for read/write in the NFS server is a bit of a
bother, since it requires the vnode be unlocked/relocked, but I don't think that's a
big deal.)

I added the "dataoff == holeoff" check to try and at least make sure no crashes
can occur and the worst case is a file that is a mix of Copy data and Write data.
(I'll check the vn_copy_file_range() code more carefully, to see if I can find any other
cases where this might cause a crash/loop.)

rmacklem updated this revision to Diff 60455.Aug 4 2019, 10:10 PM

This version of the patch adds an additional sanity check for
the results from VOP_IOCTL(). This change can only have an effect
for a non-INVARIANTS kernel where one of the VOP_IOCTL()s
return a bogus offset.
It ensures a reasonable positive value for xfer2 is returned for this case.

rmacklem updated this revision to Diff 60603.Aug 8 2019, 10:56 PM

Only change is a modified comment explaining how FIOSEEKHOLE might
find the same offset as FIOSEEKDATA already has returned for the file.

kib accepted this revision.Aug 15 2019, 3:47 PM
This revision is now accepted and ready to land.Aug 15 2019, 3:47 PM