Details

Reviewers

rmacklem
jrtc27
imp
kevans
mav
asomers

Summary

When using copy_file_range() on ZFS, pathconf(..., _PC_MIN_HOLE_SIZE)
returns 512 bytes, as this is the smallest possible granularity of
holes in ZFS due to dynamic record sizes, and the expectation that
all hole locations will be aligned to this size.

copy_file_range() enforced a minimum block size of 4096 bytes, but
this resulted in much worse performance than copying using a more
reasonable block size, such as the recordsize.

This was reported as up to a 70% performance regression in cp(1)
compared to 12.x which did not have copy_file_range() and used
a userspace buffer of MAXPHYS * 8 (1 MB in most cases).

If the minimum hole size is less than or equal to 512 bytes, we
instead opt to use the block size reported by the filesystem, which
in the ZFS case will be the recordsize of the dataset.

While here, replace the static upper bound on the copy_file_range()
block size of 1 MB with maxphys, so it can be larger when the system
is configured for such.

Diff Detail

Repository

rS FreeBSD src repository - subversion

Lint

Lint Passed

Unit

No Test Coverage

Build Status

Buildable 46936
Build 43825: arc lint + arc unit

Event Timeline

allanjude created this revision.Aug 13 2022, 8:21 PM

Herald added a subscriber: imp. · View Herald TranscriptAug 13 2022, 8:21 PM

allanjude requested review of this revision.Aug 13 2022, 8:21 PM

Harbormaster completed remote builds in B46926: Diff 109302.Aug 13 2022, 8:21 PM

allanjude added reviewers: rmacklem, jrtc27, imp, kevans, mav.Aug 13 2022, 8:22 PM

Fix missing ;

Harbormaster completed remote builds in B46927: Diff 109303.Aug 13 2022, 8:24 PM

Set blksize to match xfer in the maxphys case, because vn_write_outvp() copies in blksize chunks

Harbormaster completed remote builds in B46928: Diff 109304.Aug 13 2022, 8:57 PM

With the previous change, we don't need the condition when resetting xfer at the end of the loop

Harbormaster completed remote builds in B46929: Diff 109307.Aug 13 2022, 10:26 PM

This basically looks ok to me. However, I have been recently playing with
sparse files on ZFS and I don't seem to be able to create an imbedded
hole (one not to EOF) which is less than 128K and I seem to be able to
create a hole of just about any size at EOF (truncate grows file size).

As such, the 512 seems bogus to me and I wonder if ZFS should be
reporting 182K instead?

I've also noticed significant delay when doing SEEK_HOLE/SEEK_DATA
on a ZFS store (compile time for a kernel on an NFS mount increased by
60-70% when the server was checking for holes). I am doing this for
what is called ReadPlus (which is like Read, but can represent holes via
<offset, len> in the reply instead of putting lots of 0s on the wire).

Although maintaining holes in the output file is a nice thing to do,
I'm not sure doing chunks as small as 4K makes sense, from a performance
point of view?

Also, ZFS produces some pretty weird numbers for va_size and va_bytes.
I just saw a case in my testing where va_size=5723 and va_bytes=4608.
(Why would it allocate 4608 bytes?)

In D36194#821555, @rmacklem wrote:

This basically looks ok to me. However, I have been recently playing with
sparse files on ZFS and I don't seem to be able to create an imbedded
hole (one not to EOF) which is less than 128K and I seem to be able to
create a hole of just about any size at EOF (truncate grows file size).

How I created a whole in the middle of a file:

dd if=/dev/random of=/tpool/test/rfile bs=1m count=20
dd if=/dev/random of=/tpool/test/rfile oseek=30 bs=1m count=10 conv=notrunc

This creates a file with 20mb of data, a 10mb hole, and then another 10mb of data.

Re: 128k: this will depend on the recordsize (default 128kb)

You can change the recordsize of a dataset at any time. It will only impact newly created files (or files less than one whole record, since they'll get recreated as a new single record)

zfs set recordsize=16k zroot/dataset

then create a new file, and you'll get 16kb granularity on the holes.

Note: it may be helpful to zfs set compression=off pool/dataset on whatever you are testing, as ZFS detects runs of all 0s and turns them into a hole, which may not be what you want when experimenting.

As such, the 512 seems bogus to me and I wonder if ZFS should be
reporting 128K instead?

512 is the smallest it could ever be. From the pathconf man page:

_PC_MIN_HOLE_SIZE
        If a file system supports the reporting of holes (see lseek(2)),
        pathconf() and fpathconf() return a positive number that
        represents the minimum hole size returned in bytes.  The offsets
        of holes returned will be aligned to this same value.  A special
        value of 1 is returned if the file system does not specify the
        minimum hole size but still reports holes.

Since the recordsize can change, and whatever the current setting is, an existing file could be different, it seems that reporting 512 (the minimum block size on ZFS) is the only way to guarantee that the offsets of the holes will be aligned to _PC_MIN_HOLE_SIZE

I've also noticed significant delay when doing SEEK_HOLE/SEEK_DATA
on a ZFS store (compile time for a kernel on an NFS mount increased by
60-70% when the server was checking for holes). I am doing this for
what is called ReadPlus (which is like Read, but can represent holes via
<offset, len> in the reply instead of putting lots of 0s on the wire).

What version of ZFS? (zfs version).

https://github.com/openzfs/zfs/commit/05b3eb6d232009db247882a39d518e7282630753

There is sysctl vfs.zfs.dmu_offset_next_sync that controls ZFS will force a sync to put the system in a consistent state before SEEK_HOLE/SEEK_DATA to ensure it gives accurate results. If flipping this sysctl solves the issue, it may give you a much narrower scope of code to look into.

Although maintaining holes in the output file is a nice thing to do,
I'm not sure doing chunks as small as 4K makes sense, from a performance
point of view?

That is what lead me here, a user reporting upgrading from 12.x to 13.x (when cp switched to copy_file_range()) saw performance drop from 2+ GB/sec to 600 MB/sec. When I looked into it, it was because it was copying 4 KB at a time on 13, because that was the lower bounds in copy_file_range()

I wouldn't be opposed to increasing the lower bounds to at least 16kb or something.

Also, ZFS produces some pretty weird numbers for va_size and va_bytes.
I just saw a case in my testing where va_size=5723 and va_bytes=4608.
(Why would it allocate 4608 bytes?)

This might be related to compression.

get the inode of the file and do:

zdb -dddddd -bbbbbb pool/dataset inodenum

and it will explain a bit more like this:

Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
     2    2   128K     1M  30.0M     512    40M   75.00  ZFS plain file (K=inherit) (Z=inherit=uncompressed)

(the file is 30mb on disk, but 40mb logically)

A few comments:

maxphys is not really optimal for an NFS mount, although it is probably fine. Ideally it would be an exact multiple of mnt_stat.f_iosize (which will normally be the case for 1Mbyte). (I know 1Mbyyte is the default for maxphys these days, but...)

4K is too small. It was only meant to be a sanity check and I was probably thinking of the smallest UFS file system. So, for the "check for 0s" case, which could occur when it is copying between ZFS and a file system that does not support SEEK_DATA/SEEK_HOLE, I think f_iosize would be a better bet.

--> See my inline suggestion, which uses f_iosize if the holesiz is

up to 512 (co-incidentally catching this ZFS weirdness).

sys/kern/vfs_vnops.c
3266–3277	One suggestion is to replace the above line with if (blksize <= 512) so that f_iosize gets used for the case where one file system is ZFS and the other does not support holes.
3281–3283	I/O through the VFS is not really tied to maxphys as far as I know, so I'm not sure using maxphys is better than a hard wired value like 1Mbyte (which is the same at this time, but who knows what maxphys might become?).

allanjude added inline comments.Aug 14 2022, 4:54 AM

sys/kern/vfs_vnops.c
3281–3283	The original reason I did this during experimentation, is it made it easier to try larger and smaller values. If we use f_iosize by changing the above if from 1 to 512, then we can throw away most of the rest of the patch. In ZFs, you could have an f_iosize as large as 16 MB. 1 MB feels reasonable as the upper bound here, but when it is a static value with no control, it feels a bit odd. maxphys might not be the right thing, but, it felt a bit like, if we are on 32bit arm and other platforms where maxphys is reduced to 128 KB, it maybe makes sense to reduce the chunk size here (and therefore the malloc() etc), and if it is larger, then there is likely no reason to do it in smaller chunks.

rmacklem added inline comments.Aug 14 2022, 2:13 PM

sys/kern/vfs_vnops.c
3281–3283	All good points. I think that maxphys is reasonable, based on the fact it can be adjusted and may be smaller on 32bit arches. Maybe a comment explaining why it is used as you described above, could be useful. (I, for one, might go "why is this tied to maxphys" when I look at the code a few years from now, so a comment would help in that case, I think.) 16Mbytes does sound too high for the "look for zeros" case.

rmacklem added inline comments.Aug 14 2022, 2:18 PM

sys/kern/vfs_vnops.c
3266–3277	If you do choose to do this, I think a comment here would also be a good idea, so that "512" doesn't seem to be "magic". At this point, choose whatever you think is best and I'll click reviewed on whatever you choose.

Update based on feedback from rmacklem

Harbormaster completed remote builds in B46935: Diff 109324.Aug 14 2022, 2:22 PM

allanjude retitled this revision from copy_file_range: copy maxphys sized chunks unless hole detection fails to copy_file_range: If _PC_MIN_HOLE_SIZE is small, use mnt_stat.f_iosize.Aug 14 2022, 2:27 PM

allanjude edited the summary of this revision. (Show Details)

Add comment as recommended by rmacklem

Harbormaster completed remote builds in B46936: Diff 109325.Aug 14 2022, 2:28 PM

I've also updated the commit message to reflect the changes, can you give it a look over and make sure you agree with it. Thank you.

This version looks fine to me.

I do have a couple more comments....
Would allowing a greater than maxphys size for the canseek
case make sense (ie. up to 16Mbytes for ZFS instead of 1Mbyte)?
Also, you might want to run this past asomers@, for fuse, etc.

This revision is now accepted and ready to land.Aug 14 2022, 4:00 PM

In D36194#821698, @rmacklem wrote:

This version looks fine to me.

I do have a couple more comments....
Would allowing a greater than maxphys size for the canseek
case make sense (ie. up to 16Mbytes for ZFS instead of 1Mbyte)?

A larger size certainly could make sense. ZFS doesn't go beyond 1 MB without tuning other sysctls, so I don't know if it makes sense to dig too deeply there.

Also, you might want to run this past asomers@, for fuse, etc.

Good idea, added.

As far as fusefs goes, we don't know the true blocksize. The FUSE protocol gives us no way of knowing. So using f_iosize is probably the best that we can do, and that's what this patch will do.

Bu I have another question: why does ZFS always return 512B for _PC_MIN_HOLE_SIZE ? pathconf(2) operates on individual paths, so it could return VTOZ(vp)->z_blksz . But I'm not sure what that will show for an empty file; we must test it.

sys/kern/vfs_vnops.c
3266–3277	I don't think we should ever use a blksize less than f_iosize. With ZFS, writing 4 kB chunks to a file system with a 128 kB record size will result in a lot of read-modify-writes. They'll mostly be resolved in memory before committing to disk, but it's still a lot of extra work.
3269	s/to/is/

rmacklem added inline comments.Aug 14 2022, 8:40 PM

sys/kern/vfs_vnops.c
3266–3277	For the case where the input file is on a file system that does not support holes, the intent was to set blksize to the smallest size of hole supported by the output file. If, for example, the output file was on UFS with a 32K blocksize, then by using 32K blocks aligned to 32K boundaries, it would scan a 32K block read from the input file and, if it were all zeros, would not do the write to the output file (creating an unallocated hole and not a block of bytes set to 0). If it uses a larger block size, then it might miss a case where an unallocated hole could be created on the output file. For the case where the input file supports holes, I tend to agree that using f_iosize should be fine and appropriate. (I'll admit to only knowing a little about UFS and nothing about ZFS, I had assumed that using the hole size would perform as well as larger sizes and that f_iosize is normally set to the same value as the hole size. My bad...)

Looks good to me, but as was told it would be good for ZFS to report more realistic minimal hole, if possible, rather than absolute minimum. I'd also like this to be able to copy multiple blocks at a time, that may be very helpful for small blocks and sync=always.

sys/kern/vfs_vnops.c
3392	I am not sure it is so predictable, neither it makes much sense. But even if so, you could avoid the branching completely by merging this block with the previous.

In D36194#821894, @mav wrote:

Looks good to me, but as was told it would be good for ZFS to report more realistic minimal hole, if possible, rather than absolute minimum. I'd also like this to be able to copy multiple blocks at a time, that may be very helpful for small blocks and sync=always.

Yeah, this gets back to my earlier version, where I was trying to always do maxphys in the cantseek=false case (filesystems support seek data/hole). Maybe instead it make sense to do something like 8 * f_iosize to always do multiple blocks? instead of it being based on maxphys?

Is there an upper limit we want to set, to cap the size of the malloc() of dat?

sys/kern/vfs_vnops.c
3392	Good suggestion

In D36194#821718, @asomers wrote:

As far as fusefs goes, we don't know the true blocksize. The FUSE protocol gives us no way of knowing. So using f_iosize is probably the best that we can do, and that's what this patch will do.

Bu I have another question: why does ZFS always return 512B for _PC_MIN_HOLE_SIZE ? pathconf(2) operates on individual paths, so it could return VTOZ(vp)->z_blksz . But I'm not sure what that will show for an empty file; we must test it.

It does look like we can make _PC_MIN_HOLE_SIZE smarter, and that might also improve things here.

In D36194#822149, @allanjude wrote:

Yeah, this gets back to my earlier version, where I was trying to always do maxphys in the cantseek=false case (filesystems support seek data/hole). Maybe instead it make sense to do something like 8 * f_iosize to always do multiple blocks? instead of it being based on maxphys?

8 * may vary from insanely small 4KB to huge 128MB. I am not sure whether f_iosize may ever be not power of 2 and so not multiple of maxphys, but if so, we could possibly calculate the biggest power of 2 multiple of f_iosize below/around maxphys.

Is there an upper limit we want to set, to cap the size of the malloc() of dat?

It is only limited by amount of sequential KVA, that should not be a huge deal at least on amd64 with ZFS using direct map allocations for the most of ARC. ZFS has maximum transaction size (IIRC 32MB now?), so using half of it should probably give maximum throughput on large systems and going beyond that is pointless. For smaller systems though I have no good ideas.

It isn't coded this way, but I believe the algorithm is as follows:

if input file does not support SEEK_HOLE, but the output file does

(A) - Set the blksize to the hole size for the output file, unless it is
   too small (the <= 512 case. This is the case where the "scan for zeros"
   finds hole in the input file and uses the hole size of the output file
   for both alignment and length, so that the output file does not
   allocate for the hole found by scanning for zeros.

else

(B) - Set the blksize to whatever performs well, with some sort of
   sanity limit, so that malloc() is ok.

When I originally wrote the code, I thought MAX(in_holesiz, out_holesiz)
would be a good fit for (B), but that obviously didn't work for ZFS.
What is a good value?

On mav@'s comment about parallelism...
It would be weird for a syscall to use multiple kernel threads (via taskqueue or..)
to do copies concurrently, but copy_file_range(2) is a weird syscall in any case.
I think this might be worth discussing on freebsd-fs@, but seems beyond what
this review should discuss.
I'll post on freebsd-fs@ and see what others think.

rmacklem mentioned this in D37076: Fix blksize selection in vn_generic_copy_file_range() for ZFS.Oct 21 2022, 12:58 AM

I just created an alternate patch. The main difference is
that it includes rationale in comments, so that it hopefully
does not appear to be "magic math" to code readers in the
future.

I do not care which patch (or a future version of either one)
gets committed, but I think something should go into main, etc.

My variant is D37076.

With D37076 committed should this be abandoned?

I believe so, yes.

copy_file_range: If _PC_MIN_HOLE_SIZE is small, use mnt_stat.f_iosize
AcceptedPublic
Actions

Details

Diff Detail

Event Timeline

Revision Contents
Changeset List

Diff 109325

sys/kern/vfs_vnops.c

copy_file_range: If _PC_MIN_HOLE_SIZE is small, use mnt_stat.f_iosizeAcceptedPublicActions

Details

Diff Detail

Event Timeline

Revision ContentsChangeset List

Diff 109325

sys/kern/vfs_vnops.c

copy_file_range: If _PC_MIN_HOLE_SIZE is small, use mnt_stat.f_iosize
AcceptedPublic
Actions

Revision Contents
Changeset List