Page MenuHomeFreeBSD

Detect reads from the hole.
ClosedPublic

Authored by kib on Mar 31 2018, 11:31 AM.
Tags
None
Referenced Files
Unknown Object (File)
Dec 22 2023, 11:02 PM
Unknown Object (File)
Aug 7 2023, 5:16 PM
Unknown Object (File)
Aug 2 2023, 6:05 PM
Unknown Object (File)
Jul 15 2023, 4:39 PM
Unknown Object (File)
Jan 1 2023, 12:11 AM
Unknown Object (File)
Nov 28 2022, 8:54 PM
Subscribers

Details

Summary

If UFS file is read from the hole, detect it and optimize by avoiding instantiating the buffer, allocating pages and bmap-ing it.

TODO: calculate the number of blocks used by fully populated file of the given size and use it to cheaply avoid bmap.

Suggested by: jeff

Test Plan

Peter, could you, please, test this change ?

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

sys/ufs/ffs/ffs_vnops.c
572–576 ↗(On Diff #40939)

The problem with this is paying the cost of bmap on every read which is very expensive. Ideally we'd bmap before allocating pages but after looking up the buf.

Maybe the buffer cache could detect this and we could use a new GB_ flag.

sys/ufs/ffs/ffs_vnops.c
572–576 ↗(On Diff #40939)

If indirect blocks are already read, then the cost is mostly the call and the walk over the pointers. I do not see how would the buffer cache detect the hole without doing VOP_BMAP() as well.

I thought about adding GB flag and failing bread() with specific error code, but this only changed the place where VOP_BMAP() is done. I will look some more at this.

The comment indicates how the normal reads of non-sparse files can be optimized.

For what it's worth, I did test this with my sparse file dd test and we well exceed the performance of linux at this benchmark now that we're using the same technique. Unfortunately it defeats a convenient way to create a lot of paging traffic.

sys/ufs/ffs/ffs_vnops.c
572–576 ↗(On Diff #40939)

The cost is going to be the getblk() call for every level of indirection. At high data rates getblk()/brelse() consumes a large fraction of the time.

bread could see if the data is cached. If not it will have to do bmap in any event. So you could lookup the buf, lookup the pages, if they don't exist, bmap, then vm_page_grab if it is not sparse. If it is sparse, you could create a buf that holds pointers to zero pages which have to be fixed on write. Or you could return an error code.

In D14917#313652, @jeff wrote:

For what it's worth, I did test this with my sparse file dd test and we well exceed the performance of linux at this benchmark now that we're using the same technique. Unfortunately it defeats a convenient way to create a lot of paging traffic.

Are you saying that the patch from the review helps, or we outperform linux in the test already, without the patch ?

This is another variant, which short-circuits ufs_strategy() right after bmap is done. As result, we deallocate the fresh buffer for hole instead of leaving it to buffer daemon, and do not instantiate file blocks.

This version is much slower than the other version but still much faster than head. I think there is some bug though because after a while it started consuming space.

I think if we can save the page allocation by doing bmap earlier as I described and make a uiozero this will be as fast as the first version if not faster.

sys/kern/vfs_bio.c
3892 ↗(On Diff #40943)

I would prefer we keep with the sparse terminology.

4105 ↗(On Diff #40943)

Some prefix indicating not to return sparse would be better. NOSPARSE, FAILSPARSE, etc.

sys/ufs/ffs/ffs_vnops.c
474 ↗(On Diff #40943)

Do we not have a uiozero()? If not we need one. It will save a memory fetch and cache pollution.

622–623 ↗(On Diff #40943)

This part I like much better overall. No impact to the normal path.

We could do the VOP_BMAP before strategy though to avoid even more work. If the filesystem passes in a parameter saying it doesn't want sparse bufs we can assume VOP_BMAP will work and do it right away in getblk(). It saves us from needing the B_FLAG to preserve an argument to getblk().

Ideally this could be done before pages are allocated. That should work.

Rename _HOLE to _NOSPARSE.
Fix bug where vn_io_fault() holds were ignored.

Really commit and regenerate diff for flags renames.

Rework the patch to not even create buffer and to not allocate page for holes, by doing BMAP early. Also, the BMAP result is saved and the block number is not recalculated in the strategy.

We should think about what other filesystems could be trivially converted to this interface.

sys/kern/vfs_bio.c
2149–2152 ↗(On Diff #41211)

The only thing I don't really like in this version of the patch is the error handling. This is good to test though.

We could make an internal version of getblk that returns an error and takes a buf **. It may be useful in other circumstances.

sys/ufs/ffs/ffs_vnops.c
474 ↗(On Diff #40943)

again is there no uiozero? Can we do that?

sys/ufs/ffs/ffs_vnops.c
474 ↗(On Diff #40943)

It is easy to implement, but where do you intend to use uio_zero() ? I can think about /dev/zero zero_read() and uiomove_object_page().

Note that ffs_read_hole() is not a suitable user of the function, due to vn_io_fault().

Also, what do you mean by the memory fetch save ?

Add getblkx().

One moment which I am unsure about are the error returns from cluster_read() and bread(). Now that getblkx() can return a wide spectrum of errors, I keep the leaking out of the read functions. I think it is useful to provide the detailed classification of errors from getblkx(). I can translate all of them except EJUSTRETURN back to EBUSY, as it was before, but I would try to leave as is.

I investigated some limited number of callers, and I believe that it is fine. If testing reveals something, I will convert to EBUSY in cluster_read() and breadn_flags().

Do not optimize UIO_NOCOPY. In this case, the pages are already allocated.

It would be nice to implement it in other filesystems that support sparse files.

This revision is now accepted and ready to land.May 11 2018, 9:23 AM
In D14917#324431, @jeff wrote:

It would be nice to implement it in other filesystems that support sparse files.

Which ? The only other significant writeable filesystem which uses buffer cache is msdosfs, where we cannot have a hole on FAT. Support for holes read optimization could be useful for zfs and tmpfs. For tmpfs we already have the short-circuit to avoid instantiate the backing page on read. ZFS is out of my interest.

kib added a subscriber: pho.

Tested with all of the stress2 tests + a buildworld / installworld.
No problems seen.

This revision was automatically updated to reflect the committed changes.