ZFS caches the blocks it reads in its ARC, so in general the optional
pages are not as useful as they are for filesystems that read data
directly into the target pages. Still, the optional pages are useful
for reducing the number of page faults and the associated
VM / VFS / ZFS calls.
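For concreteness, here is a minimal sketch, not the actual patch, of how a FreeBSD getpages implementation can treat the optional read-behind / read-ahead pages; fill_page_from_dmu() is a hypothetical helper that reads one page's worth of data and marks the page valid, and error handling is reduced to the bare minimum:

```c
static int
example_getpages(struct vnode *vp, vm_page_t *ma, int count,
    int *rbehind, int *rahead)
{
	vm_object_t object;
	int i;

	object = ma[0]->object;
	VM_OBJECT_WLOCK(object);

	/* The requested pages must be made valid; a failure here is fatal. */
	for (i = 0; i < count; i++) {
		if (fill_page_from_dmu(vp, ma[i]) != 0) {	/* hypothetical */
			VM_OBJECT_WUNLOCK(object);
			return (VM_PAGER_ERROR);
		}
	}

	/*
	 * The optional pages are best effort: filling them now saves
	 * future page faults and the VM / VFS / ZFS calls they imply,
	 * but it is legitimate to report that none were filled.
	 */
	if (rbehind != NULL)
		*rbehind = 0;
	if (rahead != NULL)
		*rahead = 0;

	VM_OBJECT_WUNLOCK(object);
	return (VM_PAGER_OK);
}
```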
Another case that gets optimized (as a side effect) is paging in
from a hole. ZFS DMU does not currently provide a convenient
API to check for a hole. Instead it creates a temporary zero-filled
block and allows accessing it as if it were a normal data block.
Getting multiple pages one by one from a hole results in repeated
creation and destruction of the temporary block (and an associated
ARC header).
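To illustrate the point, a hedged sketch of the difference using the standard dmu_read() interface; the wrapper function, its parameters, and the pre-mapped destination buffers are made up for the example, and error handling is omitted:

```c
static void
hole_read_compare(objset_t *os, uint64_t object, uint64_t offset,
    int npages, void **va, void *buf)
{
	int i;

	/* Page at a time: the temporary zero-filled block (and its ARC
	 * header) is created and destroyed once per page. */
	for (i = 0; i < npages; i++)
		(void)dmu_read(os, object, offset + ptoa(i), PAGE_SIZE,
		    va[i], DMU_READ_PREFETCH);

	/* One covering read: the temporary block is created only once. */
	(void)dmu_read(os, object, offset, ptoa(npages), buf,
	    DMU_READ_PREFETCH);
}
```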
Details
- Reviewers: kib, mav, asomers
- Commits: rS329363: read-behind / read-ahead support for zfs_getpages()
Tested with fsx using the various supported block sizes from 512 bytes
to 128 KB, and additionally 1 MB.
Diff Detail
- Repository: rS FreeBSD src repository - subversion
Event Timeline
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu.c
1561 ↗ (On Diff #39041): I am not sure why you busy the page there. You do not drop the object lock until the page is validated, unless I missed something.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c
4534 ↗ (On Diff #39041): Is this VM_OBJECT_WLOCK()? If yes, this locking is only useful on 32-bit platforms where vnp_size cannot be read atomically. But the following code uses obj_size after the unlock, which means that it can operate on outdated data. For UFS this is not a problem because UFS requires an exclusive vnode lock for writes, so a read prevents file size changes. For ZFS, with EXTENDED_SHARED, it might become a problem.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu.c
1561 ↗ (On Diff #39041): Yes, that's true, the object lock is kept over the whole data copy process. On the other hand, maybe holding the object lock for the whole time is not that good.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c
4534 ↗ (On Diff #39041): I think that changing the _file_ size requires an exclusive vnode lock.
I confess that I'm pretty ignorant of this part of ZFS. I'm afraid that I won't be of much use as a reviewer.
I haven't looked deeply, but the general direction makes sense to me. Since the ABD introduction, sub-block reads have become more expensive even for uncompressed data, because dmu_buf_hold_array() copies the whole block from scatter/gather to flat buffers. For compressed data the difference should be even bigger, but there it is at least compensated by the increased ARC capacity. We noticed this just a few days ago as the cause of a significant performance drop in some benchmarks.
- do not busy the optional pages, as they are handled while holding the object lock in exclusive mode (suggested by kib)
- as a result, the optional pages are explicitly marked active or inactive instead of using vm_page_readahead_finish()
- more assertions about the state of the pages before the page-in
- range-lock the paged-in blocks before reading them; this also ensures that the file size won't change if we page in up to the end of the file (see the sketch after this list)
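A minimal sketch of that last point, assuming the zfs_range_lock() / zfs_range_unlock() interface of the FreeBSD ZFS code of that time; the function itself is illustrative, not the actual patch:

```c
static void
pagein_rangelock_sketch(znode_t *zp, uint64_t start, uint64_t len)
{
	rl_t *rl;

	/*
	 * Lock the byte range backing the pages before reading it.
	 * Holding the range lock also keeps the file size stable while
	 * we page in up to the end of the file.
	 */
	rl = zfs_range_lock(zp, start, len, RL_READER);
	/* ... dmu_read() the range and mark the pages valid ... */
	zfs_range_unlock(rl);
}
```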
Please note that illumos and ZoL do not do the range-locking
in the page-in path. This is because ZFS has a double-caching problem
between the ARC and the page cache, which requires zfs_read() and zfs_write()
to consult pages in the page cache. So, in those functions they first
lock a range and then lock the pages corresponding to the range, while in the
page-in (and maybe page-out) path they would first lock the pages and then
the range. So, they would have a deadlock.
I believe that FreeBSD does not have that problem, because the page-in
path deals only with invalid pages, while zfs_read() and zfs_write() need to
access only valid pages: they do not wait on a busy page unless it is
already valid.
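For what it's worth, a rough sketch of that argument using vm_page_lookup() and vm_page_is_valid(); dmu_fallback_read() is a hypothetical stand-in for reading through the DMU, and the actual copy from a valid page is omitted:

```c
static int
consult_page_cache(vm_object_t obj, vm_pindex_t pidx, void *buf)
{
	vm_page_t m;

	VM_OBJECT_WLOCK(obj);
	m = vm_page_lookup(obj, pidx);
	if (m == NULL || !vm_page_is_valid(m, 0, PAGE_SIZE)) {
		/*
		 * Missing or invalid page: it may be owned by the page-in
		 * path, which busies pages before taking the range lock.
		 * Do not wait on it; fall back to the DMU, so the two
		 * lock orders never meet.
		 */
		VM_OBJECT_WUNLOCK(obj);
		return (dmu_fallback_read(buf));	/* hypothetical */
	}
	/* Valid page: safe to copy from (copy omitted in this sketch). */
	VM_OBJECT_WUNLOCK(obj);
	return (0);
}
```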