Page MenuHomeFreeBSD

read-behind / read-ahead support for zfs_getpages()
ClosedPublic

Authored by avg on Feb 8 2018, 1:53 PM.
Tags
None
Referenced Files
F106741476: D14263.diff
Sat, Jan 4, 5:30 PM
Unknown Object (File)
Tue, Dec 31, 11:19 AM
Unknown Object (File)
Nov 21 2024, 12:00 AM
Unknown Object (File)
Nov 1 2024, 11:48 PM
Unknown Object (File)
Nov 1 2024, 10:44 PM
Unknown Object (File)
Oct 9 2024, 10:38 AM
Unknown Object (File)
Sep 30 2024, 12:47 AM
Unknown Object (File)
Sep 24 2024, 6:30 AM
Subscribers

Details

Summary

ZFS caches blocks it reads in its ARC, so in general the optional
pages are not as useful as with filesystems that read the data
directly into the target pages. But still the optional pages
are useful to reduce the number of page faults and associated
VM / VFS / ZFS calls.
Another case that gets optimized (as a side effect) is paging in
from a hole. ZFS DMU does not currently provide a convenient
API to check for a hole. Instead it creates a temporary zero-filled
block and allows accessing it as if it were a normal data block.
Getting multiple pages one by one from a hole results in repeated
creation and destruction of the temporary block (and an associated
ARC header).

Test Plan

Tested with fsx using various supported blocks sizes from 512 bytes
to 128 KB and additionally 1 MB.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
No Lint Coverage
Unit
No Test Coverage
Build Status
Buildable 14870
Build 14983: arc lint + arc unit

Event Timeline

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu.c
1561

I am not sure why do you busy the page there. You do not drop the object lock until the page is validated, unless I missed something.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c
4534

Is this VM_OBJECT_WLOCK() ? If yes, this locking is only useful on 32bit platforms where vnp_size cannot be read atomically.

But, the following code uses the obj_size after the unlock, which means that it can operate on the outdated data. For UFS it is not a problem because UFS requires exclusive vnode lock for write, so read prevents the file size changes. For ZFS, with EXTENDED_SHARED, it might become the problem.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu.c
1561

Yes, that's true, the object lock is kept over the whole data copy process.
I didn't realize that that allows me to not busy the optional pages, but now I see that I can do that.

On the other hand, maybe holding the object lock for the whole time is not that good.
Having said that, I am not sure if dropping and re-acquiring the object around each optional page copy would be better.

sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c
4534

I think that changing the _file_ size required an exclusive vnode lock.
So, it could be the object size could be get of sync, but, at least, the file size (on-disk object size) should not change.
Not sure if that's good enough, though.

I confess that I'm pretty ignorant of this part of ZFS. I'm afraid that I won't be of much use as a reviewer.

I haven't looked deep, but the general direction makes sense to me. Since ABD introduction sub-block reads become more expensive even for uncompressed data, since dmu_buf_hold_array() copies whole block from scatter/gather to flat buffers. For compressed difference should be even bigger, but there it at least compensated by increased ARC capacity. We noticed it just few days ago as a cause of significant performance drop in some of benchmarks.

  • do not busy the optional pages as they are handled while holding the object lock in the exclusive mode (suggested by kib)
  • as a result the optional pages are explicitly marked active or inactive instead of using vm_page_readahead_finish()
  • more assertions about the state of the pages before the page-in
  • range-lock the paged in blocks before reading them, this also ensures that the file size won't change if we page in up to the end of file

Please note that in illumos and ZoL they do not do the range-locking
in the pagin path. This is because ZFS has a double-caching problem
between ARC and page cache and that requires zfs_read() and zfs_write()
to consult pages in the page cache. So, in those functions they first
lock a range and lock pages corresponding to the range. While in the
page-in (and maybe page-out) path they first lock the pages and then
would lock the range. So, they would have a deadlock.

I believe that FreeBSD does not have that problem, because the page-in
deals only with invalid pages while zfs_read() and zfs_write() need to
access only valid pages. They do not wait on a busy page unless it's
already valid.

This revision is now accepted and ready to land.Feb 10 2018, 4:43 PM
This revision was automatically updated to reflect the committed changes.