ZFS caches the blocks it reads in its ARC, so in general the optional
pages are not as useful as they are for filesystems that read data
directly into the target pages. Still, the optional pages are useful
for reducing the number of page faults and the associated
VM / VFS / ZFS calls.
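For concreteness, here is a minimal sketch, not the actual patch, of how a FreeBSD getpages implementation can treat the optional read-behind / read-ahead pages; fill_page_from_dmu() is a hypothetical helper that reads one page's worth of data and marks the page valid, and error handling is reduced to the bare minimum:

```c
static int
example_getpages(struct vnode *vp, vm_page_t *ma, int count,
    int *rbehind, int *rahead)
{
	vm_object_t object;
	int i;

	object = ma[0]->object;
	VM_OBJECT_WLOCK(object);

	/* The requested pages must be made valid; a failure here is fatal. */
	for (i = 0; i < count; i++) {
		if (fill_page_from_dmu(vp, ma[i]) != 0) {	/* hypothetical */
			VM_OBJECT_WUNLOCK(object);
			return (VM_PAGER_ERROR);
		}
	}

	/*
	 * The optional pages are best effort: filling them now saves
	 * future page faults and the VM / VFS / ZFS calls they imply,
	 * but it is legitimate to report that none were filled.
	 */
	if (rbehind != NULL)
		*rbehind = 0;
	if (rahead != NULL)
		*rahead = 0;

	VM_OBJECT_WUNLOCK(object);
	return (VM_PAGER_OK);
}
```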
Another case that gets optimized (as a side effect) is paging in
from a hole. ZFS DMU does not currently provide a convenient
API to check for a hole. Instead it creates a temporary zero-filled
block and allows accessing it as if it were a normal data block.
Getting multiple pages one by one from a hole results in repeated
creation and destruction of the temporary block (and an associated
ARC header).
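To illustrate the point, a hedged sketch of the difference using the standard dmu_read() interface; the wrapper function, its parameters, and the pre-mapped destination buffers are made up for the example, and error handling is omitted:

```c
static void
hole_read_compare(objset_t *os, uint64_t object, uint64_t offset,
    int npages, void **va, void *buf)
{
	int i;

	/* Page at a time: the temporary zero-filled block (and its ARC
	 * header) is created and destroyed once per page. */
	for (i = 0; i < npages; i++)
		(void)dmu_read(os, object, offset + ptoa(i), PAGE_SIZE,
		    va[i], DMU_READ_PREFETCH);

	/* One covering read: the temporary block is created only once. */
	(void)dmu_read(os, object, offset, ptoa(npages), buf,
	    DMU_READ_PREFETCH);
}
```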
Details
- Reviewers: kib, mav, asomers
- Commits: rS329363: read-behind / read-ahead support for zfs_getpages()
Tested with fsx using the various supported block sizes from 512 bytes
to 128 KB, and additionally 1 MB.
Diff Detail
- Repository: rS FreeBSD src repository - subversion
Event Timeline
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu.c
1561 ↗ (On Diff #39041): I am not sure why you busy the page there. You do not drop the object lock until the page is validated, unless I missed something.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c
4534 ↗ (On Diff #39041): Is this VM_OBJECT_WLOCK()? If yes, this locking is only useful on 32-bit platforms where vnp_size cannot be read atomically. But the following code uses obj_size after the unlock, which means that it can operate on outdated data. For UFS this is not a problem because UFS requires an exclusive vnode lock for writes, so a read prevents file size changes. For ZFS, with EXTENDED_SHARED, it might become a problem.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/dmu.c
1561 ↗ (On Diff #39041): Yes, that's true, the object lock is kept over the whole data copy process. On the other hand, maybe holding the object lock for the whole time is not that good.
sys/cddl/contrib/opensolaris/uts/common/fs/zfs/zfs_vnops.c
4534 ↗ (On Diff #39041): I think that changing the _file_ size requires an exclusive vnode lock.
I confess that I'm pretty ignorant of this part of ZFS. I'm afraid that I won't be of much use as a reviewer.
I haven't looked deeply, but the general direction makes sense to me. Since the ABD introduction, sub-block reads have become more expensive even for uncompressed data, because dmu_buf_hold_array() copies the whole block from scatter/gather to flat buffers. For compressed data the difference should be even bigger, but there it is at least compensated by the increased ARC capacity. We noticed this just a few days ago as the cause of a significant performance drop in some benchmarks.
- do not busy the optional pages, as they are handled while holding the object lock in exclusive mode (suggested by kib)
- as a result, the optional pages are explicitly marked active or inactive instead of using vm_page_readahead_finish()
- more assertions about the state of the pages before the page-in
- range-lock the paged-in blocks before reading them; this also ensures that the file size won't change if we page in up to the end of the file (see the sketch after this list)
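A minimal sketch of that last point, assuming the zfs_range_lock() / zfs_range_unlock() interface of the FreeBSD ZFS code of that time; the function itself is illustrative, not the actual patch:

```c
static void
pagein_rangelock_sketch(znode_t *zp, uint64_t start, uint64_t len)
{
	rl_t *rl;

	/*
	 * Lock the byte range backing the pages before reading it.
	 * Holding the range lock also keeps the file size stable while
	 * we page in up to the end of the file.
	 */
	rl = zfs_range_lock(zp, start, len, RL_READER);
	/* ... dmu_read() the range and mark the pages valid ... */
	zfs_range_unlock(rl);
}
```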
Please note that illumos and ZoL do not do the range-locking
in the page-in path. This is because ZFS has a double-caching problem
between the ARC and the page cache, which requires zfs_read() and zfs_write()
to consult pages in the page cache. So, in those functions they first
lock a range and then lock the pages corresponding to the range, while in the
page-in (and maybe page-out) path they would first lock the pages and then
the range. So, they would have a deadlock.
I believe that FreeBSD does not have that problem, because the page-in
path deals only with invalid pages, while zfs_read() and zfs_write() need to
access only valid pages: they do not wait on a busy page unless it is
already valid.
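For what it's worth, a rough sketch of that argument using vm_page_lookup() and vm_page_is_valid(); dmu_fallback_read() is a hypothetical stand-in for reading through the DMU, and the actual copy from a valid page is omitted:

```c
static int
consult_page_cache(vm_object_t obj, vm_pindex_t pidx, void *buf)
{
	vm_page_t m;

	VM_OBJECT_WLOCK(obj);
	m = vm_page_lookup(obj, pidx);
	if (m == NULL || !vm_page_is_valid(m, 0, PAGE_SIZE)) {
		/*
		 * Missing or invalid page: it may be owned by the page-in
		 * path, which busies pages before taking the range lock.
		 * Do not wait on it; fall back to the DMU, so the two
		 * lock orders never meet.
		 */
		VM_OBJECT_WUNLOCK(obj);
		return (dmu_fallback_read(buf));	/* hypothetical */
	}
	/* Valid page: safe to copy from (copy omitted in this sketch). */
	VM_OBJECT_WUNLOCK(obj);
	return (0);
}
```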