Page MenuHomeFreeBSD

ffs: fix vn_read_from_obj() usage for PAGE_SIZE > block size
ClosedPublic

Authored by chs on Apr 7 2022, 8:15 PM.
Tags
None
Referenced Files
Unknown Object (File)
Sat, Apr 20, 6:19 PM
Unknown Object (File)
Sat, Apr 20, 6:19 PM
Unknown Object (File)
Sat, Apr 20, 6:19 PM
Unknown Object (File)
Sat, Apr 20, 6:19 PM
Unknown Object (File)
Sat, Apr 20, 6:07 PM
Unknown Object (File)
Mon, Apr 8, 6:43 PM
Unknown Object (File)
Sun, Apr 7, 3:52 PM
Unknown Object (File)
Thu, Mar 28, 5:22 AM
Subscribers
None

Details

Summary

vn_read_from_obj() requires that all pages of a vnode (except the last
partial page) be either completely valid or completely invalid,
but for file systems with block size smaller than PAGE_SIZE,
partially valid pages may exist anywhere in the file.
Do not enable the vn_read_from_obj() path in this case.

Sponsored by: Netflix

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

chs requested review of this revision.Apr 7 2022, 8:15 PM
chs created this revision.

I really do not understand this. Basically, you mean that for e.g. pages of 16k and blocks of 4k, we get a page with the 4k valid data, and then the 'tail' of 12k unused. But how would such file appear mapped? Imagine that we need to mmap 32k from the file. Then hardware can only map two page for there, and for userspace it would be seen as 4k valid/12k zeroes/next 4k (which in fact should start at the place where 12k started)/12k zeroes.

Do I miss something?

No, in this example of 16k pages and 4k blocks, the first page would contain the first 4 blocks, the second page would contain the next 4 blocks, etc, so mmap'ing the file works as you would expect.

What I meant about partially valid pages is that the buffer cache assumes that it's fine to read any one buffer/block separately from any other buffer/block. The buffer cache will only do one disk read to read a buffer, so if the blocks that back a particular page are discontiguous on disk, then that one disk read can only make part of the underlying page valid. The vnode pager has code to make a page fully valid by issuing multiple disk reads if needed, but the buffer cache doesn't do that. This change was just intended to make the 16k page case work at all, and further optimization can be done later.

Looking at this again, I realized that this change does have the effect of disabling this fast path for the page containing EOF even for 4k page kernels, which wasn't intended. For now we could make the skip condition different depending on page size, eg.:

if (PAGE_SIZE > 4096 ? !vm_page_all_valid(ma[i]) : vm_page_none_valid(ma[i])) ...

That would preserve the existing behavior for 4k page kernels while also letting 16k page kernels work.

sys/kern/vfs_vnops.c
1035 ↗(On Diff #104747)

There is this calculation of the EOF, you might remember if not fully valid page appeared somewhere and then recheck that it is indeed at EOF.

IMO adding PAGE_SIZE == 4k check is between gross and bug. In fact, you want a flag for filesystem if partially valid pages are indeed possible at EOF only.

sys/kern/vfs_vnops.c
1035 ↗(On Diff #104747)

Would it be sufficient to add a new vm_page_prefix_valid() (or whatever) that returns true when the page is 1) not completely invalid, and 2) all invalid chunks in the page are contiguous and aren't followed by valid chunks?

bool
vm_page_prefix_valid(vm_page_t m)
{
    vm_page_bits_t valid;

    valid = m->valid;
    if (valid == VM_PAGE_BITS_ALL)
        return (true);
    valid ^= valid >> 1;
    return ((valid & 1) != 0 && __bitcountl(valid) == 2);
}

Then we don't need a PAGE_SIZE == 4K check (or maybe it could be hidden in the vm_page layer).

sys/kern/vfs_vnops.c
1035 ↗(On Diff #104747)

But what if the page is not at EOF and the next page against contains valid content?

IMO the right solution for UFS is to not set VIRF_PGREAD at all if page size is larger than the fragment size.

I like the approach suggested by kib that ufs_open() not set VIRF_PGREAD when the page size is larger than the fragment size.

chs retitled this revision from vfs: fix vn_read_from_obj() for PAGE_SIZE > block size to ffs: fix vn_read_from_obj() usage for PAGE_SIZE > block size.
chs edited the summary of this revision. (Show Details)

Updated the diff based on review comments.
The suggestion was to only enable vn_read_from_obj() if the fs fragment size is larger than the page size, but it's actually fine to use vn_read_from_obj() as long as the fs block size is greater than or equal to the page size, so I made the check that way.

sys/ufs/ufs/ufsmount.h
99

I do not think block size needs to be 64bit.

This revision is now accepted and ready to land.Jun 22 2022, 5:06 AM