Page MenuHomeFreeBSD

Revamp the page fault handler's default page clustering strategy
ClosedPublic

Authored by alc on Jan 11 2015, 7:34 PM.
Tags
None
Referenced Files
Unknown Object (File)
Wed, Nov 27, 1:19 PM
Unknown Object (File)
Oct 3 2024, 3:06 PM
Unknown Object (File)
Oct 1 2024, 12:01 PM
Unknown Object (File)
Oct 1 2024, 5:17 AM
Unknown Object (File)
Sep 18 2024, 5:01 PM
Unknown Object (File)
Sep 13 2024, 11:40 AM
Unknown Object (File)
Sep 5 2024, 9:10 AM
Unknown Object (File)
Sep 4 2024, 8:25 PM
Subscribers
None

Details

Summary

Revamp the default page clustering strategy that is used by the page fault
handler. For roughly twenty years, the page fault handler has used the
same basic strategy: Fetch a fixed number of non-resident pages both ahead
and behind the virtual page that was faulted on. Over the years,
alternative strategies have been implemented for optimizing the handling
of random and sequential access patterns, but the only change to the
default strategy has been to increase the number of pages read ahead to 7
and behind to 8.

The problem with the default page clustering strategy becomes apparent
when you look at how it behaves on the code section of an executable or
shared library. (To simplify the following explanation, I'm going to
ignore the read that is performed to obtain the header and assume that no
pages are resident at the start of execution.) Suppose that we have a
code section consisting of 32 pages. Further, suppose that we access
pages 4, 28, and 16 in that order. Under the default page clustering
strategy, we page fault three times and perform three I/O operations,
because the first and second page faults only read a truncated cluster of
12 pages. In contrast, if we access pages 8, 24, and 16 in that order, we
only fault twice and perform two I/O operations, because the first and
second page faults read a full cluster of 16 pages. In general, truncated
clusters are more common than full clusters.

To address this problem, this revision changes the default page clustering
strategy to align the start of the cluster to a page offset within the vm
object that is a multiple of the cluster size. This results in many fewer
truncated clusters. Returning to our example, if we now access pages 4,
28, and 16 in that order, the cluster that is read to satisfy the page
fault on page 28 will include page 16. So, the access to page 16 will no
longer page fault and perform an I/O operation.

Since the revised default page clustering strategy is typically reading
more pages at a time, we are likely to read a few more pages that are
never accessed. However, for the various programs that we looked at,
including clang, emacs, firefox, and openjdk, the reduction in the number
of page faults and I/O operations far outweighed the increase in the
number of pages that are never accessed. Moreover, the extra resident
pages allowed for many more superpage mappings. For example, if we look
at the execution of clang during a buildworld, the number of (hard) page
faults on the code section drops by 26%, the number of superpage mappings
increases by about 29,000, but the number of never accessed pages only
increases from 30.38% to 33.66%.

In collaboration with: Emily Pettigrew @ Rice Univ.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

alc retitled this revision from to Revamp the page fault handler's default page clustering strategy.
alc updated this object.
alc edited the test plan for this revision. (Show Details)
alc added reviewers: jhb, kib.
vm/vm_fault.c
111 ↗(On Diff #3117)

Should compile-time assertion be added to check that VM_FAULT_READ_DEFAULT is multiple of say, 4 or 8 ?

586 ↗(On Diff #3117)

Could you, please, explain how this formula makes into the formula from comment ?

588 ↗(On Diff #3117)

Even for the sequential faults, would it make sense to read to the supposed block boundary ?

605 ↗(On Diff #3117)

We can read the native file system block size on the vm_object creation, and propagate it to the shadow object on COW mappings. Then, there would be no need to guess block size there.

vm/vm_fault.c
111 ↗(On Diff #3117)

None of the arithmetic below currently depends on VM_FAULT_READ_DEFAULT being a power of 2.

586 ↗(On Diff #3117)

If "read ahead min" is 7, then the sequence of read ahead request sizes will be (sequential fault #, read ahead pages): (0,7), (1,15), (2,24), (3,31), (4,31), (5,31), ...

To be clear, the first time that "fs.pindex == fs.entry->next_read" counts as the zeroth sequential fault.

Essentially, after the zeroth sequential fault, the code takes the previous number of read ahead pages and adds "read ahead min" + 1 to it.

This is the same strategy as before. The code is written a little differently to mesh with the changes to the default strategy, but it should be functionally identical to what it was before.

588 ↗(On Diff #3117)

I'm not sure that explicit action is needed. As a side-effect of the new non-sequential fault strategy, I believe that sequential read ahead will wind up being aligned. Suppose that the first fault happens at an arbitrary point in a file and then we read sequentially. If the non-sequential fault strategy below ends its read at a block boundary, the sequential fault strategy will start out aligned and should stay aligned.

vm/vm_fault.c
111 ↗(On Diff #3117)

To be clear, while I don't think that non-power-of-2 values will ever be useful, I don't think a non-power-of-2 value causes breakage. Normally, I would only add a CTASSERT if breakage would occur.

605 ↗(On Diff #3117)

In general, this would be a good thing to have. Jeff has long complained about the frequency of bogus page usage because it typically implies that the page daemon has reclaimed some but not all pages making up a block, which still leaves us needing to perform an I/O on the block. It would be nice to make the reclamation block aware.

kib edited edge metadata.
This revision is now accepted and ready to land.Jan 12 2015, 11:45 PM
jhb edited edge metadata.

The reasoning sounds good to me and makes sense and the code looks fine as far as I can tell as well.

alc updated this revision to Diff 3212.

Closed by commit rS277255 (authored by @alc).