vm_fault: Reset m_needs_zeroing in vm_fault_busy_sleep()
ClosedPublic

Authored by markj on Fri, Apr 3, 2:45 PM.
Details

Summary

vm_fault_next() relies upon vm_fault_busy_sleep() to clean up fault
state, and this was not accounted for in commit cff67bc43df1.

Fixes: cff67bc43df1 ("vm_fault: only rely on PG_ZERO when the page was newly allocated")

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

markj requested review of this revision. Fri, Apr 3, 2:45 PM

Assuming that this change is right, I'd like to go further and remove the whole PG_ZERO mechanism: it adds non-trivial complexity, I've never seen workloads where it elides a significant number (say, more than 0.1%) of page zero operations, and page-zeroing is cheaper now than it used to be. Any thoughts?

The mere fact that we slept on the page somewhere in the shadow hierarchy does not mean that first_m should be zeroed.
But, if we go this route, then IMO m_needs_zero should be reset to true right at the RetryFault, and be done with it.

WRT removing PG_ZERO, do you propose zeroing the page in vm_page_alloc(VM_ALLOC_ZERO)? A long time ago we had a background thread that zeroed pages, and a separate queue for zeroed pages. With the addition of object reservations, it is indeed quite unlikely that the page at a specific object pindex happens to be zero because it was returned from something like the page table page allocator.

Later note: removing the PG_ZERO late zeroing there would still not remove the need to explicitly zero the page, in case the page was found invalid on the object queue.

In D56234#1287011, @kib wrote:

The mere fact that we slept on the page somewhere in the shadow hierarchy does not mean that first_m should be zeroed.
But, if we go this route, then IMO m_needs_zero should be reset to true right at the RetryFault, and be done with it.

WRT removing PG_ZERO, do you propose zeroing the page in vm_page_alloc(VM_ALLOC_ZERO)? A long time ago we had a background thread that zeroed pages, and a separate queue for zeroed pages.

Yes. I removed the background thread, as it was disabled by default and did not work properly with the vm_reserv subsystem (IIRC because free pages belonging to a reservation are counted in v_free_count but are not present in the vm_phys buddy queues, and the background thread did not handle this discrepancy). I also made vm_page_alloc_noobj(VM_ALLOC_ZERO) zero the page, so it is somewhat strange that vm_page_alloc(VM_ALLOC_ZERO) does not.

With the addition of object reservations, it is indeed quite unlikely that the page at a specific object pindex happens to be zero because it was returned from something like the page table page allocator.

Also because page table pages are freed to VM_FREEPOOL_DIRECT, whereas fault pages are allocated from VM_FREEPOOL_DEFAULT.

Later note: removing the PG_ZERO late zeroing there would still not remove the need to explicitly zero the page, in case the page was found invalid on the object queue.

Yes, but IMO we should simply zero it unconditionally. That would remove the need for this extra state tracking in m_needs_zeroing. It would also simplify the page allocator itself.

Reset m_needs_zeroing after the RetryFault label.

This revision is now accepted and ready to land. Sat, Apr 4, 10:47 AM

We might call vm_fault_allocate() more than once while handling a given page fault: suppose a top-level object shadows an OBJT_SWAP object with some swap space assigned. We'll allocate the top-level page, set m_needs_zeroing, then allocate a page in the backing object, set m_needs_zeroing again, then try to page in. If the pager doesn't have the page, we'll end up zeroing the top-level page using the wrong value for the flag.

Set m_needs_zeroing only for the top-level page, since that's what
vm_fault_zerofill() operates on.

This revision now requires review to proceed. Mon, Apr 6, 4:02 AM

This is probably fine, but then resetting m_needs_zeroing seems somewhat unneeded. I have started thinking that m_needs_zeroing should be removed, and that when we find an invalid page on the shadowing object's queue, we should zero it immediately.

This revision is now accepted and ready to land. Mon, Apr 6, 7:05 AM
In D56234#1287417, @kib wrote:

This is probably fine, but then resetting m_needs_zeroing seems somewhat unneeded. I have started thinking that m_needs_zeroing should be removed, and that when we find an invalid page on the shadowing object's queue, we should zero it immediately.

Yes, I think that would be better. But, since we need to release another EN for the bug, I prefer to first commit a minimal patch.

Assuming that this change is right, I'd like to go further and remove the whole PG_ZERO mechanism: it adds non-trivial complexity, I've never seen workloads where it elides a significant number (say, more than 0.1%) of page zero operations, and page-zeroing is cheaper now than it used to be. Any thoughts?

I'm using this dtrace script to see how often the page allocator returns pre-zeroed pages:

dtrace -n 'vm_page_alloc_domain_iter:return {@a[args[1]->flags & 4] = count();} vm_page_alloc_noobj_domain:return {@b[args[1]->flags & 4] = count();}'

The two aggregations effectively capture the number of pre-zeroed pages allocated from the default and direct freepools, respectively. When I run builds, for the default freepool, the count is effectively zero, but for the direct freepool we usually get >50% pre-zeroed pages. This intuitively makes sense, since during a build most of the activity in the direct freepool will come from pmap.c allocating and freeing PTPs, and pmap.c is the main (the only?) source of pre-zeroed pages.

So rather than removing PG_ZERO outright, IMO we should instead ignore it when allocating from the default free pool. That is:

  1. Make vm_page_alloc_* handle VM_ALLOC_ZERO by zeroing the page unconditionally.
  2. Change the fault handler to ignore PG_ZERO and always zero a new top-level page if there is no backing page.
  3. Change the fault handler to immediately zero a resident top-level page if it is invalid.

This may introduce some unnecessary zeroing, but I believe that would be pretty minimal, and I think it's worth it to avoid these kinds of bugs.

I'm using this dtrace script to see how often the page allocator returns pre-zeroed pages:

dtrace -n 'vm_page_alloc_domain_iter:return {@a[args[1]->flags & 4] = count();} vm_page_alloc_noobj_domain:return {@b[args[1]->flags & 4] = count();}'

The two aggregations effectively capture the number of pre-zeroed pages allocated from the default and direct freepools, respectively. When I run builds, for the default freepool, the count is effectively zero, but for the direct freepool we usually get >50% pre-zeroed pages. This intuitively makes sense, since during a build most of the activity in the direct freepool will come from pmap.c allocating and freeing PTPs, and pmap.c is the main (the only?) source of pre-zeroed pages.

On my laptop, a clean buildkernel yields:

default freepool (vm_page_alloc_domain_iter):
  0         59295715
  4              525

direct freepool (vm_page_alloc_noobj_domain):
  0          1907011
  4          2960739

so for the default freepool, we only allocated 525 pre-zeroed pages in total, whereas something like 3/5ths of the pages allocated from the direct freepool were prezeroed.

This revision was automatically updated to reflect the committed changes.