vm_fault_next() relies upon vm_fault_busy_sleep() to clean up fault
state, and this was not accounted for in commit cff67bc43df1.
Fixes: cff67bc43df1 ("vm_fault: only rely on PG_ZERO when the page was newly allocated")
Differential D56234
vm_fault: Reset m_needs_zeroing in vm_fault_busy_sleep()
Authored by markj on Fri, Apr 3, 2:45 PM.
Event Timeline

Comment:
Assuming that this change is right, I'd like to go further and remove the whole PG_ZERO mechanism: it adds non-trivial complexity, I've never seen workloads where it elides a significant number (say, more than 0.1%) of page-zeroing operations, and page zeroing is cheaper now than it used to be. Any thoughts?

Comment:
The mere fact that we slept on the page somewhere in the shadow hierarchy does not mean that first_m should be zeroed.

WRT removing PG_ZERO, do you propose zeroing the page in vm_page_alloc(VM_ALLOC_ZERO)? A long time ago we had a background thread that zeroed pages, and a separate queue for zeroed pages. With the addition of object reservations it is indeed quite unlikely that the page at a specific object pindex happens to be zeroed because it was returned from something like the page table page allocator.

Later note: removing the late PG_ZERO zeroing there would still not remove the need to explicitly zero the page, in the case where the page was found invalid on the object queue.

Comment:
Yes. I removed the background thread, as it was disabled by default and did not work properly with the vm_reserv subsystem (IIRC because free pages belonging to a reservation are counted in v_free_count but are not present in the vm_phys buddy queues, and the background thread did not handle this discrepancy). I also made vm_page_alloc_noobj(VM_ALLOC_ZERO) zero the page, so it is somewhat strange that vm_page_alloc(VM_ALLOC_ZERO) does not.
Also because page table pages are freed to VM_FREEPOOL_DIRECT, whereas fault pages are allocated from VM_FREEPOOL_DEFAULT.
Yes, but IMO we should simply zero it unconditionally. That would remove the need for this extra state tracking in m_needs_zeroing. It would also simplify the page allocator itself.

Comment:
We might call vm_fault_allocate() more than once while handling a given page fault: suppose a top-level object shadows an OBJT_SWAP object with some swap space assigned. We'll allocate the top-level page, set m_needs_zeroing, then allocate a page in the backing object, set m_needs_zeroing again, then try to page in. If the pager doesn't have the page, we'll end up zeroing the top-level page using the wrong value for the flag.

Comment:
Set m_needs_zeroing only for the top-level page, since that's what

Comment:
This is probably fine, but then resetting m_needs_zeroing seems somewhat unneeded. I started thinking that m_needs_zeroing should be removed, and that when we find an invalid page on the shadowing object's queue, we should zero it immediately.

Comment:
Yes, I think that would be better. But since we need to release another EN for the bug, I prefer to first commit a minimal patch.

I'm using this dtrace script to see how often the page allocator returns pre-zeroed pages:

    dtrace -n 'vm_page_alloc_domain_iter:return {@a[args[1]->flags & 4] = count();} vm_page_alloc_noobj_domain:return {@b[args[1]->flags & 4] = count();}'

The two aggregations effectively capture the number of pre-zeroed pages allocated from the default and direct freepools, respectively. When I run builds, for the default freepool the count is effectively zero, but for the direct freepool we usually get >50% pre-zeroed pages. This intuitively makes sense, since during a build most of the activity in the direct freepool will come from pmap.c allocating and freeing PTPs, and pmap.c is the main (the only?) source of pre-zeroed pages.

So rather than removing PG_ZERO outright, IMO we should instead ignore it when allocating from the default free pool. That is:
This may introduce some unnecessary zeroing, but I believe that would be pretty minimal, and I think it's worth it to avoid these kinds of bugs.

Comment:
On my laptop, a clean buildkernel yields:

    0    59295715
    4         525

    0     1907011
    4     2960739

so for the default freepool we allocated only 525 pre-zeroed pages in total, whereas something like 3/5ths of the pages allocated from the direct freepool were pre-zeroed.