There are actually two unrelated changes.
The first one stops taking the object wlock after pmap_enter() is done in the main vm_fault flow. I believe the lock is not needed there for anything except dropping the paging-in-progress count, so I moved the pip decrement before pmap_enter(), while still owning the vm object lock. This makes it possible for vm_object_terminate_pages() to find a busy page on the object queue. Instead of asserting that the pages are not busy, I restart the termination loop once the busy state has passed. I expect that vm_page_free_prep() is idempotent.
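To make the restart logic and the reliance on idempotency concrete, here is a minimal user-space model of the idea. It is only a sketch, not the patch: `struct model_page`, `model_wait_unbusy()`, `model_free_prep()` and `model_terminate_pages()` are hypothetical names invented for illustration, there is no locking, and the wait is simulated by clearing a flag instead of sleeping.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a page; these are NOT the kernel's structures. */
struct model_page {
	bool busy;       /* stands in for the page busy state */
	bool prepped;    /* set by the (idempotent) prep step */
	int  prep_calls; /* how many times prep ran on this page */
};

/* Assumption: stand-in for sleeping until the busy state has passed. */
static void
model_wait_unbusy(struct model_page *p)
{
	p->busy = false;
}

/* Idempotent prep: repeated calls leave the page in the same state. */
static void
model_free_prep(struct model_page *p)
{
	assert(!p->busy);
	p->prepped = true;
	p->prep_calls++;
}

/*
 * Walk all pages.  On finding a busy page, wait for it and restart the
 * walk from the beginning instead of asserting.  Pages visited before
 * the restart are prepped again, which is why prep must be idempotent.
 * Returns the number of restarts.
 */
static int
model_terminate_pages(struct model_page *pages, size_t n)
{
	int restarts = 0;
	size_t i;

restart:
	for (i = 0; i < n; i++) {
		if (pages[i].busy) {
			model_wait_unbusy(&pages[i]);
			restarts++;
			goto restart;
		}
		model_free_prep(&pages[i]);
	}
	return (restarts);
}
```

In this model, a busy page in the middle of the queue causes every page before it to be prepped twice, so the approach is only safe if a second prep call is harmless.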
The second one is less delicate. For the vm_fault_prefault() call from vm_fault_soft_fast(), extend the scope of the object rlock to avoid re-taking it inside vm_fault_prefault(). As a result, pmap_enter_quick() is sometimes called with the shadow object lock held as well as the page lock, but this looks innocent.
Both cases were pointed out by mjg, who also benchmarked the change. The patches have not been otherwise tested yet.