The change is somewhat large, sorry. I will try to explain the gist of it,
and then provide a list of things which have changed.
The aim is to reduce contention on page queue locks. Right now, both
the page and page queue locks need to be held to enqueue, requeue or
dequeue a page. Of course, this is not very scalable, and it exacerbates
page lock contention because page locks are first in the lock order.
Consider that we hold the queue lock for the entirety of a PQ_ACTIVE
scan. If a thread attempts to enqueue a page there, it will block with
a page lock held until the scan is complete.
The approach here is to separate queue operations (enqueue, dequeue
and requeue) into two phases. The first phase requires only the page lock,
and schedules a deferred queue operation using per-CPU batch queues.
There is one batch queue per CPU per page queue. Operations are encoded
using atomic flags in the page. The second phase, implemented in
vm_pqbatch_process(), processes a batch queue with the page queue lock
held and carries out the requested queue operations. The second phase is
performed only when the batch is full, so operations on a given page
may be deferred indefinitely.
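The two-phase scheme can be sketched with a small userspace toy model. This is not the kernel code; the names (pqbatch_submit, pqbatch_process, PQ_BATCH_SIZE) and the single-flag encoding are illustrative only, and all locking is elided:

```c
#include <assert.h>
#include <stddef.h>

#define PQ_BATCH_SIZE 8            /* hypothetical batch size */

/* Simplified page: "queued" models membership in a page queue. */
struct page {
	int queued;
	int op_enqueue;            /* pending deferred operation flag */
};

struct pqbatch {
	struct page *pages[PQ_BATCH_SIZE];
	int cnt;
};

/* Phase 2: drain the batch with the (elided) page queue lock held. */
static void
pqbatch_process(struct pqbatch *bq)
{
	for (int i = 0; i < bq->cnt; i++) {
		struct page *m = bq->pages[i];
		if (m->op_enqueue) {
			m->queued = 1;     /* carry out the queue op */
			m->op_enqueue = 0;
		}
	}
	bq->cnt = 0;
}

/* Phase 1: record the operation with only the page lock held. */
static void
pqbatch_submit(struct pqbatch *bq, struct page *m)
{
	m->op_enqueue = 1;                 /* encode the requested op */
	bq->pages[bq->cnt++] = m;
	if (bq->cnt == PQ_BATCH_SIZE)      /* drain only when full */
		pqbatch_process(bq);
}
```

In the real change there is one such batch per CPU per page queue, so phase 1 never contends with other CPUs; only phase 2 takes the queue lock, and it amortizes that acquisition over a whole batch.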
vm_page_enqueue() and vm_page_requeue() always perform deferred
operations. Higher-level APIs (e.g., vm_page_deactivate()) thus perform
deferred queue operations as well. vm_page_dequeue() guarantees that
the page is dequeued before the function returns, and
vm_page_dequeue_deferred() performs a deferred dequeue. vm_page_dequeue()
requires both the page and page queue locks unless a deferred dequeue was
already requested for the page, in which case only the queue lock is required.
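To make the deferred-versus-synchronous distinction concrete, here is a toy model of the two dequeue entry points. The flag names (PGA_ENQUEUED, PGA_DEQUEUE) are illustrative stand-ins for the atomic flags mentioned above, and locking is again elided:

```c
#include <assert.h>

/* Hypothetical flag bits modeling the atomic queue-state flags. */
#define PGA_ENQUEUED	0x1        /* page is physically on a queue */
#define PGA_DEQUEUE	0x2        /* deferred dequeue requested */

struct page {
	int aflags;
};

/* Deferred dequeue: only the page lock is needed; we merely record
 * the request, to be completed later by batch processing. */
static void
page_dequeue_deferred(struct page *m)
{
	m->aflags |= PGA_DEQUEUE;
}

/* Synchronous dequeue: the page is off the queue before we return.
 * With a deferred dequeue already pending only the queue lock is
 * required; otherwise both locks are held (elided here). */
static void
page_dequeue(struct page *m)
{
	m->aflags &= ~(PGA_ENQUEUED | PGA_DEQUEUE);
}
```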
The locking protocol for the queue field of struct vm_page is changed.
The field is only allowed to transition between PQ_NONE and a queue index,
i.e., it cannot transition directly between queue indices. To update the field, the
lock for the from-value must be held. For PQ_NONE this is the page lock,
otherwise it is the corresponding page queue lock. There is one place where
we safely violate this rule for an optimization: in the inactive queue scan,
right before freeing the page. There, we set the field to PQ_NONE directly
with the page lock held. At that point, it is known that the page is physically
removed from the queue and that no queue operations are scheduled, so
the queue lock is not needed in order to complete removal of the page.
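The transition rule for the queue field can be expressed as a small checked setter. This is a sketch of the invariant, not kernel code; the constants mirror the real queue indices but the function is hypothetical, and the lock-for-the-from-value requirement is stated in the comment rather than enforced:

```c
#include <assert.h>

#define PQ_NONE		255
#define PQ_INACTIVE	0
#define PQ_ACTIVE	1

struct page {
	int queue;
};

/* The queue field may only move between PQ_NONE and a queue index;
 * direct index-to-index transitions are disallowed.  The caller must
 * hold the lock for the from-value: the page lock when it is PQ_NONE,
 * the matching page queue lock otherwise (locks elided here). */
static int
page_set_queue(struct page *m, int newq)
{
	if (m->queue != PQ_NONE && newq != PQ_NONE)
		return (-1);       /* illegal direct transition */
	m->queue = newq;
	return (0);
}
```

A page moving from PQ_INACTIVE to PQ_ACTIVE must therefore pass through PQ_NONE, which is what makes the per-value lock rule coherent: each half of the move is protected by exactly one lock.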
Changes:
- vm_phys uses the listq field for freelists rather than plinks.q. We now permit freed pages to reside on page queues. Such pages must be scheduled for a deferred dequeue. The page allocators complete the dequeue before returning the page.
- The page daemon scan loops are substantially different. The idea now is to quickly collect a batch of pages with only the page queue lock held, and then process that batch without touching the page queue lock. This lets us get rid of some of the dancing that must occur to acquire the page and object locks with the page queue lock held.
- When collecting a batch during the PQ_INACTIVE scan, pages in the batch are dequeued, in the anticipation that most of them will be freed. For PQ_ACTIVE and PQ_LAUNDRY scans, we keep pages on the queue: during a PQ_ACTIVE scan, we end up requeuing most pages, and during a PQ_LAUNDRY scan, we keep pages queued until laundering is done.
- The lock dancing in vm_object_terminate_pages() is gone. vm_page_free_prep() schedules a deferred dequeue for the page, so the dequeue operations are already batched. Similarly, now that we use a UMA cache for FREEPOOL_DEFAULT pages, most calls to vm_page_free() do not acquire the free queue lock.
- The PQ_ACTIVE scan is implemented using the CLOCK algorithm. This is to avoid requeue operations during the scan.
- vm_page_deactivate_noreuse() uses a separate set of per-CPU batch queues to implement insertions near the head of the queue. I'm open to suggestions on other ways to implement this.
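The CLOCK-based PQ_ACTIVE scan mentioned above can be sketched as follows. This is a userspace toy model under simplifying assumptions: the hand walks a fixed circular array rather than a TAILQ, and "referenced" stands in for the pmap reference information. The point it illustrates is that pages are examined in place, so neither outcome requires a requeue:

```c
#include <assert.h>

#define NPAGES 4

struct page {
	int referenced;            /* models the reference bit */
	int active;                /* models PQ_ACTIVE membership */
};

static struct page queue[NPAGES];  /* circular PQ_ACTIVE queue */
static int hand;                   /* the clock hand */

/* Give referenced pages a second chance by clearing their bit and
 * leaving them where they are; deactivate unreferenced pages. */
static void
clock_scan(int npages)
{
	for (int i = 0; i < npages; i++) {
		struct page *m = &queue[hand];
		hand = (hand + 1) % NPAGES;  /* advance in place */
		if (!m->active)
			continue;
		if (m->referenced)
			m->referenced = 0;   /* second chance, no requeue */
		else
			m->active = 0;       /* deactivate */
	}
}
```

Because the hand simply wraps around, a page that survives one pass is revisited on the next without ever being moved to the queue tail, which is exactly the requeue traffic the CLOCK scheme eliminates.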