This requires some more cleanup and commenting, but is functional and
has survived some stress testing (poudriere, pgbench, a few stress2
tests).
I have a plan to break this up into more digestible reviews, so I don't
think it's worth reading every line of this diff, but please feel free
to comment on the details here.
Background:
Most page queue operations are batched. A request is encoded by
setting aflags on the page and adding the page to a batch; once the
batch is full, its operations are carried out with the page queue lock held.
Requests are made while holding the page lock. Some specific
invariants:
- Request flags (PGA_{DEQUEUE,REQUEUE,REQUEUE_HEAD}) are set with the page lock held.
- Request flags are cleared with the page queue lock held, where the page queue corresponds to the page's queue index and NUMA domain index.
- A page's queue field can only be updated while holding the page queue lock corresponding to the queue field's value, or the page lock if the value is PQ_NONE.
- Similarly, PGA_ENQUEUED can be toggled only while holding the page queue lock for the page's queue field.
Changes:
The idea is to remove the page lock from this system and replace it with
cmpxchg loops. With per-page granularity I expect the number of retries
to be small, and I added a counter for them. I added a 32-bit
vm_page_astate_t which contains aflags (widened to 16 bits), the queue
index and act_count. Code which uses aflags not related to page queue
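state can use vm_page_aflag_{set,clear}() as before. The change
introduces vm_page_astate_fcmpset(), which follows atomic_fcmpset()
semantics: on failure, the caller's copy of the old state is refreshed
with the page's current value.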
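
As a rough illustration of the layout described above, the following
userspace sketch packs the three fields into one 32-bit word and wraps a
C11 compare-exchange. The names astate_t, struct page, astate_load() and
astate_fcmpset(), and the 8-bit widths chosen for the queue index and
act_count, are assumptions made for illustration. C11 atomics stand in
for the kernel's atomic(9) primitives; this is not the code in the diff.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Assumed layout: 16-bit aflags, 8-bit queue index, 8-bit act_count,
 * overlaid on a single 32-bit word so all three can be swapped at once.
 */
typedef union {
	struct {
		uint16_t flags;
		uint8_t queue;
		uint8_t act_count;
	};
	uint32_t bits;
} astate_t;

static_assert(sizeof(astate_t) == sizeof(uint32_t),
    "queue state must fit in one 32-bit word");

/* Hypothetical stand-in for the relevant part of struct vm_page. */
struct page {
	_Atomic uint32_t astate;
};

/* Take a consistent snapshot of flags, queue and act_count. */
static astate_t
astate_load(struct page *m)
{
	astate_t a;

	a.bits = atomic_load(&m->astate);
	return (a);
}

/*
 * Try to replace *old with new. On failure *old is overwritten with the
 * value currently stored in the page, so the caller can recheck and retry.
 * Like any weak compare-exchange, this may also fail spuriously.
 */
static bool
astate_fcmpset(struct page *m, astate_t *old, astate_t new)
{
	return (atomic_compare_exchange_weak(&m->astate, &old->bits,
	    new.bits));
}
```

Because a failed attempt hands back the current state, callers do not
need a separate reload before retrying.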
The common pattern is to use vm_page_astate_load() to atomically load
queue state from a page into a stack variable, perform whatever checks
are needed and possibly abort, copy that state and update as needed, and
call vm_page_pqstate_commit() with the old and new copies.
vm_page_pqstate_commit() attempts to apply the update and also takes
care of physically dequeuing a page in preparation for moving it to
another queue. In particular, in operations which transfer a page
from one queue to another we still must dequeue the page in-place before
creating a batched queue operation to enqueue the page.
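
The load, check, copy, commit pattern can be sketched in userspace as
follows. This is a simplified model under stated assumptions:
pqstate_commit() here only performs the compare-exchange and bumps a
retry counter, whereas the real vm_page_pqstate_commit() also handles
physically dequeuing the page. The flag and queue values are
placeholders, and page_set_queue() and commit_retries are hypothetical
names.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder values, chosen for illustration only. */
#define PGA_DEQUEUE	0x0010
#define PQ_INACTIVE	0
#define PQ_ACTIVE	1
#define PQ_NONE		255

typedef union {
	struct {
		uint16_t flags;
		uint8_t queue;
		uint8_t act_count;
	};
	uint32_t bits;
} astate_t;

static_assert(sizeof(astate_t) == sizeof(uint32_t),
    "queue state must fit in one 32-bit word");

struct page {
	_Atomic uint32_t astate;
};

/* Hypothetical counter of failed commit attempts. */
static _Atomic unsigned long commit_retries;

static astate_t
astate_load(struct page *m)
{
	astate_t a;

	a.bits = atomic_load(&m->astate);
	return (a);
}

/* Simplified commit: swap old for new, refreshing *old on failure. */
static bool
pqstate_commit(struct page *m, astate_t *old, astate_t new)
{
	if (atomic_compare_exchange_weak(&m->astate, &old->bits, new.bits))
		return (true);
	atomic_fetch_add(&commit_retries, 1);
	return (false);
}

/* Move the page to nqueue unless a dequeue has been requested. */
static bool
page_set_queue(struct page *m, uint8_t nqueue)
{
	astate_t old, new;

	old = astate_load(m);		/* snapshot into a stack variable */
	do {
		if ((old.flags & PGA_DEQUEUE) != 0)
			return (false);	/* check failed, abort */
		new = old;		/* copy and update as needed */
		new.queue = nqueue;
	} while (!pqstate_commit(m, &old, new));
	return (true);
}
```

The checks sit inside the loop because a failed commit refreshes the old
copy, so each retry re-evaluates them against the latest state.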
In some ways this approach is actually simpler than the old one.
Previously, the aflags and queue fields were updated independently, so
it was necessary to handle inconsistencies. (See the old versions of
vm_page_queue() and vm_page_dequeue_complete(), for example, or compare
vm_page_dequeue() with and without this patch.) Now, since the aflags
and queue fields are updated atomically, we can get a snapshot of a
page's queue state with vm_page_astate_load(). This also makes it easier
to write assertions.
Details:
There are several scenarios where we perform queue operations, with
different semantics. vm_page_activate(), vm_page_deactivate(),
vm_page_deactivate_noreuse() and vm_page_launder() attempt to enqueue
the page in the corresponding queue. They bail if PGA_DEQUEUE is set
(more on that later). vm_page_activate() just ensures that
act_count >= ACT_INIT if the page is already in PQ_ACTIVE; the others
requeue the page so as to reflect a reference. All of these functions
use vm_page_mvqueue() to perform the queue state update.
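
The rules in this paragraph can be modelled roughly as below. This is a
userspace sketch, not the function from the diff: page_mvqueue() is a
hypothetical stand-in, the constants are placeholder values, and a plain
compare-exchange replaces vm_page_pqstate_commit(), so physical
dequeuing and batching are left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder values, chosen for illustration only. */
#define PGA_DEQUEUE	0x0010
#define PGA_REQUEUE	0x0020
#define PQ_INACTIVE	0
#define PQ_ACTIVE	1
#define PQ_LAUNDRY	2
#define ACT_INIT	5

typedef union {
	struct {
		uint16_t flags;
		uint8_t queue;
		uint8_t act_count;
	};
	uint32_t bits;
} astate_t;

static_assert(sizeof(astate_t) == sizeof(uint32_t),
    "queue state must fit in one 32-bit word");

struct page {
	_Atomic uint32_t astate;
};

static astate_t
astate_load(struct page *m)
{
	astate_t a;

	a.bits = atomic_load(&m->astate);
	return (a);
}

static void
page_mvqueue(struct page *m, uint8_t nqueue)
{
	astate_t old, new;

	old = astate_load(m);
	do {
		/* A pending dequeue wins; leave the page alone. */
		if ((old.flags & PGA_DEQUEUE) != 0)
			return;
		new = old;
		/* Activation only guarantees a minimum act_count. */
		if (nqueue == PQ_ACTIVE && new.act_count < ACT_INIT)
			new.act_count = ACT_INIT;
		if (old.queue != nqueue) {
			/* Different queue: move the page. */
			new.queue = nqueue;
			new.flags |= PGA_REQUEUE;
		} else if (nqueue != PQ_ACTIVE) {
			/* Same non-active queue: requeue to note the reference. */
			new.flags |= PGA_REQUEUE;
		}
	} while (!atomic_compare_exchange_weak(&m->astate, &old.bits,
	    new.bits));
}
```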
Unwiring a page also updates queue state. There are two flavours:
vm_page_unwire(), where the queue index is specified by the caller, and
vm_page_release() and vm_page_release_locked(), which usually put the
page in PQ_INACTIVE. If these functions find that the page is in
PQ_ACTIVE, they set PGA_REFERENCED and leave the page alone. Otherwise
they put it in PQ_INACTIVE. This happens in vm_page_unwire_managed(),
which calls vm_page_release_toq() to perform the queue state update.
vm_page_release_toq() and vm_page_mvqueue() are pretty similar, but
different enough that I think it makes sense to keep them separate.
With this change, wiring also updates queue state. Now that both
(un)wiring and queue state updates are performed without the page lock,
we need to do more work to maintain the lazy dequeue semantics of page
wiring. To this end, vm_page_wire() and vm_page_wire_mapped()
unconditionally set PGA_DEQUEUE. As a result, during page queue scans,
the page daemon can simply check for PGA_DEQUEUE instead of both
PGA_DEQUEUE and wirings before acquiring the object lock.
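
A rough userspace model of this interaction is sketched below. The
names page_wire() and scan_must_skip(), the ref_count field and the
flag value are assumptions for illustration. The sketch shows only the
unconditional PGA_DEQUEUE request and the single-flag check described
above, not the real wiring or page daemon code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Placeholder value, chosen for illustration only. */
#define PGA_DEQUEUE	0x0010

typedef union {
	struct {
		uint16_t flags;
		uint8_t queue;
		uint8_t act_count;
	};
	uint32_t bits;
} astate_t;

static_assert(sizeof(astate_t) == sizeof(uint32_t),
    "queue state must fit in one 32-bit word");

/* Hypothetical stand-in for the relevant parts of struct vm_page. */
struct page {
	_Atomic uint32_t astate;
	_Atomic uint32_t ref_count;
};

static astate_t
astate_load(struct page *m)
{
	astate_t a;

	a.bits = atomic_load(&m->astate);
	return (a);
}

/* Take a wiring and unconditionally request a lazy dequeue. */
static void
page_wire(struct page *m)
{
	astate_t old, new;

	atomic_fetch_add(&m->ref_count, 1);
	old = astate_load(m);
	do {
		new = old;
		new.flags |= PGA_DEQUEUE;
	} while (!atomic_compare_exchange_weak(&m->astate, &old.bits,
	    new.bits));
}

/* Scan-side check: one flag test covers both dequeue requests and wirings. */
static bool
scan_must_skip(struct page *m)
{
	return ((astate_load(m).flags & PGA_DEQUEUE) != 0);
}
```

Since every wiring leaves PGA_DEQUEUE set in this model, the scan never
has to read ref_count to decide whether a page is a candidate.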
The page queue scans update act_count and thus also use
vm_page_pqstate_commit().