Introduce atomic per-page queue state operations.
AbandonedPublic
Actions

Authored by markj on Sep 11 2019, 11:57 PM.

Details

Reviewers

Summary

This requires some more cleanup and commenting, but is functional and
has survived some stress testing (poudriere, pgbench, a few stress2
tests).

I have a plan to break this up into more digestible reviews, so I don't
think it's worth reading every line of this diff, but please feel free
to comment on the details here.

Background:
Most page queue operations are batched. The requested operation is
encoded by setting aflags and enqueuing the page, and the operation is
carried out with the page queue lock held once a batch is full.
Requests are made while holding the page lock. Some specific
invariants:

Request flags (PGA_{DEQUEUE,REQUEUE,REQUEUE_HEAD}) are set with the page lock held.
Request flags are cleared with the page queue lock held, where the page queue corresponds to the page's queue index and NUMA domain index.
A page's queue field can only be updated while holding the page queue lock corresponding to the queue field's value, or the page lock if the value is PQ_NONE.
Similarly, PGA_ENQUEUED can be toggled only while holding the page queue lock for the page's queue field.

Changes:
The idea is to remove the page lock from this system and replace it with
cmpxchg loops. With per-page granularity I expect the number of retries
to be small, and I added a counter for them. I added a 32-bit
vm_page_astate_t which contains aflags (widened to 16 bits), the queue
index and act_count. Code which uses aflags not related to page queue
state can use vm_page_aflag_{set,clear}() as before. The change
introduces vm_page_astate_fcmpset(), which behaves as you'd expect.

The common pattern is to use vm_page_astate_load() to atomically load
queue state from a page into a stack variable, perform whatever checks
are needed and possibly abort, copy that state and update as needed, and
call vm_page_pqstate_commit() with the old and new copies.
vm_page_pqstate_commit() attempts to apply the update and also takes
care of physically dequeuing a page in preparation for moving it to
another queue. In particular, in operations which transfer a page
from one queue to another we still must dequeue the page in-place before
creating a batched queue operation to enqueue the page.

In some ways this approach is actually simpler than the old one.
Previously, the aflags and queue fields were updated independently, so
it was necessary to handle inconsistencies. (See the old versions of
vm_page_queue() and vm_page_dequeue_complete() for example, or compare
vm_page_dequeue() with and without this patch). Now, since the aflags
and queue fields are updated atomically, we can get a snapshot of a
page's queue state with vm_page_astate_load(). This also makes it easier
to write assertions.

Details:
There are several scenarios where we perform queue operations, with
different semantics. vm_page_activate(), vm_page_deactivate(),
vm_page_deactivate_noreuse() and vm_page_launder() attempt to enqueue
the page in the corresponding queue. They bail if PGA_DEQUEUE is set
(more on that later). vm_page_activate() just ensures that
act_count >= ACT_INIT if the page is already in PQ_ACTIVE; the others
requeue the page so as to reflect a reference. All of these functions
use vm_page_mvqueue() to perform the queue state update.

Unwiring a page also updates queue state. There are two flavours:
vm_page_unwire(), where the queue index is specified by the caller, and
vm_page_release() and vm_page_release_locked(), which usually put the
page in PQ_INACTIVE. If these functions find that the page is in
PQ_ACTIVE, they set PGA_REFERENCED and leave the page alone. Otherwise
they put it in PQ_INACTIVE. This happens in vm_page_unwire_managed(),
which calls vm_page_release_toq() to perform the queue state update.
vm_page_release_toq() and vm_page_mvqueue() are pretty similar, but
different enough that I think it makes sense to keep them separate.

With this change, wiring also updates queue state. Now that both
(un)wiring and queue state updates are performed without the page lock,
we need to do more work to maintain the lazy dequeue semantics of page
wiring. To this end, vm_page_wire() and vm_page_wire_mapped()
unconditionally set PGA_DEQUEUE. As a result, during page queue scans,
the page daemon can simply check for PGA_DEQUEUE instead of both
PGA_DEQUEUE and wirings before acquiring the object lock.

The page queue scans update act_count and thus also use
vm_page_pqstate_commit().

Test Plan

will-it-scale shows a small (1-2%) improvement in throughput with
single-threaded faults on anonymous memory.

mjg reported that the pwrite1 benchmark on tmpfs gets 50% more
op/s with this patch applied. uiomove_object_page() previously
had to acquire the page lock even in the common case where the
page is already in the active queue and doesn't require and queue
state updates.

Diff Detail

Lint

Lint Passed

Unit

No Test Coverage

Build Status

Buildable 26448
Build 24881: arc lint + arc unit

Event Timeline

markj created this revision.Sep 11 2019, 11:57 PM

Herald added a subscriber: delphij. · View Herald TranscriptSep 11 2019, 11:57 PM

Harbormaster completed remote builds in B26411: Diff 61958.Sep 11 2019, 11:57 PM

markj edited the test plan for this revision. (Show Details)Sep 12 2019, 12:02 AM

markj added a reviewer: jeff.

markj edited the summary of this revision. (Show Details)Sep 12 2019, 12:08 AM

jeff added inline comments.Sep 12 2019, 2:24 AM

sys/amd64/include/pmap.h
427	Is #define aflags astate.flags too intrusive? I might give this a shorter name if so. a.flags Otherwise vm_page_aflags().
sys/kern/subr_pidctrl.c
144 ↗	(On Diff #61958)	Is this just whitespace?
sys/vm/vm_object.c
2317	vm_page_queue()?
sys/vm/vm_page.c
2577	In general I have to write these clauses as wired tests first busy second because I convert them to trybusy() so I need to know if they ever succeeded.
3120	Maybe break this comment up into stages as you deal with the separate races.
sys/vm/vm_pageout.c
904	trailing ;
1229–1247	I feel like this might want to be a dedicated function since so much is rewritten. vm_pageout_active_page() vm_pageout_inactive_page() As we have simplified the control flow perhaps this is possible now without a bunch of pointer arguments.

jeff added inline comments.Sep 12 2019, 4:00 AM

sys/vm/vm_swapout.c
186–199	This chunk is perhaps repeated in vm_pageout_scan_active and _inactive with the difference there that we check for zero reference counts and obj dead which presumably we know not to be true here. I also wonder if these loops could be written as old = astate = vm_page_astate_load() with the resulting iterators always operating on 'astate' rather than a mix of old/new. I think the old/new paradigm is somewhat confusing.

jeff added inline comments.Sep 12 2019, 6:58 PM

sys/vm/vm_page.c
3864–3865	Should this check just be at the top of pqstate_commit()?

markj marked 4 inline comments as done.Sep 12 2019, 10:13 PM

markj added inline comments.

sys/amd64/include/pmap.h
427	Yeah, even UMA uses "aflags" as an identifier. I'm kind of torn. I tend to prefer wrappers since they make refactoring easier. I agree that I should be using vm_page_aflags() here though.
sys/kern/subr_pidctrl.c
144 ↗	(On Diff #61958)	Yeah, not sure how that got in here.
sys/vm/vm_object.c
2317	Yes. It was written this way before since vm_page_queue() used to be synchronized with by the page lock.
sys/vm/vm_page.c
2577	Hmmm. Note that the wire count is not stable while another thread has the page busied though.
3864–3865	Yes, I think so.
sys/vm/vm_pageout.c
904	Actually it is required here since no statement follows the label.
1229–1247	I was thinking about doing that too. It'd require a separate function for each queue scan though, since the logic is somewhat different in each case. Maybe inactive and laundry can be combined.
sys/vm/vm_swapout.c
186–199	I think this whole function is strange. It effectively duplicates parts of the active queue scan, but with seemingly gratuitous differences. I think it could instead use pmap_is_referenced() to non-destructively check for references, and move the page to the appropriate queue (and unmap it) as appropriate. Then we could just use vm_page_activate() and related functions, and there would be no direct manipulation of queue state. I was planning to do that as a separate diff for head, and then rebase this diff on top.