VM page queue batching
Closed, Public

Authored by markj on Mar 29 2018, 2:47 PM.

Summary

The change is somewhat large, sorry. I will try to explain the gist of it,
and then provide a list of the things that have changed.

The aim is to reduce contention on page queue locks. Right now, both
the page and page queue locks need to be held to enqueue, requeue or
dequeue a page. Of course, this is not very scalable, and it exacerbates
page lock contention because page locks are first in the lock order.
Consider that we hold the queue lock for the entirety of a PQ_ACTIVE
scan. If a thread attempts to enqueue a page there, it will block with
a page lock held until the scan is complete.

The approach here is to separate queue operations (enqueue, dequeue
and requeue) into two phases. The first phase requires only the page lock,
and schedules a deferred queue operation using per-CPU batch queues.
There is one batch queue per CPU per page queue. Operations are encoded
using atomic flags in the page. The second phase, implemented in
vm_pqbatch_process(), processes a batch queue with the page queue lock
held and carries out the requested queue operations. The second phase is
performed only when the batch is full, so operations on a given page
may be deferred indefinitely.
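
To make the two phases concrete, here is a rough sketch of the submit path, assuming a small fixed-size per-CPU batch structure. The structure layout, the VM_BATCHQUEUE_SIZE value, the DPCPU array shape and the way the page queue is looked up are illustrative only, not quotes from the patch:

    /* Typical kernel includes for a sketch like this. */
    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/pcpu.h>
    #include <vm/vm.h>
    #include <vm/vm_page.h>
    #include <vm/vm_pagequeue.h>
    #include <vm/vm_phys.h>

    #define VM_BATCHQUEUE_SIZE      32      /* illustrative batch size */

    struct vm_batchqueue {
            vm_page_t       bq_pa[VM_BATCHQUEUE_SIZE];
            int             bq_cnt;
    };

    /* One batch queue per CPU per page queue. */
    DPCPU_DEFINE(struct vm_batchqueue, pqbatch[PQ_COUNT]);

    /*
     * Phase 1: with only the page lock held, record the operation (already
     * encoded in the page's atomic flags) by stashing the page in this
     * CPU's batch queue for the target page queue.
     */
    static void
    vm_pqbatch_submit_page(vm_page_t m, uint8_t queue)
    {
            struct vm_batchqueue *bq;
            struct vm_pagequeue *pq;

            critical_enter();
            bq = DPCPU_PTR(pqbatch[queue]);
            if (bq->bq_cnt < VM_BATCHQUEUE_SIZE) {
                    bq->bq_pa[bq->bq_cnt++] = m;
                    critical_exit();
                    return;
            }

            /*
             * Phase 2: the batch is full, so take the page queue lock once,
             * apply all of the deferred operations (vm_pqbatch_process() is
             * assumed to empty the batch), and then handle m itself.
             */
            pq = &vm_dom[vm_phys_domain(m)].vmd_pagequeues[queue];
            vm_pagequeue_lock(pq);
            vm_pqbatch_process(pq, bq);
            vm_pqbatch_process_page(pq, m);
            vm_pagequeue_unlock(pq);
            critical_exit();
    }

vm_page_enqueue() and vm_page_requeue() would then just set the appropriate atomic flags on the page and call this submit routine.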

vm_page_enqueue() and vm_page_requeue() always perform deferred
operations. Higher-level APIs (e.g., vm_page_deactivate()) thus perform
deferred queue operations as well. vm_page_dequeue() guarantees that
the page is dequeued before the function returns, and
vm_page_dequeue_deferred() performs a deferred dequeue. vm_page_dequeue()
requires both the page and page queue locks unless a deferred dequeue was
already requested for the page, in which case only the queue lock is required.
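
Roughly, a caller chooses between the two like this (the condition name is made up for illustration):

    vm_page_lock(m);
    if (page_must_be_off_queues_now)        /* hypothetical condition */
            vm_page_dequeue(m);             /* also takes the page queue lock */
    else
            vm_page_dequeue_deferred(m);    /* page lock only; records a pending op */
    vm_page_unlock(m);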

The locking protocol for the queue field of struct vm_page is changed.
The field is only allowed to transition between PQ_NONE and a queue index,
i.e., it cannot transition directly between queue indices. To update the field, the
lock for the from-value must be held. For PQ_NONE this is the page lock,
otherwise it is the corresponding page queue lock. There is one place where
we safely violate this rule for an optimization: in the inactive queue scan,
right before freeing the page. There, we set the field to PQ_NONE directly
with the page lock held. At that point, it is known that the page is physically
removed from the queue and that no queue operations are scheduled, so
the queue lock is not needed in order to complete removal of the page.
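
A hypothetical assertion helper, not part of the patch, makes the rule explicit; vm_page_pagequeue() is assumed here to map a page to the queue named by its queue field:

    static void
    vm_page_queue_change_assert(vm_page_t m)
    {
            if (m->queue == PQ_NONE)
                    vm_page_assert_locked(m);       /* PQ_NONE -> queue index */
            else                                    /* queue index -> PQ_NONE */
                    vm_pagequeue_assert_locked(vm_page_pagequeue(m));
    }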

Changes:

  • vm_phys uses the listq field for freelists rather than plinks.q. We now permit freed pages to reside on page queues. Such pages must be scheduled for a deferred dequeue. The page allocators complete the dequeue before returning the page.
  • The page daemon scan loops are substantially different. The idea now is to quickly collect a batch of pages with only the page queue lock held, and then process that batch without touching the page queue lock. This lets us get rid of some of the dancing that must occur to acquire the page and object locks with the page queue lock held.
  • When collecting a batch during the PQ_INACTIVE scan, pages in the batch are dequeued, in anticipation that most of them will be freed. For PQ_ACTIVE and PQ_LAUNDRY scans, we keep pages on the queue: during a PQ_ACTIVE scan, we end up requeuing most pages, and during a PQ_LAUNDRY scan, we keep pages queued until laundering is done.
  • The lock dancing in vm_object_terminate_pages() is gone. vm_page_free_prep() schedules a deferred dequeue for the page, so the dequeue operations are already batched. Similarly, now that we use a UMA cache for FREEPOOL_DEFAULT pages, most calls to vm_page_free() do not acquire the free queue lock.
  • The PQ_ACTIVE scan is implemented using the CLOCK algorithm. This is to avoid requeue operations during the scan; see the sketch after this list.
  • vm_page_deactivate_noreuse() uses a separate set of per-CPU batch queues to implement insertions near the head of the queue. I'm open to suggestions on other ways to implement this.
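
Here is a simplified pseudocode sketch of the CLOCK-style PQ_ACTIVE scan mentioned above: the hand keeps its place in the queue between passes, and reference information decays a per-page activity count instead of forcing a requeue. clock_hand_advance() is a made-up placeholder and the act_count arithmetic is only indicative:

    while (page_shortage > 0 && (m = clock_hand_advance(pq)) != NULL) {
            act = pmap_ts_referenced(m);    /* collect and clear reference bits */
            if (act > 0)
                    m->act_count = min(m->act_count + act + ACT_ADVANCE, ACT_MAX);
            else if (m->act_count > ACT_DECLINE)
                    m->act_count -= ACT_DECLINE;
            else {
                    vm_page_deactivate(m);  /* a deferred queue operation */
                    page_shortage--;
            }
    }
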
Test Plan

pho has done some testing of this change.

In my own testing, this largely eliminates page queue locks
as a source of contention in pgbench tests and during builds.


Event Timeline

There are a very large number of changes, so older changes are hidden.
markj marked 2 inline comments as done.
  • Minor cleanups.
  • Remove unintended diff with head.
sys/amd64/include/vmparam.h
234

I experimented with this, FYI. I didn't see much change at 15. I didn't try 63. I think this is not so large as to be a problem, though.

Linux uses 15 for their free queue and I think that is too small.

sys/vm/vm_page.c
137

What is the reason for not using just another bit in the aflags to indicate head vs tail insertion?

sys/vm/vm_pageout.c
1264–1265

All of this feels like it belongs next to batchq_process in vm_page.c

1319–1323

This looks super nice now. The refactoring is great.

sys/vm/vm_pagequeue.h
122

We should probably put a short description of the operation of the clock algorithm in comments somewhere. Perhaps just above the scan in the active queue.

214

Have you thought about any negative consequences involved in processing in the opposite order of insertion? Specifically, things will be inserted from a buf in pindex order and removed in reverse pindex order. It may be well tolerated.

sys/vm/vm_reserv.c
422 ↗(On Diff #41000)

Looks like this is an unintentional revert?

sys/vm/vnode_pager.c
773 ↗(On Diff #41000)

We should split this off into a different patch.

markj marked an inline comment as done. Apr 4 2018, 4:08 PM
markj added inline comments.
sys/vm/vm_page.c
137

It was mostly because we only have two atomic bits left.

Another solution would be to have a separate inactive pagequeue for pages that are bypassing global LRU. Because pq->pq_mtx is a pointer, both queues could be protected by the same lock, making transfers between the two queues relatively efficient. What do you think?

sys/vm/vm_pagequeue.h
214

I couldn't think of any negative consequences. Of course, it'd be straightforward to process pages in the "right" order; we'd just need an iterator.

sys/vm/vm_reserv.c
422 ↗(On Diff #41000)

Hm, not sure what happened. I'll try to fix it in the next upload.

sys/vm/vnode_pager.c
773 ↗(On Diff #41000)

Will do.

  • More fixes and cleanup
markj marked 2 inline comments as done. Apr 4 2018, 4:10 PM
markj edited the test plan for this revision. (Show Details)
markj added reviewers: alc, kib.
This revision is now accepted and ready to land. Apr 4 2018, 6:05 PM
sys/vm/vm_page.c
3265

What if we have changed CPUs after critical_exit? In this case, proceeding further to the end of the function doesn't seem optimal. I'd suggest just putting the page on the regular queue, since we already have the lock.

3269

Can you please explain the need for the second call to vm_pqbatch_process?

sys/vm/vm_page.c
3265

Why not? We now hold the page queue lock, so we might as well drain the current queue.

3269

It processes solely the page m. We're at this point because the queue is full, so we have to drain it first by calling vm_pqbatch_process().

Probably there should be a vm_pqbatch_process_page() which processes a single page, and we just call that on m.

  • Small cleanups. Add vm_pqbatch_process_page().
This revision now requires review to proceed. Apr 5 2018, 9:55 PM
markj marked an inline comment as done. Apr 5 2018, 9:55 PM
sys/vm/vm_object.c
788

This comment is now rather misleading, it seems.

sys/vm/vm_page.c
137

Why not put this into static pcpu?

2354–2355

Did you consider creating macros to detect logical enqueue conditions? Like logically non-queued == (queue != PQ_NONE && PGA_DEQUEUE == 0)?

3253–3268

Don't you need a release fence between setting the page domain and setting PGA_REQUEUE?

sys/vm/vm_page.h
364

You could use only two bits for this, instead of three. You need to encode four states.

sys/vm/vm_page.c
3191–3192

I think you need an acquire fence before reading aflags there.

sys/vm/vm_object.c
748–749

I don't think we need to clear PG_ZERO anymore.

788

Do you mean that we are now waiting for the laundry thread?

sys/vm/vm_page.c
137

Is there any downside to using the dynamic per-CPU section? I like the fact that the definition is visible only in vm_page.c.

2354–2355

Yes, I should do that. I think this is missing an acquire barrier too.

3191–3192

I am not sure about this. Right now, we are using barriers to synchronize the following operations:

T1 dequeues a page with the queue lock held: set m->queue = PQ_NONE, release barrier, clear PGA_DEQUEUE in aflags

T2 checks to see if the page is dequeued with page lock held: check for PGA_DEQUEUE in aflags, acquire barrier, check m->queue == PQ_NONE

(I don't know what kind of notation is best here.) Here we hold the queue lock, so m->queue is stable, and any thread which updated it must have implicitly issued a release barrier when it unlocked the queue. So I don't see why we need an acquire fence.
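
Roughly, in code (an illustration of the protocol above using atomic(9) fences, not an excerpt from the patch):

    bool dequeued;

    /* T1: dequeue, with the page queue lock held. */
    m->queue = PQ_NONE;
    atomic_thread_fence_rel();              /* publish the queue update ... */
    vm_page_aflag_clear(m, PGA_DEQUEUE);    /* ... before clearing the flag */

    /* T2: check for a completed dequeue, with the page lock held. */
    if ((m->aflags & PGA_DEQUEUE) == 0) {
            atomic_thread_fence_acq();      /* order the flag load before ... */
            dequeued = (m->queue == PQ_NONE);       /* ... the queue load */
    }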

3253–3268

See above. We should already be synchronized by the page lock.

sys/vm/vm_page.h
364

Hm, I think it must be more than four? A page may be enqueued or not, and in either case may have a pending requeue or dequeue, or no operation pending.

sys/vm/vm_object.c
788

I mean that pageout might not yet have handled all delayed batches. pip only means that the I/O finished, and in fact pip > 0 means 'do not terminate', not necessarily requested by the pagedaemon.

sys/vm/vm_page.c
137

It is one more arithmetic op on pointers to get to dpcpu from pcpu. But my point is that the batch queues are not dynamic; they exist for the whole duration of the system's operation. I do not insist either way.

3191–3192

Ok, thank you for the explanation.

markj marked an inline comment as done.
  • Let vm_page_free() clear PG_ZERO.
  • Add vm_page_enqueued().
markj marked 2 inline comments as done. Apr 8 2018, 7:36 PM
markj added inline comments.
sys/vm/vm_object.c
788

I think the batching is orthogonal to this. When collecting a batch of pages from the laundry queue for further processing, the pages are not dequeued, and we check for OBJ_DEAD before starting to page out. If OBJ_DEAD is set, nothing further is done with the page, same as before.

  • Implement noreuse with an extra atomic bit.

This removes the extra noreuse batch queues. Now we have only one
free atomic bit left. Because PGA_ENQUEUED is protected by the page
queue lock, we could probably implement it in the queue field and
recover an atomic bit. It would also be possible to represent the
queue requests {none, dequeue, requeue, requeue_head} using only
two bits, though it would be more expensive to do so.
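
For illustration, a two-bit encoding of those four requests might look like the following (hypothetical names and bit positions); the extra expense is that updating a multi-bit field atomically generally needs a compare-and-swap loop rather than a single atomic set or clear:

    #define PGA_QOP_MASK            0x30    /* two adjacent aflags bits */
    #define PGA_QOP_NONE            0x00
    #define PGA_QOP_DEQUEUE         0x10
    #define PGA_QOP_REQUEUE         0x20
    #define PGA_QOP_REQUEUE_HEAD    0x30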

  • Address some code duplication. Use a common subroutine, vm_pqbatch_submit_page(), to enqueue a page into a batch queue.
markj marked an inline comment as done. Apr 9 2018, 3:07 PM
sys/vm/vm_page.c
3130

Did you consider copying the pcpu pqbatch into a local array for processing and enabling preemption immediately after that? I think that on a full pqbatch, we spend enough time to adversely affect the system's responsiveness.

sys/vm/vm_page.h
96

Since you are editing this: the pool lock for the page is what we call the page lock now.

152

... (P) for the page if not enqueued.

Is this correct?

sys/vm/vm_pagequeue.h
206

Add CRITICAL_ASSERT() there?

sys/vm/vm_page.c
3130

I did, but there is a non-trivial stack space cost (256 bytes on LP64) in addition to that of the copy. The cost of processing 32 pages (8 on !amd64) ought to be quite low. I ran Jeff's dd benchmark (where parallel dd processes read from a sparse file and thus frequently enqueue pages) and used DTrace to count the number of preemptions that occurred in vm_page_enqueue(); excluding preemptions of the idle threads, less than 0.1% of preemptions occurred during the critical_exit() here. This is also an extreme case and in pretty much any workload we will not be performing queue operations at nearly that high a rate. For example, when running a 16-thread pgbench select-only benchmark on an SSD array with about 1GB/s throughput, the number of preemptions that occur here is effectively 0. Preemptions in UMA are much more common.

sys/vm/vm_page.h
96

Ok, will fix.

152

That's correct. I'll amend that sentence.

sys/vm/vm_pagequeue.h
206

Not all batch queues are per-CPU. The page daemon uses them to collect pages, and in that case the assertion would be invalid.

kib added inline comments.
sys/vm/vm_pagequeue.h
206

Please explain this as a comment as well.

This revision is now accepted and ready to land. Apr 12 2018, 5:45 PM
  • Amend a comment slightly.
  • Remove bogus assertion.
  • Stop the laundry scan once the target is reached.
  • Drain per-CPU batch queues from the swapout daemon.
  • Follow a name convention used elsewhere in batch processing.
This revision now requires review to proceed. Apr 13 2018, 6:54 PM
markj added inline comments.
sys/vm/vm_page.h
96

Err, sorry, do you mean that this should be updated to "page lock"?

The vm_page_drain_pgbatch() looks good, thank you.

sys/vm/vm_page.h
96

Most likely yes, but also that the page lock is actually pooled.

  • Clarify a couple of comments in vm_page.h.
  • Rebase.
  • Adjust the PQ_ACTIVE scan target after the clock hands meet.
This revision was not accepted when it landed; it landed in state Needs Review. Apr 24 2018, 9:16 PM
This revision was automatically updated to reflect the committed changes.