Page MenuHomeFreeBSD

VM page queue batching
ClosedPublic

Authored by markj on Mar 29 2018, 2:47 PM.
Tags
None
Referenced Files
F133579886: D14893.id41728.diff
Sun, Oct 26, 7:21 PM
F133504028: D14893.id41448.diff
Sun, Oct 26, 6:36 AM
F133481199: D14893.id41088.diff
Sun, Oct 26, 2:54 AM
Unknown Object (File)
Sat, Oct 25, 7:41 AM
Unknown Object (File)
Sat, Oct 25, 7:39 AM
Unknown Object (File)
Thu, Oct 23, 5:53 AM
Unknown Object (File)
Wed, Oct 22, 3:12 PM
Unknown Object (File)
Wed, Oct 22, 7:17 AM
Subscribers

Details

Summary

The change is somewhat large, sorry. I will try to explain the gist of it,
and the provide a list of things which have changed.

The aim is to reduce contention on page queue locks. Right now, both
the page and page queue locks need to be held to enqueue, requeue or
dequeue a page. Of course, this is not very scalable, and it exacerbates
page lock contention because page locks are first in the lock order.
Consider that we hold the queue lock for the entirety of a PQ_ACTIVE
scan. If a thread attempts to enqueue a page there, it will block with
a page lock held until the scan is complete.

The approach here is to separate queue operations (enqueue, dequeue
and requeue) into two phases. The first phase requires only the page lock,
and schedules a deferred queue operation using per-CPU batch queues.
There is one batch queue per CPU per page queue. Operations are encoded
using atomic flags in the page. The second phase, implemented in
vm_pqbatch_process(), processes a batch queue with the page queue lock
held and carries out the requested queue operations. The second phase is
performed only when the batch is full, so operations on a given page
may be deferred indefinitely.

vm_page_enqueue() and vm_page_requeue() always perform deferred
operations. Higher-level APIs (e.g., vm_page_deactivate()) thus perform
deferred queue operations as well. vm_page_dequeue() guarantees that
the page is dequeued before the function returns, and
vm_page_dequeue_deferred() performs a deferred dequeue. vm_page_dequeue()
requires both the page and page queue locks unless a deferred dequeue was
already requested for the page, in which case only the queue lock is required.

The locking protocol for the queue field of struct vm_page is changed.
The field is only allowed to transition between PQ_NONE and a queue index,
i.e., it cannot transition directly between queue indices. To update the field, the
lock for the from-value must be held. For PQ_NONE this is the page lock,
otherwise it is the corresponding page queue lock. There is one place where
we safely violate this rule for an optimization: in the inactive queue scan,
right before freeing the page. There, we set the field to PQ_NONE directly
with the page lock held. At that point, it is known that the page is physically
removed from the queue and that no queue operations are scheduled, so
the queue lock is not needed in order to complete removal of the page.

Changes:

  • vm_phys uses the listq field for freelists rather than plinks.q. We now permit freed pages to reside on page queues. Such pages must be schedule for a deferred dequeue. The page allocators complete the dequeue before returning the page.
  • The page daemon scan loops are substantially different. The idea now is to quickly collect a batch of pages with only the page queue lock held, and then process that batch without touching the page queue lock. This lets us get rid of some of the dancing that must occur to acquire the page and object locks with the page queue lock held.
  • When collecting a batch during the PQ_INACTIVE scan, pages in the batch are dequeued, in the anticipation that most of them will be freed. For PQ_ACTIVE and PQ_LAUNDRY scans, we keep pages on the queue: during a PQ_ACTIVE scan, we end up requeuing most pages, and during a PQ_LAUNDRY scan, we keep pages queued until laundering is done.
  • The lock dancing in vm_object_terminate_pages() is gone. vm_page_free_prep() schedules a deferred dequeue for the page, so the dequeue operations are already batched. Similarly, now that we use a UMA cache for FREEPOOL_DEFAULT pages, most calls to vm_page_free() do not acquire the free queue lock.
  • The PQ_ACTIVE scan is implemented using the CLOCK algorithm. This is to avoid requeue operations during the scan.
  • vm_page_deactivate_noreuse() uses a separate set of per-CPU batch queues to implement insertions near the head of the queue. I'm open to suggestions on other ways to implement this.
Test Plan

pho has done some testing of this change.

In my own testing, this largely eliminates page queue locks
as a source of contention in pgbench tests and during builds.

Diff Detail

Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 16001
Build 15983: arc lint + arc unit

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
markj marked 2 inline comments as done.
  • Minor cleanups.
  • Remove unintended diff with head.
sys/amd64/include/vmparam.h
234

I experimented with this fyi. I didn't see much change at 15. I didn't try 63. I think this is not so large as to be a problem though.

Linux uses 15 for their free queue and I think that is too small.

sys/vm/vm_page.c
137

What is the reason for not using just another bit in the aflags to indicate head vs tail insertion?

sys/vm/vm_pageout.c
1264–1265

All of this feels like it belongs next to batchq_process in vm_page.c

1305–1309

this looks super nice now. The refactoring is great.

sys/vm/vm_pagequeue.h
121

We should probably put a short description of the operation of the clock algorithm in comments somewhere. Perhaps just above the scan in the active queue.

213

Have you thought about any negative consequences involved in processing in the opposite order of insertion? Specifically, things will be inserted from a buf in pindex order and removed in reverse pindex order. It may be well tolerated.

sys/vm/vm_reserv.c
422 ↗(On Diff #41000)

Looks like this is an unintentional revert?

sys/vm/vnode_pager.c
773 ↗(On Diff #41000)

We should split this off into a different patch.

markj marked an inline comment as done.Apr 4 2018, 4:08 PM
markj added inline comments.
sys/vm/vm_page.c
137

It was mostly because we only have two atomic bits left.

Another solution would be to have a separate inactive pagequeue for pages that are bypassing global LRU. Because pq->pq_mtx is a pointer, both queues could be protected by the same lock, making transfers between the two queues relatively efficient. What do you think?

sys/vm/vm_pagequeue.h
213

I couldn't think of any negative consequences. Of course, it'd be straightforward to process pages in the "right" order, we'd just an iterator.

sys/vm/vm_reserv.c
422 ↗(On Diff #41000)

Hm, not sure what happened. I'll try to fix it in the next upload.

sys/vm/vnode_pager.c
773 ↗(On Diff #41000)

Will do.

  • More fixes and cleanup
markj marked 2 inline comments as done.Apr 4 2018, 4:10 PM
markj edited the test plan for this revision. (Show Details)
markj added reviewers: alc, kib.
This revision is now accepted and ready to land.Apr 4 2018, 6:05 PM
sys/vm/vm_page.c
3279

What if after critical_exit we have changed the CPU? In this case proceeding further to the end of the function doesn't seem optimal. I'd suggest just to put a page on regular queue, since we already got lock.

3283

Can you please explain the need for the second call to vm_pqbatch_process?

sys/vm/vm_page.c
3279

Why not? We now hold the page queue lock, so we might as well drain the current queue.

3283

It processes solely the page m. We're at this point because the queue is full, so we have to drain it first by calling vm_pqbatch_process().

Probably there should be a vm_pqbatch_process_page() which processes a single page, and we just call that on m.

  • Small cleanups. Add vm_pqbatch_process_page().
This revision now requires review to proceed.Apr 5 2018, 9:55 PM
markj marked an inline comment as done.Apr 5 2018, 9:55 PM
sys/vm/vm_object.c
787

This comment is now rather misleading, it seems.

sys/vm/vm_page.c
137

Why not put this into static pcpu ?

2358

Did you considered creating macros to detect logical enqueue conditions ? Like logically non-queued == (queue != PG_NONE && PGA_DEQUEUE == 0) ?

3267

Don't you need the release fence between setting page domain and setting PGA_REQUEUE ?

sys/vm/vm_page.h
364

You could use only two bits for this, instead of three. You need to encode four states.

sys/vm/vm_page.c
3198

I think you need acquire fence before reading aflags there.

sys/vm/vm_object.c
747

I don't think we need to clear PG_ZERO anymore.

787

Do you mean that we are now waiting for the laundry thread?

sys/vm/vm_page.c
137

Is there any downside to using the dynamic per-CPU section? I like the fact that the definition is visible only in vm_page.c.

2358

Yes, I should do that. I think this is missing an acquire barrier too.

3198

I am not sure about this. Right now, we are using barriers to synchronize the following operations:

T1 dequeues a page with the queue lock held: set m->queue = PQ_NONE, release barrier, clear PGA_DEQUEUE in aflags

T2 checks to see if the page is dequeued with page lock held: check for PGA_DEQUEUE in aflags, acquire barrier, check m->queue == PQ_NONE

(I don't know what kind of notation is best here.) Here we hold the queue lock, so m->queue is stable, and any thread which updated it must have implicitly issued a release barrier when it unlocked the queue. So I don't see why we need an acquire fence.

3267

See above. We should already be synchronized by the page lock.

sys/vm/vm_page.h
364

Hm, I think it must be more than four? A page may be enqueued or not, and in either case may have a pending requeue or dequeue, or no operation pending.

sys/vm/vm_object.c
787

I mean that pageout might not yet handled all delayed batches. pip only means that io finished, and in fact pip > 0 means 'do not terminate', not necessary requested by pagedaemon.

sys/vm/vm_page.c
137

It is one more arithmetic op on pointers to get to dpcu from pcpu. But my point is that the batch queues are not dynamic, they exists for the whole duration of the system operation. I do not insist either way.

3198

Ok, thank you for the explanation.

markj marked an inline comment as done.
  • Let vm_page_free() clear PG_ZERO.
  • Add vm_page_enqueued().
markj marked 2 inline comments as done.Apr 8 2018, 7:36 PM
markj added inline comments.
sys/vm/vm_object.c
787

I think the batching is orthogonal to this. When collecting a batch of pages from the laundry queue for further processing, the pages are not dequeued, and we check for OBJ_DEAD before starting to page out. If OBJ_DEAD is set, nothing further is done with the page, same as before.

  • Implement noreuse with an extra atomic bit.

This removes the extra noreuse batch queues. Now we have only one
free atomic bit left. Because PGA_ENQUEUED is protected by the page
queue lock, we could probably implement it in the queue field and
recover an atomic bit. It would also be possible to represent the
queue requests {none, dequeue, requeue, requeue_head} using only
two bits, though it would be more expensive to do so.

  • Address some code duplication. Use a common subroutine, vm_pqbatch_submit_page(), to enqueue a page into a batch queue.
markj marked an inline comment as done.Apr 9 2018, 3:07 PM
sys/vm/vm_page.c
3134

Did you considered copying the pcpu pqbatch into a local array for processing and enabling preemption immediately after that ? I think that on full pqbatch, we spend a time which is enough to adversely affect the system responsivness.

sys/vm/vm_page.h
96

Since you are editing this, pool lock for the page is what we call page lock now.

152

... (P) for the page if not enqueued.

Is this correct ?

sys/vm/vm_pagequeue.h
205

Add CRITICAL_ASSERT() there ?

sys/vm/vm_page.c
3134

I did, but there is a non-trivial stack space cost (256 bytes on LP64) in addition to that of the copy. The cost of processing 32 pages (8 on !amd64) ought to be quite low. I ran Jeff's dd benchmark (where parallel dd processes read from a sparse file and thus frequently enqueue pages) and used DTrace to count the number of preemptions that occurred in vm_page_enqueue(); excluding preemptions of the idle threads, less than 0.1% of preemptions occurred during the critical_exit() here. This is also an extreme case and in pretty much any workload we will not be performing queue operations at nearly that high a rate. For example, when running a 16-thread pgbench select-only benchmark on an SSD array with about 1GB/s throughput, the number of preemptions that occur here is effectively 0. Preemptions in UMA are much more common.

sys/vm/vm_page.h
96

Ok, will fix.

152

That's correct. I'll amend that sentence.

sys/vm/vm_pagequeue.h
205

Not all batch queues are per-CPU. The page daemon uses them to collect pages, and in that case the assertion would be invalid.

kib added inline comments.
sys/vm/vm_pagequeue.h
205

Please explain this as a comment as well.

This revision is now accepted and ready to land.Apr 12 2018, 5:45 PM
  • Amend a comment slightly.
  • Remove bogus assertion.
  • Stop the laundry scan once the target is reached.
  • Drain per-CPU batch queues from the swapout daemon.
  • Follow a name convention used elsewhere in batch processing.
This revision now requires review to proceed.Apr 13 2018, 6:54 PM
markj added inline comments.
sys/vm/vm_page.h
96

Err, sorry, do you mean that this should be updated to "page lock"?

The vm_page_drain_pgbatch() looks good, thank you.

sys/vm/vm_page.h
96

Most likely yes, but also that the page lock is actually pooled.

  • Clarify a couple of comments in vm_page.h.
  • Rebase.
  • Adjust the PQ_ACTIVE scan target after the clock hands meet.
This revision was not accepted when it landed; it landed in state Needs Review.Apr 24 2018, 9:16 PM
This revision was automatically updated to reflect the committed changes.