Add support for multithreading the inactive queue pageout within a domain.
Closed, Public

Authored by cem on Sep 12 2019, 11:22 PM.

Details

Reviewers

alc
kib
markj
dougm
jeff

Commits

rS364129: Add support for multithreading the inactive queue pageout within a domain.

Summary

In very high-throughput workloads, the inactive scan can become overwhelmed because many cores produce pages while a single core frees them. Since Mark's introduction of batched pagequeue operations, we can now run multiple inactive threads working on independent batches.

To avoid confusing the PID controller and other control algorithms, I do this in an MPI-like fan-out and collect model that is driven from the primary page daemon. It decides whether the shortfall can be overcome with a single thread and, if not, dispatches multiple threads and waits for their results.
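To illustrate the fan-out and collect shape described above, here is a minimal userland sketch using POSIX threads. It is not the code from the patch: `dispatch_scan`, `scan_slice_run`, `struct scan_slice`, and `MAX_SCAN_THREADS` are hypothetical names, the even split of the shortage is an assumption, and the worker only pretends to free pages.

```c
#include <pthread.h>

#define	MAX_SCAN_THREADS	8	/* hypothetical cap for this sketch */

struct scan_slice {
	int	target;		/* pages this thread should try to free */
	int	freed;		/* pages it reports back as freed */
};

/* Hypothetical stand-in for scanning one independent batch. */
static void *
scan_slice_run(void *arg)
{
	struct scan_slice *s = arg;

	s->freed = s->target;	/* pretend every targeted page was freed */
	return (NULL);
}

/*
 * Split the shortage across nthreads, run slice 0 on the calling
 * (primary) thread, wait for the helpers, and return the total freed.
 */
static int
dispatch_scan(int shortage, int nthreads)
{
	struct scan_slice slice[MAX_SCAN_THREADS];
	pthread_t tid[MAX_SCAN_THREADS];
	int started[MAX_SCAN_THREADS];
	int i, total;

	if (nthreads < 1)
		nthreads = 1;
	if (nthreads > MAX_SCAN_THREADS)
		nthreads = MAX_SCAN_THREADS;

	/* Assumed policy: even split, remainder goes to the primary. */
	for (i = 0; i < nthreads; i++) {
		slice[i].target = shortage / nthreads;
		slice[i].freed = 0;
	}
	slice[0].target += shortage % nthreads;

	/* Fan out: start helpers; scan inline if a helper cannot start. */
	for (i = 1; i < nthreads; i++) {
		started[i] = pthread_create(&tid[i], NULL, scan_slice_run,
		    &slice[i]) == 0;
		if (!started[i])
			scan_slice_run(&slice[i]);
	}
	scan_slice_run(&slice[0]);

	/* Collect: wait for each helper and sum the results. */
	total = slice[0].freed;
	for (i = 1; i < nthreads; i++) {
		if (started[i])
			pthread_join(tid[i], NULL);
		total += slice[i].freed;
	}
	return (total);
}
```

Calling `dispatch_scan(shortage, 1)` takes the single-thread path with no helpers. Because only the caller sees the summed result, a control loop on the primary thread still observes one scan per pass, which is the property the summary is after.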

The heuristic is based on timing the pageout activity and maintaining an exponentially decayed pages-per-second average. This is visible in sysctl and may be interesting for other purposes.
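An exponentially decayed pages-per-second average of the kind mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the patch's arithmetic: the names `pps_estimate`, `pps_update`, and `pps_single_thread_suffices` are hypothetical, the 1/8 weight for new samples is chosen arbitrarily, and the "can one thread cover the shortage in this interval" test is only one plausible way to use the estimate.

```c
#include <stdint.h>

/* Assumed weight: each new sample contributes 1/8. Not from the patch. */
#define	PPS_DECAY_SHIFT	3

struct pps_estimate {
	uint64_t pps;		/* decayed pages-per-second average */
	int	 primed;	/* nonzero once the first sample is in */
};

/* Fold one timed scan (pages freed, elapsed microseconds) into the average. */
static void
pps_update(struct pps_estimate *e, uint64_t pages, uint64_t elapsed_us)
{
	uint64_t sample;

	if (elapsed_us == 0)
		return;		/* no usable timing for this pass */
	sample = pages * 1000000 / elapsed_us;
	if (!e->primed) {
		e->pps = sample;
		e->primed = 1;
		return;
	}
	/* avg = avg - avg/8 + sample/8, kept in unsigned arithmetic. */
	e->pps = e->pps - (e->pps >> PPS_DECAY_SHIFT) +
	    (sample >> PPS_DECAY_SHIFT);
}

/*
 * Hypothetical decision: can one thread, at the estimated rate,
 * free `shortage` pages within interval_us microseconds?
 */
static int
pps_single_thread_suffices(const struct pps_estimate *e, uint64_t shortage,
    uint64_t interval_us)
{

	return (e->pps * interval_us / 1000000 >= shortage);
}
```

A shift-based weight keeps the update to integer operations, which suits a kernel hot path; a larger shift gives a smoother but slower-reacting estimate.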

I have verified that this does indeed double our paging throughput when used with two threads. With four we tend to run into other contention problems. For now I would like to commit this infrastructure with only a single thread enabled.

Diff Detail

Repository

rS FreeBSD src repository - subversion

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

jeff created this revision. Sep 12 2019, 11:22 PM

Harbormaster completed remote builds in B26451: Diff 62014. Sep 12 2019, 11:22 PM

jeff edited the summary of this revision. Sep 12 2019, 11:27 PM

jeff added reviewers: alc, kib, markj, dougm.

markj added inline comments. Sep 13 2019, 12:03 AM

sys/vm/vm_pageout.c
1675 ↗	(On Diff #62014)	You had mentioned using the PID controller's integral term to determine whether the page daemon had fallen behind, in particular checking whether it had reached its bound. Does that heuristic not end up working well?
2177 ↗	(On Diff #62014)	style(9) prefers "for (;;)". At least, we should just use the C99 bool literals instead.
2182 ↗	(On Diff #62014)	Why not just use refcount_release()?
2328 ↗	(On Diff #62014)	Maybe "dom%d helper"?

jeff added inline comments. Sep 13 2019, 12:06 AM

sys/vm/vm_pageout.c
1675 ↗	(On Diff #62014)	This gives you a faster response. You don't have to wait for the error to accumulate. I think this is simpler.
2182 ↗	(On Diff #62014)	If you have spurious wakeups you would drop the ref below zero. This makes sure only as many threads are running as requested. This could be done with a semaphore but our implementation is ugly.

markj added inline comments. Sep 13 2019, 2:58 PM

sys/vm/vm_pageout.c
2182 ↗	(On Diff #62014)	Just to reiterate a discussion on slack: I don't think spurious wakeups are something we need to handle when the sleep is protected by a lock, as it is in this case. The sleepqueue code only wakes up threads sleeping on the wchan. All accesses to vmd_inactive_starting are protected by vmd_inactive_starting, so we do not even need to use refcount_* to manipulate it.

bdrewery added a subscriber: bdrewery. Mar 11 2020, 12:42 AM

vangyzen added a subscriber: vangyzen. Jul 31 2020, 8:24 PM

We are running into pagedaemon bottlenecking on a high-CPU count system under load; most of the time seems to be spent in inactive scanning. Are you still working on this patch, Jeff?

I'll go ahead and take this one, until/unless I hear otherwise. I've rebased the patch onto recent git master and adapted it to the blockcount(9) API. I left the default PAGEOUT_THREADS at 1, but for testing purposes I have options PAGEOUT_THREADS=4 in my amd64/conf/GENERIC.

Rebase on git master
Use blockcount(9) API rather than refcount for started/running thread counts
Minimal boot testing: an unloaded virtual machine boots the kernel OK. Haven't tried under load yet.

Harbormaster completed remote builds in B32789: Diff 75466. Aug 5 2020, 11:29 PM

cem edited subscribers, added: pho; removed: cem. Aug 5 2020, 11:44 PM

FWIW, since yesterday we've run this under some Isilon workloads and it does seem to (1) function correctly and (2) alleviate load on the pagedaemon thread(s) (which were previously a bottleneck on some workloads, running at 100% CPU constantly).

In D21629#576019, @cem wrote:

FWIW, since yesterday we've run this under some Isilon workloads and it does seem to (1) function correctly and (2) alleviate load on the pagedaemon thread(s) (which were previously a bottleneck on some workloads, running at 100% CPU constantly).

I have been testing this for ~9 hours. No problems seen.

gbe added a subscriber: gbe. Aug 7 2020, 8:27 AM

Does anyone object to this going in, or would have some time to review this if we waited a few more days?

With PAGEOUT_THREADS 1 as the default, there is little functional change. There are a handful of additional uncontested atomics and vm_pageout_inactive_dispatch wraps vm_pageout_scan_inactive, but it's largely the same code.

It might be nice to make this a tuneable instead of an option; there's no real reason the number cannot be tuned once at boot time. It could even be runtime-adjustable, although that is much more complicated.

In D21629#577169, @cem wrote:

Does anyone object to this going in, or would have some time to review this if we waited a few more days?

With PAGEOUT_THREADS 1 as the default, there is little functional change. There are a handful of additional uncontested atomics and vm_pageout_inactive_dispatch wraps vm_pageout_scan_inactive, but it's largely the same code.

It might be nice to make this a tuneable instead of an option; there's no real reason the number cannot be tuned once at boot time. It could even be runtime-adjustable, although that is much more complicated.

I have no objection. I agree that it should be a tuneable; I thought I had suggested that at some point, apparently not.

sys/vm/vm_param.h
137 ↗	(On Diff #75466)	I would add a comment indicating that it's really PAGEOUT_THREADS_PER_DOMAIN, or rename the constant accordingly. Or just make it a tunable and don't add a new kernel config option at all. Tunables can be set from the kernel config now, I believe.

Drop PAGEOUT_THREADS conf option and substitute vm.pageout_threads_per_domain tunable.

$ ps auxwwwwHS | grep dom0
root  18   0.0  0.0      0   64  -  DL   13:33        0:00.09 [pagedaemon/dom0]
root  18   0.0  0.0      0   64  -  DL   13:33        0:00.00 [pagedaemon/dom0 hel]

$ sysctl vm.pageout_threads_per_domain
vm.pageout_threads_per_domain: 2