
vm_pageout: Scale worker threads with CPUs
ClosedPublic

Authored by cem on Aug 21 2020, 7:57 PM.

Details

Summary

Autoscale vm_pageout worker threads from r364129 with CPU count. The
default is arbitrarily chosen to be 16 CPUs per worker thread, but can
be adjusted with the vm.pageout_cpus_per_thread tunable. The
vm.pageout_threads_per_domain tunable is removed to avoid confusion.

There will never be less than 1 thread per NUMA domain, and the previous
arbitrary upper limit (at most ncpus/2 threads per NUMA domain) is
preserved.

Test Plan
testvm# grep vm.pageout /boot/loader.conf
vm.pageout_cpus_per_thread="2"

testvm# sysctl vm.pageout_cpus_per_thread
vm.pageout_cpus_per_thread: 2

testvm# sysctl kern.smp.cpus
kern.smp.cpus: 4

testvm# ps auxwwwwwH|grep dom
root  18   0.0  0.0      0   64  -  DL   12:14        0:00.28 [pagedaemon/dom0]
root  18   0.0  0.0      0   64  -  DL   12:14        0:00.00 [pagedaemon/dom0 hel]

Diff Detail

Repository
rS FreeBSD src repository
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

cem requested review of this revision. Aug 21 2020, 7:57 PM
cem created this revision.
sys/vm/vm_pageout.c
2216 ↗(On Diff #76075)

This is rather arbitrary. I think you need a count of cpus in the max populated domain there.
There is no guarantee that all domains are equally sized, and Intel's NUMA-on-chip configs are not.

Hm, this actually means that it makes sense to calculate the number of threads per domain by looking at the number of CPUs (and perhaps excluding domains without populated memory?).

2227 ↗(On Diff #76075)

Why not allow user to override the heuristic ?

Thanks!

sys/vm/vm_pageout.c
2216 ↗(On Diff #76075)

Sure, it is arbitrary.

(Do you have a link to some of these Intel NUMA configs that aren't equally sized? I believe it but haven't seen it yet.)

As far as empty domains, there is already logic in vm_pageout() to avoid creating pagedaemons, including helper threads, for empty domains. But you're right in that those domain CPUs may still add pagedaemon load and the proposed scaling logic does not account for that. This is a bit of an oddball case (2990WX, as well as maybe low-end or mis-installed servers; anything else?).

So probably what we would really like is:

  1. Calculate total number of pageout threads on the basis of pageout_cpus_per_thread
  2. get_pageout_threads_per_domain takes a domain parameter.
  3. Each domain gets something like total_pageout_threads * domain_mem_size / total_mem_size (proportional fraction of threads based on domain memory fraction).

This falls down if you have *really* disproportionate ratios of CPU to memory, but such oddball configurations are unlikely to perform well in practice anyway.

Does that algorithm sound better?

2227 ↗(On Diff #76075)

Why add unnecessary complexity? If someone needs it they can add it.

With the proposal above, I am envisioning the global ratio of pageout_threads_per_domain to disappear entirely.

sys/vm/vm_pageout.c
2216 ↗(On Diff #76075)

I am not sure that load on the domain is expressed by amount of memory, and not by the count of CPUs. This is why I think the existing patch is mostly fine, except for the case of non-symmetric domains. It is CPU allocations and memory accesses that create work for the pagedaemon, not the memory itself. So if you substitute mem_size by cpu count, the proposed algorithm sounds appropriate.

The feature I referred to is in fact called Cluster on Die (CoD). It probably went extinct with the mesh interconnect in Skylake, but Haswells/Broadwells are still useful CPUs (2017). Hm, it seems that the same feature is called sub-NUMA clustering (SNC) for Skylake Xeons: https://software.intel.com/content/www/us/en/develop/articles/intel-xeon-processor-scalable-family-technical-overview.html

2227 ↗(On Diff #76075)

If you add a single tunable that allows the user to override any heuristic for pageout_threads_per_domain, I would not call that complexity at all. It should be a single line with TUNABLE_INT_FETCH.

But it allows users to experiment with settings without changing code.
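Concretely, such an override could be as small as this (a hypothetical kernel fragment, not the actual diff; the function and helper names are assumed):

```c
static int
get_pageout_threads_per_domain(void)
{
	int override = 0;

	/* One-line user override: if set, it bypasses the heuristic. */
	TUNABLE_INT_FETCH("vm.pageout_threads_per_domain", &override);
	if (override > 0)
		return (override);
	return (pageout_threads_from_cpus());	/* hypothetical heuristic */
}
```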

sys/vm/vm_pageout.c
2216 ↗(On Diff #76075)

> I am not sure that load on the domain is expressed by amount of memory, and not by the count of CPUs.

Hm, it's probably some of both. If the memory is really disproportionate, you can imagine the smaller domain is more likely to be depleted and its CPUs may generate more cross-domain allocations. In the non-depleted case, load probably tracks CPUs.

> So if you substitute mem_size by cpu count, the proposed algorithm sounds appropriate.

Will do.

> The feature I referred to is in fact called Cluster on Die (CoD).

Thanks!

2227 ↗(On Diff #76075)

For the vast majority of machines (equal CPU count in each domain) these tunables are inverses of each other; there's no point having both. The new knob can be adjusted without changing code.

For the heterogeneous CPUs per domain case, I don't think fixing the number of threads per dom makes all that much sense.

Either way, we would need to document this knob and how it overrides the other one in a non-confusing way.

cem marked an inline comment as done.

Scale domain pageout threads in proportion to domain size (in CPUs)

sys/vm/vm_pageout.c
2227 ↗(On Diff #76075)

In the existing diff, is there any easy way to specify the old behaviour of one thread per domain? To do that it looks like I'd have to reverse-engineer the heuristic to make sure it returns 1.

I think kib's suggestion is to stop making pageout_cpus_per_thread overridable, and instead make the entire heuristic overridable.

2224 ↗(On Diff #76096)

How does this interact with empty domains? Suppose I have two sockets, each with 16 CPUs, and one socket contains no memory. This kind of configuration is common in threadripper systems. Then I believe we'll only create one pagedaemon thread for the entire system. If the idea behind the heuristic is that 16 CPUs can generate enough memory pressure to keep a pagedaemon worker busy, then in this case we are discounting load from the other 16 CPUs.

sys/vm/vm_pageout.c
2227 ↗(On Diff #76075)

> In the existing diff, is there any easy way to specify the old behaviour of one thread per domain?

Sure: vm.pageout_cpus_per_thread="9999999"

I don't think removing the pageout_cpus_per_thread knob makes much sense.

2224 ↗(On Diff #76096)

The intention was to handle empty-domain systems like 2990WX correctly, but I messed up the translation from fraction-of-memory to fraction-of-cpus. The logic should subtract the CPUs belonging to empty domains from mp_ncpus for the divisor.

cem marked an inline comment as done.

Correctly discount CPUs in empty domains when allotting pagedaemon threads to
domains.

This revision is now accepted and ready to land. Aug 23 2020, 11:21 PM
This revision was automatically updated to reflect the committed changes.