
vm_pageout: Scale worker threads with CPUs
ClosedPublic

Authored by cem on Aug 21 2020, 7:57 PM.

Details

Summary

Autoscale vm_pageout worker threads from r364129 with CPU count. The
default is arbitrarily chosen to be 16 CPUs per worker thread, but can
be adjusted with the vm.pageout_cpus_per_thread tunable. The
vm.pageout_threads_per_domain tunable is removed to avoid confusion.

There will never be less than 1 thread per NUMA domain, and the previous
arbitrary upper limit (at most ncpus/2 threads per NUMA domain) is
preserved.

Test Plan
testvm# grep vm.pageout /boot/loader.conf
vm.pageout_cpus_per_thread="2"

testvm# sysctl vm.pageout_cpus_per_thread
vm.pageout_cpus_per_thread: 2

testvm# sysctl kern.smp.cpus
kern.smp.cpus: 4

testvm# ps auxwwwwwH|grep dom
root  18   0.0  0.0      0   64  -  DL   12:14        0:00.28 [pagedaemon/dom0]
root  18   0.0  0.0      0   64  -  DL   12:14        0:00.00 [pagedaemon/dom0 hel]

Diff Detail

Repository
rS FreeBSD src repository
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

cem requested review of this revision. Aug 21 2020, 7:57 PM
cem created this revision.
sys/vm/vm_pageout.c
2216 ↗(On Diff #76075)

This is rather arbitrary. I think you need a count of cpus in the max populated domain there.
There is no guarantee that all domains are equally sized, and Intel's NUMA-on-chip configs are not.

Hm, this actually means that it makes sense to calculate the number of threads per domain by looking at the number of CPUs (and perhaps excluding domains without populated memory?).

2227 ↗(On Diff #76075)

Why not allow user to override the heuristic ?

Thanks!

sys/vm/vm_pageout.c
2216 ↗(On Diff #76075)

Sure, it is arbitrary.

(Do you have a link to some of these Intel NUMA configs that aren't equally sized? I believe it but haven't seen it yet.)

As far as empty domains, there is already logic in vm_pageout() to avoid creating pagedaemons, including helper threads, for empty domains. But you're right in that those domain CPUs may still add pagedaemon load and the proposed scaling logic does not account for that. This is a bit of an oddball case (2990WX, as well as maybe low-end or mis-installed servers; anything else?).

So probably what we would really like is:

  1. Calculate total number of pageout threads on the basis of pageout_cpus_per_thread
  2. get_pageout_threads_per_domain takes a domain parameter.
  3. Each domain gets something like total_pageout_threads * domain_mem_size / total_mem_size (proportional fraction of threads based on domain memory fraction).

This falls down if you have *really* disproportionate ratios of CPU to memory, but such oddball configurations are unlikely to perform well in practice anyway.

Does that algorithm sound better?

2227 ↗(On Diff #76075)

Why add unnecessary complexity? If someone needs it they can add it.

With the proposal above, I am envisioning the global ratio of pageout_threads_per_domain to disappear entirely.

sys/vm/vm_pageout.c
2216 ↗(On Diff #76075)

I am not sure that load on the domain is expressed by amount of memory, and not by the count of CPUs. This is why I think the existing patch is mostly fine, except for the case of non-symmetric domains. It is CPU allocations and memory accesses that create work for the pagedaemon, not the memory itself. So if you substitute mem_size by cpu count, the proposed algorithm sounds appropriate.

The feature I referred to is in fact called Cluster on Die (CoD). It probably went extinct with the mesh interconnect in Skylake, but Haswells/Broadwells are still useful CPUs (2017). Hm, it seems that the same feature is called sub-NUMA clustering (SNC) for Skylake Xeons: https://software.intel.com/content/www/us/en/develop/articles/intel-xeon-processor-scalable-family-technical-overview.html

2227 ↗(On Diff #76075)

If you add a single tunable that allows the user to override any heuristic for pageout_threads_per_domain, I would not call that complexity at all. It should be a single line with TUNABLE_INT_FETCH.

But it allows users to experiment with settings without changing code.
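Concretely, such an override could be as small as this (a hypothetical kernel fragment, not the actual diff; the function and helper names are assumed):

```c
static int
get_pageout_threads_per_domain(void)
{
	int override = 0;

	/* One-line user override: if set, it bypasses the heuristic. */
	TUNABLE_INT_FETCH("vm.pageout_threads_per_domain", &override);
	if (override > 0)
		return (override);
	return (pageout_threads_from_cpus());	/* hypothetical heuristic */
}
```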

sys/vm/vm_pageout.c
2216 ↗(On Diff #76075)

> I am not sure that load on the domain is expressed by amount of memory, and not by the count of CPUs.

Hm, it's probably some of both. If the memory is really disproportionate, you can imagine the smaller domain is more likely to be depleted and its CPUs may generate more cross-domain allocations. In the non-depleted case, load probably tracks CPUs.

> So if you substitute mem_size by cpu count, the proposed algorithm sounds appropriate.

Will do.

> The feature I referred to is in fact called Cluster on Die (CoD).

Thanks!

2227 ↗(On Diff #76075)

For the vast majority of machines (equal CPU count in each domain) these tunables are inverses of each other; there's no point having both. The new knob can be adjusted without changing code.

For the heterogeneous CPUs per domain case, I don't think fixing the number of threads per dom makes all that much sense.

Either way, we would need to document this knob and how it overrides the other one in a non-confusing way.

cem marked an inline comment as done.

Scale domain pageout threads in proportion to domain size (in CPUs)

sys/vm/vm_pageout.c
2227 ↗(On Diff #76075)

In the existing diff, is there any easy way to specify the old behaviour of one thread per domain? To do that it looks like I'd have to reverse-engineer the heuristic to make sure it returns 1.

I think kib's suggestion is to stop making pageout_cpus_per_thread overridable, and instead make the entire heuristic overridable.

2224 ↗(On Diff #76096)

How does this interact with empty domains? Suppose I have two sockets, each with 16 CPUs, and one socket contains no memory. This kind of configuration is common in threadripper systems. Then I believe we'll only create one pagedaemon thread for the entire system. If the idea behind the heuristic is that 16 CPUs can generate enough memory pressure to keep a pagedaemon worker busy, then in this case we are discounting load from the other 16 CPUs.

sys/vm/vm_pageout.c
2227 ↗(On Diff #76075)

> In the existing diff, is there any easy way to specify the old behaviour of one thread per domain?

Sure: vm.pageout_cpus_per_thread="9999999"

I don't think removing the pageout_cpus_per_thread knob makes much sense.

2224 ↗(On Diff #76096)

The intention was to handle empty-domain systems like 2990WX correctly, but I messed up the translation from fraction-of-memory to fraction-of-cpus. The logic should subtract the CPUs belonging to empty domains from mp_ncpus for the divisor.

cem marked an inline comment as done.

Correctly discount CPUs in empty domains when allotting pagedaemon threads to
domains.

This revision is now accepted and ready to land. Aug 23 2020, 11:21 PM
This revision was automatically updated to reflect the committed changes.