sys/vm/vm_pageout.c
[... 160 lines elided ...]
 SYSCTL_INT(_vm, OID_AUTO, panic_on_oom,
     CTLFLAG_RWTUN, &vm_panic_on_oom, 0,
     "Panic on the given number of out-of-memory errors instead of killing the largest process");
 SYSCTL_INT(_vm, OID_AUTO, pageout_update_period,
     CTLFLAG_RWTUN, &vm_pageout_update_period, 0,
     "Maximum active LRU update period");
 /* Access with get_pageout_threads_per_domain(). */
-static int pageout_threads_per_domain = 1;
-SYSCTL_INT(_vm, OID_AUTO, pageout_threads_per_domain, CTLFLAG_RDTUN,
-    &pageout_threads_per_domain, 0,
-    "Number of worker threads comprising each per-domain pagedaemon");
+static int pageout_threads_per_domain;
+static int pageout_cpus_per_thread = 16;
+SYSCTL_INT(_vm, OID_AUTO, pageout_cpus_per_thread, CTLFLAG_RDTUN,
+    &pageout_cpus_per_thread, 0,
+    "Number of CPUs per pagedaemon worker thread");
 SYSCTL_INT(_vm, OID_AUTO, lowmem_period, CTLFLAG_RWTUN, &lowmem_period, 0,
     "Low memory callback period");
 SYSCTL_INT(_vm, OID_AUTO, disable_swapspace_pageouts,
     CTLFLAG_RWTUN, &disable_swap_pageouts, 0, "Disallow swapout of dirty pages");
 static int pageout_lock_miss;
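
Since the new knob is CTLFLAG_RDTUN, it is read-only through sysctl(8) at runtime and can only be set as a boot-time tunable. A hypothetical /boot/loader.conf entry (the value 8 is illustrative, not a recommendation from this review):

    vm.pageout_cpus_per_thread="8"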
[... 2,017 lines elided; the fold ends inside "for (;;) {" ...]
 		blockcount_release(&vmd->vmd_inactive_running, 1);
 	}
 }
 static int
 get_pageout_threads_per_domain(void)
 {
 	static bool resolved = false;
-	int half_cpus_per_dom;
+	int cpus_per_dom;

 	/*
 	 * This is serialized externally by the sorted autoconfig portion of
 	 * boot.
 	 */
-	if (__predict_true(resolved))
+	if (resolved)
 		return (pageout_threads_per_domain);
+
+	cpus_per_dom = howmany(mp_ncpus, vm_ndomains);
kib:
This is rather arbitrary. I think you need a count of the CPUs in the most populated domain there. There is no guarantee that all domains are equally sized, and Intel's on-chip NUMA configurations are not. Hm, this actually means that it makes sense to calculate the number of pagedaemon threads per domain by looking at the number of CPUs (and perhaps excluding domains without populated memory?).

cem (author):
Sure, it is arbitrary. (Do you have a link to some of these Intel NUMA configs that aren't equally sized? I believe it but haven't seen one yet.)

As far as empty domains go, there is already logic in vm_pageout() to avoid creating pagedaemons, including helper threads, for empty domains. But you're right that the CPUs in those domains may still add pagedaemon load, and the proposed scaling logic does not account for that. This is a bit of an oddball case (2990WX, as well as maybe low-end or mis-installed servers; anything else?). So probably what we would really like is: […]

This falls down if you have *really* disproportionate ratios of CPU to memory, but such oddball configurations are unlikely to perform well in practice anyway. Does that algorithm sound better?

kib:
I am not sure that load on a domain is expressed by the amount of memory rather than by the count of CPUs; this is why I think the existing patch is mostly fine, except for the case of non-symmetric domains. It is CPU allocations and memory accesses which create work for the pagedaemon, not the memory itself. So if you substitute mem_size with a CPU count, the proposed algorithm sounds appropriate.

The feature I referred to is in fact called Cluster on Die (CoD). It probably went extinct with the mesh interconnect in Skylake, but Haswells/Broadwells are still useful CPUs (2017). Hm, it seems that the same feature is called sub-NUMA clustering (SNC) on Skylake Xeons: https://software.intel.com/content/www/us/en/develop/articles/intel-xeon-processor-scalable-family-technical-overview.html

cem (author):
> I am not sure that load on the domain is expressed by amount of memory, and not by the count of CPUs.

Hm, it's probably some of both. If the memory is really disproportionate, you can imagine that the smaller domain is more likely to be depleted and its CPUs may generate more cross-domain allocations. In the non-depleted case, load probably tracks CPUs.

Will do.

Thanks!
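
To make the asymmetric-domain concern concrete (the machine shape is illustrative, not taken from the review): on a two-domain system with 24 CPUs in domain 0 and 8 CPUs in domain 1, the patch as posted computes cpus_per_dom = howmany(32, 2) = 16 and gives every domain howmany(16, 16) = 1 worker; counting each domain's own CPUs instead, as kib suggests, would give domain 0 howmany(24, 16) = 2 workers while domain 1 keeps 1.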
 	/*
 	 * Semi-arbitrarily constrain pagedaemon threads to less than half the
-	 * total number of threads in the system as an insane upper limit.
+	 * total number of CPUs in the system as an upper limit.
 	 */
-	half_cpus_per_dom = howmany(mp_ncpus / vm_ndomains, 2);
+	if (pageout_cpus_per_thread < 2)
+		pageout_cpus_per_thread = 2;
+	else if (pageout_cpus_per_thread > cpus_per_dom)
+		pageout_cpus_per_thread = cpus_per_dom;

-	if (pageout_threads_per_domain < 1) {
+	pageout_threads_per_domain = howmany(cpus_per_dom,
kib:
Why not allow the user to override the heuristic?

cem (author):
Why add unnecessary complexity? If someone needs it they can add it. With the proposal above, I am envisioning the global ratio of pageout_threads_per_domain disappearing entirely.

kib:
If you add a single tunable that allows the user to override any heuristic for pageout_threads_per_domain, I would not call it complexity at all. It should be a single line for TUNABLE_INT_FETCH. But it allows users to experiment with settings without changing code.

cem (author):
For the vast majority of machines (equal CPU count in each domain) these tunables are inverses of each other; there's no point in having both. The new knob can be adjusted without changing code. For the heterogeneous CPUs-per-domain case, I don't think fixing the number of threads per domain makes all that much sense. Either way, we would need to document this knob and how it overrides the other one in a non-confusing way.

markj:
In the existing diff, is there any easy way to specify the old behaviour of one thread per domain? To do that it looks like I'd have to reverse-engineer the heuristic to make sure it returns 1. I think kib's suggestion is to stop making pageout_cpus_per_thread overridable, and instead make the entire heuristic overridable.

cem (author):
> In the existing diff, is there any easy way to specify the old behaviour of one thread per domain?

Sure: vm.pageout_cpus_per_thread="9999999". I don't think removing the pageout_cpus_per_thread knob makes much sense.
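
As a concrete reading of cem's answer: the clamp above reduces any oversized value to cpus_per_dom, so howmany(cpus_per_dom, cpus_per_dom) evaluates to 1 and the pre-patch behaviour of one worker per domain is restored. For example, in /boot/loader.conf:

    vm.pageout_cpus_per_thread="9999999"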
printf("Invalid tuneable vm.pageout_threads_per_domain value: " | pageout_cpus_per_thread); | ||||
"%d out of valid range: [1-%d]; clamping to 1\n", | |||||
pageout_threads_per_domain, half_cpus_per_dom); | |||||
pageout_threads_per_domain = 1; | |||||
} else if (pageout_threads_per_domain > half_cpus_per_dom) { | |||||
printf("Invalid tuneable vm.pageout_threads_per_domain value: " | |||||
"%d out of valid range: [1-%d]; clamping to %d\n", | |||||
pageout_threads_per_domain, half_cpus_per_dom, | |||||
half_cpus_per_dom); | |||||
pageout_threads_per_domain = half_cpus_per_dom; | |||||
} | |||||
 	resolved = true;
 	return (pageout_threads_per_domain);
 }
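
For reference, a hypothetical userspace harness (not part of this revision; howmany() is redefined locally to match sys/param.h) that mirrors the heuristic above and prints the worker count for a few machine shapes:

    #include <stdio.h>

    /* Same rounding-up division as howmany() in sys/param.h. */
    #define howmany(x, y)	(((x) + ((y) - 1)) / (y))

    static int
    threads_per_domain(int ncpus, int ndomains, int cpus_per_thread)
    {
    	int cpus_per_dom;

    	cpus_per_dom = howmany(ncpus, ndomains);
    	/* The same clamp as get_pageout_threads_per_domain(). */
    	if (cpus_per_thread < 2)
    		cpus_per_thread = 2;
    	else if (cpus_per_thread > cpus_per_dom)
    		cpus_per_thread = cpus_per_dom;
    	return (howmany(cpus_per_dom, cpus_per_thread));
    }

    int
    main(void)
    {
    	printf("%d\n", threads_per_domain(8, 1, 16));	/* 1: small UMA box */
    	printf("%d\n", threads_per_domain(64, 2, 16));	/* 2: 32 CPUs/domain */
    	printf("%d\n", threads_per_domain(256, 4, 16));	/* 4: 64 CPUs/domain */
    	return (0);
    }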
 /*
  * Initialize basic pageout daemon settings.  See the comment above the
  * definition of vm_domain for some explanation of how these thresholds are
  * used.
  */
 static void
 vm_pageout_init_domain(int domain)
markj:
How does this interact with empty domains? Suppose I have two sockets, each with 16 CPUs, and one socket contains no memory. This kind of configuration is common in Threadripper systems. Then I believe we'll only create one pagedaemon thread for the entire system. If the idea behind the heuristic is that 16 CPUs can generate enough memory pressure to keep a pagedaemon worker busy, then in this case we are discounting load from the other 16 CPUs.

cem (author):
The intention was to handle empty-domain systems like the 2990WX correctly, but I messed up the translation from fraction-of-memory to fraction-of-CPUs. The logic should subtract the CPUs belonging to empty domains from mp_ncpus for the divisor.
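
A minimal sketch of the fix cem describes, assuming the kernel's cpuset_domain[] per-domain CPU masks and the VM_DOMAIN_EMPTY() macro (identifiers assumed for illustration; this is not code from the revision):

    static int
    pageout_eligible_cpus(void)
    {
    	int i, eligible_cpus;

    	/*
    	 * Discount CPUs in memoryless domains: they still generate
    	 * pagedaemon load, but no pagedaemon is created for their
    	 * domain (sketch only; see the assumptions above).
    	 */
    	eligible_cpus = mp_ncpus;
    	for (i = 0; i < vm_ndomains; i++)
    		if (VM_DOMAIN_EMPTY(i))
    			eligible_cpus -= CPU_COUNT(&cpuset_domain[i]);
    	return (eligible_cpus);
    }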
 {
 	struct vm_domain *vmd;
 	struct sysctl_oid *oid;

 	vmd = VM_DOMAIN(domain);
 	vmd->vmd_interrupt_free_min = 2;

 	/*
[... 155 lines elided ...]