Support for "fake" NUMA domains for scaling the page queues.
Needs Review · Public

Authored by gallatin on Jan 23 2017, 8:26 PM.
Details

Summary

This commit provides a tunable (vm.fake_doms) which can be
used to increase the number of NUMA domains. The purpose
is to break up the vm page queues to reduce lock contention,
particularly on the inactive page queue. For Netflix-style
read-mostly workloads, this provides a substantial speedup.

Efforts have been made to properly track the backing NUMA
domain for each fake domain, so that affinity information
is preserved.
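
(For orientation, here is a minimal sketch of how a loader tunable of this sort might be consumed at boot. The function name, placement, and clamping policy are illustrative only and are not taken from the diff; only vm.fake_doms, vm_ndomains, MAXMEMDOM, and TUNABLE_INT_FETCH() are existing names.)

        #include <sys/param.h>
        #include <sys/kernel.h>
        #include <vm/vm.h>
        #include <vm/vm_phys.h>

        static int fake_doms = 0;

        /*
         * Hypothetical early-boot hook: read the vm.fake_doms loader tunable
         * and, if it asks for more domains than the hardware reports, scale
         * vm_ndomains so the page queues are split further.  Clamping to
         * MAXMEMDOM keeps the per-domain arrays in bounds.
         */
        static void
        vm_fake_doms_fetch(void)
        {

                TUNABLE_INT_FETCH("vm.fake_doms", &fake_doms);
                if (fake_doms <= vm_ndomains)
                        return;
                if (fake_doms > MAXMEMDOM)
                        fake_doms = MAXMEMDOM;
                vm_ndomains = fake_doms;
        }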

Diff Detail

Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 7493
Build 7654: arc lint + arc unit

Event Timeline

gallatin retitled this revision from to Support for "fake" NUMA domains for scaling the page queues..
gallatin updated this object.
gallatin edited the test plan for this revision. (Show Details)
gallatin added reviewers: kib, jhb, markj, scottl, glebius.

I believe that on x86 it is very useful to keep the whole lowest 4G in the same domain.

sys/vm/vm_page.c
452

I did not find anything that would prevent running past the end of the array if it is misconfigured.

614

Use ANSI C definition.

634

Style: space before '('.

658

ANSI C definition.

gallatin edited edge metadata.
  • Address Kib's feedback

Hi,

Thanks for the feedback

I'd prefer, if possible, not to treat domain 0 specially to ensure that at least 4GB resides there. I would rather this only be used on systems with large amounts of RAM, to keep things (almost) simple.

sys/vm/vm_page.c
452

Thanks for catching this. I added an error condition if this overrun is about to happen.

sys/vm/vm_page.c
436

Style: you do not need () there.

448

Style: no need for () there or on the previous line.

497

A more stylish expression for the function would be:

        /* Comparator body: returns -1, 0, or 1 as *p1 is <, ==, or > *p2. */
        vm_paddr_t left, right;

        left = *(const vm_paddr_t *)p1;
        right = *(const vm_paddr_t *)p2;
        return ((left > right) - (left < right));
518

Could you please explain how 'no more than 4 cpus' is related to the condition checked?

628

Style: one blank line is enough.

635

Would CPU_FOREACH() be useful for this loop?
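
(For context, CPU_FOREACH() comes from sys/smp.h and iterates only over CPU IDs that are actually present, skipping absent IDs that a plain counter loop up to mp_maxid would visit. Roughly, with an illustrative per-domain counter as the example use:)

        /* Approximate definition from sys/smp.h: */
        #define CPU_FOREACH(i)                                          \
                for ((i) = 0; (i) <= mp_maxid; (i)++)                   \
                        if (!CPU_ABSENT((i)))

        /* Example use (cpus_in_dom[] is illustrative only): */
        CPU_FOREACH(cpu)
                cpus_in_dom[pcpu_find(cpu)->pc_domain]++;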

641

Isn't the dom variable uninitialized there, at least on the very first iteration of the loop?

645

Style: internal ()s are not needed.

652

What is the new semantic of the pc_domain pcpu member? Previously it was the NUMA domain. Now, with one real domain split into several fake domains, what is assigned there?

I understand how this would work when there are enough CPUs belonging to the same real domain that we can split them up, one for each fake domain. But if the number of CPUs in a domain is less than the number of fakes, then selecting one arbitrary domain might break vm_domain_iterator_set() for VM_POLICY_FIRST_TOUCH{,_ROUND_ROBIN}?

Did I misunderstand the algorithm?
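
(To make the concern concrete, a hedged sketch of the kind of CPU assignment under discussion; aside from fake_doms, CPU_FOREACH(), pcpu_find(), and pc_domain, every identifier here is illustrative and not taken from the patch. With fakes_per_phys fake domains carved out of each physical domain, CPUs are spread round-robin across them, which only leaves every fake domain with at least one CPU while fakes_per_phys does not exceed the CPU count of the physical domain:)

        int cpu, fake, fakes_per_phys, phys;

        fakes_per_phys = fake_doms / phys_doms;         /* assume an even split */
        CPU_FOREACH(cpu) {
                phys = cpu_phys_domain(cpu);            /* hypothetical: CPU's real domain */
                fake = phys * fakes_per_phys + cpu % fakes_per_phys;
                pcpu_find(cpu)->pc_domain = fake;
        }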

685

Style: there should be a blank line after '{' if there are no locals.

698

Style: 4 spaces indent.

gallatin marked 10 inline comments as done.
  • style fixes, as well as making sure to initialize the domain.

Thanks again for all the style fixes (we really need a checkpatch, a la Linux, or a cstyle tool like OpenSolaris's -- having hopped between OSes so much, I'm terrible at style(9)). And thanks for catching the uninitialized variable.

sys/vm/vm_page.c
518

It was just me being a bit sloppy and forgetting about consumer non-power-of-two CPU core configurations.

The autotuning feature was never used by us, and is dubious at best, so I've just removed it for simplicity.

641

Indeed. I think it was a mis-merge when merging from our tree to vanilla FreeBSD. I'm a bit horrified that the compiler didn't catch it.

652

The pc_domain pcpu member is the "fake" domain.

I had not considered the case where the number of fakes is larger than the number of CPUs per real domain. That seems to be one of many possible misconfigurations.

Do you think that I should guard against the most likely misconfigurations (such as trying to use "too many" fake domains)? After removing the misguided autoconfiguration, any misconfiguration would be the fault of the administrator, and I'd just be happier to blame him/her. :)

sys/vm/vm_page.c
652

IMO we should try to detect misconfiguration and somehow react to it. I think that it is completely fine to detect the problems during data structure creation and fail to start, e.g. by panicking with some hint about the reason. I mean, we do not need to try to sanity-check the config in advance and try to fix it; just panicking when the config process detects a problem is ok.

It is much worse to ignore the pending issue at the config stage and then misbehave at runtime, after the system starts carrying user data.
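
(For illustration only, the kind of fail-early check being suggested might look like the following; phys_doms and cpus_per_phys_dom are hypothetical stand-ins for whatever the config code computes, and the exact conditions are placeholders rather than the checks added to the diff:)

        /*
         * Boot-time sanity check: reject configurations the fake-domain
         * code cannot represent, and say why, instead of misbehaving later.
         */
        if (fake_doms > MAXMEMDOM ||
            fake_doms % phys_doms != 0 ||
            fake_doms / phys_doms > cpus_per_phys_dom)
                panic("vm.fake_doms=%d unusable (%d physical domains, "
                    "%d CPUs per domain)",
                    fake_doms, phys_doms, cpus_per_phys_dom);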

  • Add sanity checking to reject bad configs.
  • Make the original physical domain count available via sysctl

sys/vm/vm_page.c
652

I think we should not change pc_domain as it has meanings in other contexts (e.g. it is supposed to line up with what bus_get_domain() returns for device affinity). I would perhaps suggest having a separate pc_mem_domain or something to hold the fake domain if this is only about further subdividing memory.

sys/vm/vm_page.c
652

Thanks for bringing this up. I'd forgotten about the device affinity. There are two different uses for device affinity:

  1. allocating memory on the closest numa node
  2. binding (i)threads to the closest numa node

Having a separate pc_mem_domain might cause (1) to fail.

As an alternative, what about adjusting the domain returned by acpi_get_domain() when the fake code is active? We already track the physical backing domain, so it would be easy to just search the domain array and return the first domain whose physical domain matches the device.
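
(A minimal sketch of that lookup, assuming some array records the physical domain backing each fake domain; fake_dom_phys[] and the function name are hypothetical, while vm_ndomains is the existing global:)

        /*
         * Map a device's physical NUMA domain to the first fake domain
         * backed by it; fall back to the physical domain if nothing matches
         * (e.g. the fake-domain code is inactive).
         */
        static int
        fake_domain_for_phys(int phys_dom)
        {
                int d;

                for (d = 0; d < vm_ndomains; d++)
                        if (fake_dom_phys[d] == phys_dom)
                                return (d);
                return (phys_dom);
        }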

sys/vm/vm_page.c
652

I think it will still be good to have a way to find the "real" domain of a CPU. Other things to consider are what 'cpuset -g -d <n>' returns, and what bus_get_cpus() should do.

The device affinity thing is a bit messy. For example, a NIC might want to create per-CPU threads for each CPU in its local domain and then allocate the descriptor rings in the accompanying domain. If bus_get_cpus(INTR_CPUS/LOCAL_CPUS) doesn't honor the fake domains, then you will end up with CPUs in a different fake domain still having "local" memory in terms of the physical NUMA, but perhaps using the "wrong" domain if the driver uses bus_get_domain() instead of the current CPU's domain.

If we change device affinity to just choose the first "fake" domain, and if we fix "cpuset -gd" to honor fake domains, then I think bus_get_cpus() will honor fake domains by default.
This will avoid the confusion, but it will make devices potentially use fewer CPUs for interrupts.
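
(For concreteness, the driver pattern described above roughly follows this shape. It only uses existing newbus and cpuset calls; the function itself is a sketch rather than code from any driver, and error handling is trimmed:)

        static void
        nic_setup_affinity(device_t dev)
        {
                cpuset_t cpus;
                int domain;

                /* Domain the driver would use for descriptor-ring allocations. */
                if (bus_get_domain(dev, &domain) != 0)
                        domain = -1;            /* no NUMA information */

                /* CPUs the driver would use for per-queue interrupt threads. */
                if (bus_get_cpus(dev, LOCAL_CPUS, sizeof(cpus), &cpus) != 0)
                        CPU_COPY(&all_cpus, &cpus);

                /*
                 * If 'domain' follows the fake domains but 'cpus' does not
                 * (or vice versa), queues on some CPUs get paired with memory
                 * in a different fake domain, which is the mismatch described
                 * above.
                 */
                device_printf(dev, "local domain %d, %d local CPUs\n",
                    domain, CPU_COUNT(&cpus));
        }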

sys/vm/vm_page.c
652

To be clear, (since I was thinking out loud a bit in my comment), I think the simplest approach would be to make devices just sit in "one" of the fake domains. If the global per-domain cpusets (which is what cpuset -gd returns) are fixed to represent the fake domains, and if the ACPI PXM to domain method honors the fake domains, then I think this will mostly just work (including bus_get_cpus()). The downside is potentially fewer IRQ threads for multi-queue devices, but I think that's the best set of tradeoffs that I can see.