Support for "fake" NUMA domains for scaling the page queues.
Needs Review · Public

Authored by gallatin on Jan 23 2017, 8:26 PM.
Details

Summary

This commit provides a tunable (vm.fake_doms) which can be
used to increase the number of NUMA domains. The purpose
is to break up the vm page queues to reduce lock contention,
particularly on the inactive page queue. For Netflix-style
read-mostly workloads, this provides a substantial speedup.

Efforts have been made to properly track the backing NUMA
domain for each fake domain, so that affinity information
is preserved.
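
(For orientation, here is a minimal sketch of how a loader tunable of this sort might be consumed at boot. The function name, placement, and clamping policy are illustrative only and are not taken from the diff; only vm.fake_doms, vm_ndomains, MAXMEMDOM, and TUNABLE_INT_FETCH() are existing names.)

        #include <sys/param.h>
        #include <sys/kernel.h>
        #include <vm/vm.h>
        #include <vm/vm_phys.h>

        static int fake_doms = 0;

        /*
         * Hypothetical early-boot hook: read the vm.fake_doms loader tunable
         * and, if it asks for more domains than the hardware reports, scale
         * vm_ndomains so the page queues are split further.  Clamping to
         * MAXMEMDOM keeps the per-domain arrays in bounds.
         */
        static void
        vm_fake_doms_fetch(void)
        {

                TUNABLE_INT_FETCH("vm.fake_doms", &fake_doms);
                if (fake_doms <= vm_ndomains)
                        return;
                if (fake_doms > MAXMEMDOM)
                        fake_doms = MAXMEMDOM;
                vm_ndomains = fake_doms;
        }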

Diff Detail

Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 7493
Build 7654: arc lint + arc unit

Event Timeline

gallatin retitled this revision from to Support for "fake" NUMA domains for scaling the page queues..
gallatin updated this object.
gallatin edited the test plan for this revision. (Show Details)
gallatin added reviewers: kib, jhb, markj, scottl, glebius.

I believe that on x86 it is very useful to keep the whole lowest 4G in the same domain.

sys/vm/vm_page.c
452

I did not find anything that would prevent running past the end of the array if it is misconfigured.

614

Use ANSI C definition.

634

Style: space before '('.

658

ANSI C definition.

gallatin edited edge metadata.
  • Address Kib's feedback

Hi,

Thanks for the feedback

I'd prefer, if possible, not to treat domain 0 specially to ensure that at least 4GB resides there. I would rather this only be used on systems with large amounts of RAM, to keep things (almost) simple.

sys/vm/vm_page.c
452

Thanks for catching this. I added an error condition if this overrun is about to happen.

sys/vm/vm_page.c
436

Style: you do not need () there.

448

Style: no need for () there or on the previous line.

497

A more stylish expression for the function would be:

        /* Comparator body: returns -1, 0, or 1 as *p1 is <, ==, or > *p2. */
        vm_paddr_t left, right;

        left = *(const vm_paddr_t *)p1;
        right = *(const vm_paddr_t *)p2;
        return ((left > right) - (left < right));
518

Could you please explain how 'no more than 4 cpus' is related to the condition checked?

628

Style: one blank line is enough.

635

Would CPU_FOREACH() be useful for this loop?
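
(For context, CPU_FOREACH() comes from sys/smp.h and iterates only over CPU IDs that are actually present, skipping absent IDs that a plain counter loop up to mp_maxid would visit. Roughly, with an illustrative per-domain counter as the example use:)

        /* Approximate definition from sys/smp.h: */
        #define CPU_FOREACH(i)                                          \
                for ((i) = 0; (i) <= mp_maxid; (i)++)                   \
                        if (!CPU_ABSENT((i)))

        /* Example use (cpus_in_dom[] is illustrative only): */
        CPU_FOREACH(cpu)
                cpus_in_dom[pcpu_find(cpu)->pc_domain]++;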

641

Isn't the dom variable uninitialized there, at least on the very first iteration of the loop?

645

Style: internal ()s are not needed.

652

What is the new semantic of the pc_domain pcpu member? Previously it was the NUMA domain. Now, with one real domain split into several fake domains, what is assigned there?

I understand how this would work when there are enough CPUs belonging to the same real domain that we can split them up, one for each fake domain. But if the number of CPUs in a domain is less than the number of fakes, then selecting one arbitrary domain might break vm_domain_iterator_set() for VM_POLICY_FIRST_TOUCH{,_ROUND_ROBIN}?

Did I misunderstand the algorithm?
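
(To make the concern concrete, a hedged sketch of the kind of CPU assignment under discussion; aside from fake_doms, CPU_FOREACH(), pcpu_find(), and pc_domain, every identifier here is illustrative and not taken from the patch. With fakes_per_phys fake domains carved out of each physical domain, CPUs are spread round-robin across them, which only leaves every fake domain with at least one CPU while fakes_per_phys does not exceed the CPU count of the physical domain:)

        int cpu, fake, fakes_per_phys, phys;

        fakes_per_phys = fake_doms / phys_doms;         /* assume an even split */
        CPU_FOREACH(cpu) {
                phys = cpu_phys_domain(cpu);            /* hypothetical: CPU's real domain */
                fake = phys * fakes_per_phys + cpu % fakes_per_phys;
                pcpu_find(cpu)->pc_domain = fake;
        }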

685

Style: there should be a blank line after '{' if there are no locals.

698

Style: 4 spaces indent.

gallatin marked 10 inline comments as done.
  • style fixes, as well as making sure to initialize the domain.

Thanks again for all the style fixes (we really need a checkpatch, a la Linux, or a cstyle tool like OpenSolaris's -- having hopped between OSes so much, I'm terrible at style(9)). And thanks for catching the uninitialized variable.

sys/vm/vm_page.c
518

It was just me being a bit sloppy and forgetting about consumer non-power-of-two CPU core configurations.

The autotuning feature was never used by us, and is dubious at best, so I've just removed it for simplicity.

641

Indeed. I think it was a mis-merge when merging from our tree to vanilla FreeBSD. I'm a bit horrified that the compiler didn't catch it.

652

The pc_domain pcpu member is the "fake" domain.

I had not considered the case where the number of fakes is larger than the number of CPUs per real domain. That seems to be one of many possible misconfigurations.

Do you think that I should guard against the most likely misconfigurations (such as trying to use "too many" fake domains)? After removing the misguided autoconfiguration, any misconfiguration would be the fault of the administrator, and I'd just be happier to blame him/her. :)

sys/vm/vm_page.c
652

IMO we should try to detect misconfiguration and somehow react to it. I think that it is completely fine to detect the problems during data structure creation and fail to start, e.g. by panicking with some hint about the reason. I mean, we do not need to try to sanity-check the config in advance and try to fix it; just panicking when the config process detects a problem is ok.

It is much worse to ignore the pending issue at the config stage and then misbehave at runtime, after the system starts carrying user data.
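
(For illustration only, the kind of fail-early check being suggested might look like the following; phys_doms and cpus_per_phys_dom are hypothetical stand-ins for whatever the config code computes, and the exact conditions are placeholders rather than the checks added to the diff:)

        /*
         * Boot-time sanity check: reject configurations the fake-domain
         * code cannot represent, and say why, instead of misbehaving later.
         */
        if (fake_doms > MAXMEMDOM ||
            fake_doms % phys_doms != 0 ||
            fake_doms / phys_doms > cpus_per_phys_dom)
                panic("vm.fake_doms=%d unusable (%d physical domains, "
                    "%d CPUs per domain)",
                    fake_doms, phys_doms, cpus_per_phys_dom);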

  • Add sanity checking to reject bad configs.
  • Make the original physical domain count available via sysctl

sys/vm/vm_page.c
652

I think we should not change pc_domain as it has meanings in other contexts (e.g. it is supposed to line up with what bus_get_domain() returns for device affinity). I would perhaps suggest having a separate pc_mem_domain or something to hold the fake domain if this is only about further subdividing memory.

sys/vm/vm_page.c
652

Thanks for bringing this up. I'd forgotten about the device affinity. There are two different uses for device affinity:

  1. allocating memory on the closest numa node
  2. binding (i)threads to the closest numa node

Having a separate pc_mem_domain might cause (1) to fail.

As an alternative, what about adjusting the domain returned by acpi_get_domain() when the fake code is active? We already track the physical backing domain, so it would be easy to just search the domain array and return the first domain whose physical domain matches the device.
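
(A minimal sketch of that lookup, assuming some array records the physical domain backing each fake domain; fake_dom_phys[] and the function name are hypothetical, while vm_ndomains is the existing global:)

        /*
         * Map a device's physical NUMA domain to the first fake domain
         * backed by it; fall back to the physical domain if nothing matches
         * (e.g. the fake-domain code is inactive).
         */
        static int
        fake_domain_for_phys(int phys_dom)
        {
                int d;

                for (d = 0; d < vm_ndomains; d++)
                        if (fake_dom_phys[d] == phys_dom)
                                return (d);
                return (phys_dom);
        }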

sys/vm/vm_page.c
652

I think it will still be good to have a way to find the "real" domain of a CPU. Other things to consider are what 'cpuset -g -d <n>' returns, and what bus_get_cpus() should do.

The device affinity thing is a bit messy. For example, a NIC might want to create per-CPU threads for each CPU in its local domain and then allocate the descriptor rings in the accompanying domain. If bus_get_cpus(INTR_CPUS/LOCAL_CPUS) doesn't honor the fake domains, then you will end up with CPUs in a different fake domain still having "local" memory in terms of the physical NUMA, but perhaps using the "wrong" domain if the driver uses bus_get_domain() instead of the current CPU's domain.

If we change device affinity to just choose the first "fake" domain, and if we fix "cpuset -gd" to honor fake domains, then I think bus_get_cpus() will honor fake domains by default.
This will avoid the confusion, but it will make devices potentially use fewer CPUs for interrupts.
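
(For concreteness, the driver pattern described above roughly follows this shape. It only uses existing newbus and cpuset calls; the function itself is a sketch rather than code from any driver, and error handling is trimmed:)

        static void
        nic_setup_affinity(device_t dev)
        {
                cpuset_t cpus;
                int domain;

                /* Domain the driver would use for descriptor-ring allocations. */
                if (bus_get_domain(dev, &domain) != 0)
                        domain = -1;            /* no NUMA information */

                /* CPUs the driver would use for per-queue interrupt threads. */
                if (bus_get_cpus(dev, LOCAL_CPUS, sizeof(cpus), &cpus) != 0)
                        CPU_COPY(&all_cpus, &cpus);

                /*
                 * If 'domain' follows the fake domains but 'cpus' does not
                 * (or vice versa), queues on some CPUs get paired with memory
                 * in a different fake domain, which is the mismatch described
                 * above.
                 */
                device_printf(dev, "local domain %d, %d local CPUs\n",
                    domain, CPU_COUNT(&cpus));
        }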

sys/vm/vm_page.c
652

To be clear, (since I was thinking out loud a bit in my comment), I think the simplest approach would be to make devices just sit in "one" of the fake domains. If the global per-domain cpusets (which is what cpuset -gd returns) are fixed to represent the fake domains, and if the ACPI PXM to domain method honors the fake domains, then I think this will mostly just work (including bus_get_cpus()). The downside is potentially fewer IRQ threads for multi-queue devices, but I think that's the best set of tradeoffs that I can see.