Allow empty NUMA domains to support AMD 2990WX
Closed · Public

Authored by gallatin on Aug 21 2018, 8:39 PM.

Details

Reviewers
jeff
markj
alc
Summary

The AMD Threadripper 2990WX is basically a slightly crippled Epyc. Rather than having 4 memory controllers, one per NUMA domain, it has 2 memory controllers. This means that 2 of the 4 NUMA domains are populated with physical memory, and the others are empty.

This patch adds support to FreeBSD for empty NUMA domains by:

  • creating empty memory domains when parsing the SRAT table, rather than failing to parse the table
  • not running the pageout daemon threads in empty domains
  • adding defensive code to UMA to avoid allocating from empty domains
  • adding defensive code to cpuset to avoid binding to an empty domain

Thanks to Jeff for suggesting this strategy.
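
The recurring idea in the discussion below is a cheap "does this domain have any physical memory?" test, consulted wherever a specific domain is about to be chosen. Here is a minimal sketch of that test, assuming a VM_DOMAIN_EMPTY()-style predicate over the per-domain page count; the macro and field names follow the style of sys/vm/vm_pagequeue.h, but read them as assumptions rather than the literal diff.

/* Sketch only -- illustrative, not the committed change. */
#include <sys/param.h>
#include <vm/vm.h>
#include <vm/vm_pagequeue.h>            /* struct vm_domain, vm_dom[] */

/*
 * A NUMA domain is "empty" when the SRAT describes it (it has CPUs and
 * a proximity ID) but no physical memory pages were attached to it.
 */
#define SKETCH_DOMAIN_EMPTY(n)  (vm_dom[(n)].vmd_page_count == 0)

/*
 * Allocation paths that were handed an explicit domain guard on it,
 * here by widening the request to "any domain" (-1 in this sketch).
 */
static int
sketch_pick_domain(int requested)
{
        if (requested >= 0 && SKETCH_DOMAIN_EMPTY(requested))
                return (-1);
        return (requested);
}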

Diff Detail

Repository
rS FreeBSD src repository - subversion

Event Timeline

sys/vm/uma_core.c
2717

This does not look right.

sys/vm/uma_core.c
2717

Hi Alan,

Can you clarify just a bit please? What does not look right?

Thanks!

sys/kern/kern_cpuset.c
496–498

DOMAINSET_FOREACH()
        if (isset)
                return (true);
return (false);

2113–2114

This is not quite right. We may want to just uniformly clear empty domains from any set before checking for empty sets. There is nothing specific to RR, though; it is just the easiest way to make a broken policy from userspace. We have to ask what is going to be least surprising to user processes. They won't know about domains with missing memory. They will probably want to set a domain mask and allocation policy and have it just work with whatever is left. If they attempt to require only the local domain, failing is one option; simply picking the next-closest domains and silently working is another.

In vm_domainset.c we should probably fall back to interleave and not round-robin when a policy is impossible as well.
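
A hedged sketch of the sanitization being suggested here (the author adopts the interleave fallback in a later reply): the helper name and the VM_DOMAIN_EMPTY() predicate are illustrative, while the DOMAINSET_* macros are the BITSET-style ones from sys/domainset.h.

/* Sketch only. */
#include <sys/param.h>
#include <sys/domainset.h>
#include <vm/vm_phys.h>                 /* vm_ndomains (declaration location assumed) */

/*
 * Drop memory-less domains from a requested mask.  If nothing usable
 * remains, widen the request to every domain that does have memory and
 * switch the policy to interleave, rather than failing or silently
 * round-robining over empty domains.
 */
static void
sketch_sanitize_domainset(domainset_t *mask, int *policy)
{
        int d;

        for (d = 0; d < vm_ndomains; d++)
                if (DOMAINSET_ISSET(d, mask) && VM_DOMAIN_EMPTY(d))
                        DOMAINSET_CLR(d, mask);

        if (DOMAINSET_FLS(mask) == 0) {         /* mask is now empty */
                for (d = 0; d < vm_ndomains; d++)
                        if (!VM_DOMAIN_EMPTY(d))
                                DOMAINSET_SET(d, mask);
                *policy = DOMAINSET_POLICY_INTERLEAVE;
        }
}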

sys/vm/uma_core.c
2717

The ()'s are misplaced.

gallatin marked an inline comment as done.
sys/kern/kern_cpuset.c
496–498

DOMAINSET_FOREACH does not exist, and I think creating it may be outside the scope of this change. FWIW, the DOMAINSET_FLS() idiom is used twice more besides my usage.

2113–2114

I've fallen back to interleave across all domains as you suggested.

I'm afraid I don't understand your comment about fixing things in vm_domainset.c. It seems like if we fix them while the policy is being set, we don't need to do the work at allocation time?
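
For readers unfamiliar with the idiom mentioned above: DOMAINSET_FLS() returns the 1-based index of the highest set bit, or 0 for an empty set, so it can bound a plain for loop over the mask without a dedicated FOREACH macro. A minimal, hypothetical example of the kind of check being discussed (the helper name and the empty-domain predicate are assumptions):

/* Sketch only. */
static bool
sketch_domainset_has_memory(domainset_t *mask)
{
        int d;

        /* DOMAINSET_FLS() bounds the walk; 0 means no bits are set. */
        for (d = 0; d < DOMAINSET_FLS(mask); d++)
                if (DOMAINSET_ISSET(d, mask) && !VM_DOMAIN_EMPTY(d))
                        return (true);
        return (false);
}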

sys/vm/uma_core.c
2717

Thank you! Fixed in the next patch.

I haven't had much of a chance to review this yet, but I wanted to test an "options NUMA" kernel on a system with an empty domain (dual socket Haswell xeon with only one DIMM installed at the moment). This patch worked fine for me. Thanks!

Glad that it was helpful!

We already have code to renumber NUMA domains to avoid holes in the domain ID space (renumber_domains() in srat.c). Wouldn't it be simpler to apply a similar approach to empty domains, or is there some reason that won't work?

I'm not sure how renumbering helps. At least on my desktop, we have 4 domains. Domains 0 and 2 have both memory and CPU cores. Domains 1 and 3 are empty of memory but have CPU cores. If we renumber, don't we lose information? My goal was to allow SRAT parsing to express the empty domains so that the VM / UMA system could use that information to provide better performance for tasks scheduled on domains 0 and 2.

Right, but how are we making use of that information now? I'm not sure what the SLIT looks like on the 2990WX (what does "sysctl vm.phys_locality" report?) but in other cases we might want to, based on locality information, treat all CPUs in domain 1 as being in domain 0, and CPUs in domain 3 as being in domain 2.

Anyway, I'm just concerned that this approach is a bit fragile given the number of places where we have to adjust a "domain" parameter. It looks ok to me modulo the comments below.

sys/vm/uma_core.c
2657

Extra newline.

3697

I think this routine needs some handling too: kmem_alloc_domain() will just fail if we specify an empty domain, so malloc_domain() will fail for large allocations.

If you treat all domain 1 cores as effectively belonging to domain 0, then a (default) first-touch allocation policy for a program running on a domain 1 core is going to be bandwidth starved compared to programs running on domain 0 (or 2) cores. At least for 1st generation EPYC, and I'm assuming this hasn't really changed, the inter-die link bandwidth is only about half the local memory bandwidth. So, if the default policy for a thread running on a domain 1 core is interleave, then it will have higher latency, but at least it won't be bandwidth starved.

The bottom line is that I don't think we should simplify things by assigning domain 1 cores to domain 0, etc.

(Since the 1st generation of Threadripper only had two active die, they could repurpose the links that would have connected to the inactive die to provide greater bandwidth between the two active die. So, the inter-die interconnect was never a bandwidth bottleneck on 1st generation Threadripper, as it is on 1st generation EPYC.)
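
The argument boils down to a per-CPU default policy: keep first-touch where the local domain has memory, and use interleave where it does not, so that CPUs in memory-less domains are not funneled through a single remote domain's memory controllers. A toy illustration of that decision (the helper is hypothetical; the policy constants are the standard ones from sys/domainset.h):

/* Sketch only. */
static int
sketch_default_policy_for(int cpu_domain)
{
        /* Memory-less local domain: spread traffic over all memory domains. */
        if (VM_DOMAIN_EMPTY(cpu_domain))
                return (DOMAINSET_POLICY_INTERLEAVE);
        /* Otherwise keep the usual local-first behaviour. */
        return (DOMAINSET_POLICY_FIRSTTOUCH);
}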

Added a check for empty domains in uma_large_malloc_domain(), and removed the extra blank line, as pointed out by Mark.

sys/kern/kern_cpuset.c
487

"silenty" is misspelled. Add another space after the question mark.

sys/vm/uma_core.c
3703

There should be one less 't' in 'targetting'.

kmem_back() now also needs updating.

  • Fixed spelling errors pointed out by alc
  • Updated to account for recent VM changes:
    • added an empty-domain guard in kmem_back()
    • fixed a conflict in keg_fetch_slab() after r338755

Done, thanks for pointing it out.
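
The kmem_back() change itself is not quoted in this review, so what follows is only the general shape of such a guard, sketched under the assumption that kernel backing is spread across domains and that empty domains simply need to be skipped; the helper name and structure are hypothetical.

/* Sketch only. */
static int
sketch_next_backing_domain(int domain)
{
        int tries;

        /* Advance round-robin, but never land on a memory-less domain. */
        for (tries = 0; tries < vm_ndomains; tries++) {
                domain = (domain + 1) % vm_ndomains;
                if (!VM_DOMAIN_EMPTY(domain))
                        return (domain);
        }
        return (domain);        /* no populated domain found */
}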

This revision is now accepted and ready to land. (Sep 24 2018, 3:37 PM)
sys/vm/vm_kern.c
505–506

Andrew, how are the domains numbered on the new Threadripper? I'm concerned about the possibility that the domains with memory are numbered 0 and 1, and the domains without memory are 2 and 3. Then, allocations by processors in domains 2 and 3 would always come from domain 0.

sys/vm/uma_core.c
2571

As in uma_zalloc_domain(), we need to set domain = UMA_ANYDOMAIN if the requested domain is empty. keg_fetch_slab() tries to honour the requested domain.
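
Reduced to its essence, the request is to drop an impossible constraint before keg_fetch_slab() tries to honour it. A minimal sketch, assuming UMA_ANYDOMAIN as UMA's "no preference" value (as used in this comment) and the same empty-domain predicate as above:

/* Sketch only. */
static int
sketch_uma_adjust_domain(int domain)
{
        /* An empty domain can never satisfy the request; drop the hint. */
        if (domain != UMA_ANYDOMAIN && VM_DOMAIN_EMPTY(domain))
                domain = UMA_ANYDOMAIN;
        return (domain);
}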

gallatin added inline comments.
sys/vm/uma_core.c
2571

I'm confused by this comment. Are you saying that we should check for empty domains in uma_zalloc_domain()? I had thought the check in zone_alloc_item() would suffice.

sys/vm/vm_kern.c
505–506

In this case, we're safe: they populate domains 0 and 2. In general, I see your concern. Hopefully, nobody will ever make a machine like this again.

sys/vm/uma_core.c
2571

Ignore my comment, sorry; I got confused while reading the code.

Andrew, given that amd64's GENERIC now has the NUMA option enabled by default, I would encourage you to commit this soon. We definitely want this change in 12.0.

sys/vm/vm_kern.c
505–506

Setting aside the newest Threadripper chips, I'm sure that we also run into the occasional machine where one or more domains don't have memory installed.

This was committed as r339043. Due to a cut and paste error, the wrong review was linked from the commit message, so the review was not auto-closed.

Nice work, thanks for this fix!