Allow empty NUMA domains to support AMD 2990WX
Closed · Public

Authored by gallatin on Aug 21 2018, 8:39 PM.

Details

Reviewers
jeff
markj
alc
Summary

The AMD Threadripper 2990WX is basically a slightly crippled Epyc. Rather than having 4 memory controllers, one per NUMA domain, it has 2 memory controllers. This means that 2 of the 4 NUMA domains are populated with physical memory, and the others are empty.

This patch adds support to FreeBSD for empty NUMA domains by:

  • creating empty memory domains when parsing the SRAT table, rather than failing to parse the table
  • not running the pageout daemon threads in empty domains
  • adding defensive code to UMA to avoid allocating from empty domains
  • adding defensive code to cpuset to avoid binding to an empty domain

Thanks to Jeff for suggesting this strategy.
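
The recurring idea in the discussion below is a cheap "does this domain have any physical memory?" test, consulted wherever a specific domain is about to be chosen. Here is a minimal sketch of that test, assuming a VM_DOMAIN_EMPTY()-style predicate over the per-domain page count; the macro and field names follow the style of sys/vm/vm_pagequeue.h, but read them as assumptions rather than the literal diff.

/* Sketch only -- illustrative, not the committed change. */
#include <sys/param.h>
#include <vm/vm.h>
#include <vm/vm_pagequeue.h>            /* struct vm_domain, vm_dom[] */

/*
 * A NUMA domain is "empty" when the SRAT describes it (it has CPUs and
 * a proximity ID) but no physical memory pages were attached to it.
 */
#define SKETCH_DOMAIN_EMPTY(n)  (vm_dom[(n)].vmd_page_count == 0)

/*
 * Allocation paths that were handed an explicit domain guard on it,
 * here by widening the request to "any domain" (-1 in this sketch).
 */
static int
sketch_pick_domain(int requested)
{
        if (requested >= 0 && SKETCH_DOMAIN_EMPTY(requested))
                return (-1);
        return (requested);
}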

Diff Detail

Repository
rS FreeBSD src repository - subversion

Event Timeline

sys/vm/uma_core.c
2717

This does not look right.

sys/vm/uma_core.c
2717

Hi Alan,

Can you clarify just a bit please? What does not look right?

Thanks!

sys/kern/kern_cpuset.c
496–498

DOMAINSET_FOREACH()
        if (isset)
                return (true);
return (false);

2113–2114

This is not quite right. We may want to just uniformly clear empty domains from any set before checking for empty sets. There is nothing specific to RR, though; it is just the easiest way to make a broken policy from userspace. We have to ask what is going to be least surprising to user processes. They won't know about domains with missing memory. They will probably want to set a domain mask and allocation policy and have it just work with whatever is left. If they attempt to require only the local domain, failing is one option; simply picking the next-closest domains and silently working is another.

In vm_domainset.c we should probably fall back to interleave and not round-robin when a policy is impossible as well.
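
A hedged sketch of the sanitization being suggested here (the author adopts the interleave fallback in a later reply): the helper name and the VM_DOMAIN_EMPTY() predicate are illustrative, while the DOMAINSET_* macros are the BITSET-style ones from sys/domainset.h.

/* Sketch only. */
#include <sys/param.h>
#include <sys/domainset.h>
#include <vm/vm_phys.h>                 /* vm_ndomains (declaration location assumed) */

/*
 * Drop memory-less domains from a requested mask.  If nothing usable
 * remains, widen the request to every domain that does have memory and
 * switch the policy to interleave, rather than failing or silently
 * round-robining over empty domains.
 */
static void
sketch_sanitize_domainset(domainset_t *mask, int *policy)
{
        int d;

        for (d = 0; d < vm_ndomains; d++)
                if (DOMAINSET_ISSET(d, mask) && VM_DOMAIN_EMPTY(d))
                        DOMAINSET_CLR(d, mask);

        if (DOMAINSET_FLS(mask) == 0) {         /* mask is now empty */
                for (d = 0; d < vm_ndomains; d++)
                        if (!VM_DOMAIN_EMPTY(d))
                                DOMAINSET_SET(d, mask);
                *policy = DOMAINSET_POLICY_INTERLEAVE;
        }
}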

sys/vm/uma_core.c
2717

The ()'s are misplaced.

gallatin marked an inline comment as done.
sys/kern/kern_cpuset.c
496–498

DOMAINSET_FOREACH does not exist, and I think creating it may be outside the scope of this change. FWIW, the DOMAINSET_FLS() idiom is used twice more besides my usage.

2113–2114

I've fallen back to interleave across all domains as you suggested.

I'm afraid I don't understand your comment about fixing things in vm_domainset.c. It seems like if we fix them while the policy is being set, we don't need to do the work at allocation time?
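
For readers unfamiliar with the idiom mentioned above: DOMAINSET_FLS() returns the 1-based index of the highest set bit, or 0 for an empty set, so it can bound a plain for loop over the mask without a dedicated FOREACH macro. A minimal, hypothetical example of the kind of check being discussed (the helper name and the empty-domain predicate are assumptions):

/* Sketch only. */
static bool
sketch_domainset_has_memory(domainset_t *mask)
{
        int d;

        /* DOMAINSET_FLS() bounds the walk; 0 means no bits are set. */
        for (d = 0; d < DOMAINSET_FLS(mask); d++)
                if (DOMAINSET_ISSET(d, mask) && !VM_DOMAIN_EMPTY(d))
                        return (true);
        return (false);
}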

sys/vm/uma_core.c
2717

Thank you! Fixed in the next patch.

I haven't had much of a chance to review this yet, but I wanted to test an "options NUMA" kernel on a system with an empty domain (dual socket Haswell xeon with only one DIMM installed at the moment). This patch worked fine for me. Thanks!

Glad that it was helpful!

We already have code to renumber NUMA domains to avoid holes in the domain ID space (renumber_domains() in srat.c). Wouldn't it be simpler to apply a similar approach to empty domains, or is there some reason that won't work?

I'm not sure how renumbering helps. At least on my desktop, we have 4 domains. Domains 0 and 2 have both memory and CPU cores. Domains 1 and 3 are empty of memory but have CPU cores. If we renumber, don't we lose information? My goal was to allow SRAT parsing to express the empty domains so that the VM / UMA system could use that information to provide better performance for tasks scheduled on domains 0 and 2.

Right, but how are we making use of that information now? I'm not sure what the SLIT looks like on the 2990WX (what does "sysctl vm.phys_locality" report?) but in other cases we might want to, based on locality information, treat all CPUs in domain 1 as being in domain 0, and CPUs in domain 3 as being in domain 2.

Anyway, I'm just concerned that this approach is a bit fragile given the number of places where we have to adjust a "domain" parameter. It looks ok to me modulo the comments below.

sys/vm/uma_core.c
2657

Extra newline.

3697

I think this routine needs some handling too: kmem_alloc_domain() will just fail if we specify an empty domain, so malloc_domain() will fail for large allocations.

If you treat all domain 1 cores as effectively belonging to domain 0, then a (default) first-touch allocation policy for a program running on a domain 1 core is going to be bandwidth starved compared to programs running on domain 0 (or 2) cores. At least for 1st generation EPYC, and I'm assuming this hasn't really changed, the inter-die link bandwidth is only about half the local memory bandwidth. So, if the default policy for a thread running on a domain 1 core is interleave, then it will have higher latency, but at least it won't be bandwidth starved.

The bottom line is that I don't think we should simplify things by assigning domain 1 cores to domain 0, etc.

(Since the 1st generation of Threadripper only had two active die, they could repurpose the links that would have connected to the inactive die to provide greater bandwidth between the two active die. So, the inter-die interconnect was never a bandwidth bottleneck on 1st generation Threadripper, as it is on 1st generation EPYC.)
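
The argument boils down to a per-CPU default policy: keep first-touch where the local domain has memory, and use interleave where it does not, so that CPUs in memory-less domains are not funneled through a single remote domain's memory controllers. A toy illustration of that decision (the helper is hypothetical; the policy constants are the standard ones from sys/domainset.h):

/* Sketch only. */
static int
sketch_default_policy_for(int cpu_domain)
{
        /* Memory-less local domain: spread traffic over all memory domains. */
        if (VM_DOMAIN_EMPTY(cpu_domain))
                return (DOMAINSET_POLICY_INTERLEAVE);
        /* Otherwise keep the usual local-first behaviour. */
        return (DOMAINSET_POLICY_FIRSTTOUCH);
}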

Added a check for empty domains in uma_large_malloc_domain(), and removed the extra blank line, as pointed out by Mark.

sys/kern/kern_cpuset.c
487

"silenty" is misspelled. Add another space after the question mark.

sys/vm/uma_core.c
3703

There should be one less 't' in 'targetting'.

kmem_back() now also needs updating.

  • Fixed spelling errors pointed out by alc
  • Updated to account for recent VM changes:
    • added an empty-domain guard in kmem_back()
    • fixed a conflict in keg_fetch_slab() after r338755

Done, thanks for pointing it out.
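
The kmem_back() change itself is not quoted in this review, so what follows is only the general shape of such a guard, sketched under the assumption that kernel backing is spread across domains and that empty domains simply need to be skipped; the helper name and structure are hypothetical.

/* Sketch only. */
static int
sketch_next_backing_domain(int domain)
{
        int tries;

        /* Advance round-robin, but never land on a memory-less domain. */
        for (tries = 0; tries < vm_ndomains; tries++) {
                domain = (domain + 1) % vm_ndomains;
                if (!VM_DOMAIN_EMPTY(domain))
                        return (domain);
        }
        return (domain);        /* no populated domain found */
}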

This revision is now accepted and ready to land. (Sep 24 2018, 3:37 PM)
sys/vm/vm_kern.c
505–506

Andrew, how are the domains numbered on the new Threadripper? I'm concerned about the possibility that the domains with memory are numbered 0 and 1, and the domains without memory are 2 and 3. Then, allocations by processors in domains 2 and 3 would always come from domain 0.

sys/vm/uma_core.c
2571

As in uma_zalloc_domain(), we need to set domain = UMA_ANYDOMAIN if the requested domain is empty. keg_fetch_slab() tries to honour the requested domain.
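
Reduced to its essence, the request is to drop an impossible constraint before keg_fetch_slab() tries to honour it. A minimal sketch, assuming UMA_ANYDOMAIN as UMA's "no preference" value (as used in this comment) and the same empty-domain predicate as above:

/* Sketch only. */
static int
sketch_uma_adjust_domain(int domain)
{
        /* An empty domain can never satisfy the request; drop the hint. */
        if (domain != UMA_ANYDOMAIN && VM_DOMAIN_EMPTY(domain))
                domain = UMA_ANYDOMAIN;
        return (domain);
}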

gallatin added inline comments.
sys/vm/uma_core.c
2571

I'm confused by this comment. Are you saying that we should check for empty domains in uma_zalloc_domain()? I had thought the check in zone_alloc_item() would suffice.

sys/vm/vm_kern.c
505–506

In this case, we're safe: they populate domains 0 and 2. In general, I see your concern. Hopefully, nobody will ever make a machine like this again.

sys/vm/uma_core.c
2571

Ignore my comment, sorry; I got confused while reading the code.

Andrew, given that amd64's GENERIC now has the NUMA option enabled by default, I would encourage you to commit this soon. We definitely want this change in 12.0.

sys/vm/vm_kern.c
505–506

Setting aside the newest Threadripper chips, I'm sure that we also run into the occasional machine where one or more domains don't have memory installed.

This was committed as r339043. Due to a cut and paste error, the wrong review was linked from the commit message, so the review was not auto-closed.

Nice work, thanks for this fix!