amd64: Bump MAXCPU to 1024 (from 256)
Closed, Public

Authored by emaste on Sep 30 2022, 12:18 PM.
Details

Summary

Hardware with more than 256 cores is now available and will become increasingly common.

Diff Detail

Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

emaste created this revision.

We have #define MAX_APIC_ID 0x200 and _Static_assert(MAXCPU <= MAX_APIC_ID, "MAXCPU cannot be larger than MAX_APIC_ID");

emaste retitled this revision from amd64: Bump MAXCPU to 1024 (from 256) to amd64: Bump MAXCPU to 512 (from 256).

Just 512 at first

We'll want to go higher but will try starting here

There are many MAXCPU-sized arrays that we should investigate along with this work.

Examples from the first third or so of the output of grep -w MAXCPU:

#define NPV_LIST_LOCKS  MAXCPU
struct pmap_pcids       pm_pcids[MAXCPU];
static struct asid asid[MAXCPU];
static uint8_t hsave[MAXCPU][PAGE_SIZE] __aligned(PAGE_SIZE);
int vmxon_enabled[MAXCPU];
static char vmxon_region[MAXCPU][PAGE_SIZE] __aligned(PAGE_SIZE);
long            eptgen[MAXCPU];         /* cached pmap->pm_eptgen */
static struct cam_doneq cam_doneqs[MAXCPU];
cpu_core_t      cpu_core[MAXCPU];
solaris_cpu_t   solaris_cpu[MAXCPU];
#define NCPU            MAXCPU
static int64_t  tsc_skew[MAXCPU];
struct dtrace_debug_data { ... } dtrace_debug_data[MAXCPU];
#define WORK_CPU_UNBOUND MAXCPU

/* Structure to store portals' properties */
struct XX_PortalInfo {
        vm_paddr_t      portal_ce_pa[2][MAXCPU];
        vm_paddr_t      portal_ci_pa[2][MAXCPU];
        uint32_t        portal_ce_size[2][MAXCPU];
        uint32_t        portal_ci_size[2][MAXCPU];
        vm_offset_t     portal_ce_va[2];
        vm_offset_t     portal_ci_va[2];
        uintptr_t       portal_intr[2][MAXCPU];
};

struct hpet_softc { ... int                     pcpu_slaves[MAXCPU]; ... }
#define ACPI_MAX_TASKS          MAX(32, MAXCPU * 4)
struct sysctl_oid *sc_sysctl_cpu[MAXCPU];
bool            sc_regs_mapped[MAXCPU]; /* register mapping status */
t_Handle        sc_bph[MAXCPU];         /* BMAN portal handles */
struct dpaa_portals_softc { ... dpaa_portal_t   sc_dp[MAXCPU]; ... };
struct qman_softc { ... bool            sc_regs_mapped[MAXCPU]; ... t_Handle        sc_qph[MAXCPU];         /* QMAN portal handles */ ... };
struct hv_storvsc_sysctl { ... u_long          chan_send_cnt[MAXCPU]; ... };
struct storvsc_softc { ... struct vmbus_channel            *hs_sel_chan[MAXCPU]; ... };

sys/netpfil/pf/pf.c:257:1: error: static_assert failed due to requirement '(1 << 8) >= 512' "compile-time assertion failed"
PFID_CPUBITS needs to be increased

Boot smoke test was successful with this additional change:

diff --git a/sys/netpfil/pf/pf.c b/sys/netpfil/pf/pf.c
index 33ef5119ee3c..64d8bb92d5f0 100644
--- a/sys/netpfil/pf/pf.c
+++ b/sys/netpfil/pf/pf.c
@@ -250,7 +250,7 @@ VNET_DEFINE(uma_zone_t,      pf_state_z);
 VNET_DEFINE(uma_zone_t,         pf_state_key_z);
 
 VNET_DEFINE(uint64_t, pf_stateid[MAXCPU]);
-#define        PFID_CPUBITS    8
+#define        PFID_CPUBITS    16
 #define        PFID_CPUSHIFT   (sizeof(uint64_t) * NBBY - PFID_CPUBITS)
 #define        PFID_CPUMASK    ((uint64_t)((1 << PFID_CPUBITS) - 1) << PFID_CPUSHIFT)
 #define        PFID_MAXID      (~PFID_CPUMASK)

https://cirrus-ci.com/task/6065229367869440

I built with PFID_CPUBITS=16 before discussing with @kp; testing with PFID_CPUBITS=10 now.

I have a potential patch to remove that reliance on MAX_CPU, but go ahead and make the =10 change already if that unblocks things.

In D36838#836090, @kp wrote:

I have a potential patch to remove that reliance on MAX_CPU, but go ahead and make the =10 change already if that unblocks things.

Yes, it should be counter(9) based. The existing code predates counter(9). Something pretty close to what we have to generate IPv4 header ID. Please add me to reviewers. Thanks!

I have a potential patch to remove that reliance on MAX_CPU

I'm not quite ready to commit the MAXCPU change since we should sort out the affinity syscalls first, and we may want to go right to 1024, so perhaps your pf change will be in first. I have changed PFID_CPUBITS to 10 in my WIP branch, and will commit it with other MAXCPU-related changes if your work is not in the tree at that time.

See also: CPU_MAXSIZE in <sys/sys/_cpuset.h>

For MAX_APIC_ID, that may be the maximum non-x2APIC ID, which needs to stay at 255 if so. However, the assertion that MAXCPU is <= MAX_APIC_ID is no longer relevant with x2APIC and can be removed, I think. I'm not sure whether we support x2APIC on i386, so maybe the assertion still has to stay for i386? (More of a @kib question.)

In D36838#836090, @kp wrote:

I have a potential patch to remove that reliance on MAX_CPU, but go ahead and make the =10 change already if that unblocks things.

Yes, it should be counter(9) based. The existing code predates counter(9). Something pretty close to what we have to generate IPv4 header ID. Please add me to reviewers. Thanks!

I don't think that really works. It'd be basically the same as the current approach, and we'd have to keep adding in CPU bits to avoid conflicts.

My current plan is to use atomic_fetchadd_64(), but that breaks on i386 (although that does have atomic_fetchadd_long(), which may suffice).

This is what I have in mind: https://reviews.freebsd.org/D36915

This does work on i386; I was misreading the compiler error at first.

But you need to also bump MAX_APIC_ID, or else the MADT parsing code will start dropping CPUs that have IDs > 512. Note that APIC IDs don't need to be contiguous, so it's possible to have a system with only even APIC IDs (i.e., 0, 2, 4, ...), at which point, in order to support 512 CPUs, you might need to be able to parse APIC IDs up to 1024 or more.

It seems like MAX_APIC_ID is a safeguard, because we have some data structures that are allocated based on the maximum APIC ID found on the system (the lapics struct in madt.c). You would need to bump MAX_APIC_ID to twice the value of MAXCPU, I would think, and ideally we would remove the dependency on allocating structures based on the maximum APIC ID and allocate based only on the number of CPUs in the system, or else we are likely to waste memory.

The MAX_APIC_ID increase is in D37006. I propose doing that as a separate change in advance of increasing MAXCPU, as MAXCPU may be set via the kernel config file, and it is currently not possible to increase it beyond MAX_APIC_ID.

emaste retitled this revision from amd64: Bump MAXCPU to 512 (from 256) to amd64: Bump MAXCPU to 1024 (from 256).

Increase to 1024, which is feasible after D37067

des added a subscriber: des.

lgtm, we should get this in now so we have a few months to see what breaks before 14.0.

This revision is now accepted and ready to land. Feb 16 2023, 12:02 AM

There are at least two things to do before this can land, imo:

  1. size(1) the kernel binary before/after (preferably LINT)
  2. check total memory use from booting single-user on a small box before/after, preferably single-core. I guess diffing sysctl -a output would provide enough info?

Before:

> size kernel
      text      data       bss        dec         hex   filename
  23076646   1870505   4415872   29363023   0x1c00b4f   kernel
> size vmm.ko
    text    data       bss       dec        hex   filename
  177636   30744   5279816   5488196   0x53be44   vmm.ko

after:

> size kernel
      text      data       bss        dec         hex   filename
  23086310   2657033   5726464   31469807   0x1e030ef   kernel
> size vmm.ko
    text    data        bss        dec        hex   filename
  181732   30744   11587656   11800132   0xb40e44   vmm.ko

This is GENERIC-NODEBUG.

I grabbed the kernel symbol sizes, sorted by size, and stashed them here:

https://people.freebsd.org/~markj/cpuset/sizes.cpuset.txt
https://people.freebsd.org/~markj/cpuset/sizes.main.txt

At a glance, the largest increases are bdomain (267776 -> 1054208), stoppcbs (81920 -> 327680), cc_cpu (81920 -> 327680). The size increase isn't terrible as-is, though.

vmm has a couple of static arrays (hsave and vmxon_region) which appear to be responsible for most of the vmm.ko size increase.


data 1870505 -> 2657033 gives +42%
bss 4415872 -> 5726464 gives +29.5%

this is pretty atrocious and imo a showstopper for 1024.

what probably would make it tolerable:

  1. bump only to 512
  2. cc_cpu and stoppcbs are candidates for immediate removal -- they can land in struct pcpu and if need be can be reached through the cpuid_to_pcpu array. probably would be nice to abstract it with a macro so that one can CPUID_TO_PCPU(cpu) instead of directly derefing it.

Also note that technically today is the start of the KBI freeze. There is not enough time to get this in, so I'm guessing it would be best to postpone said freeze. I prodded gjb about it.

In D36838#906027, @mjg wrote:

data 1870505 -> 2657033 gives +42%
bss 4415872 -> 5726464 gives +29.5%
this is pretty atrocious and imo a showstopper for 1024.

The only important number is the sum, which increases by 7%, and this increase can be reduced substantially by addressing a small number of cases.

what probably would make it tolerable:

  1. bump only to 512
  2. cc_cpu and stoppcbs are candidates for immediate removal -- they can land in struct pcpu and if need be can be reached through the cpuid_to_pcpu array. probably would be nice to abstract it with a macro so that one can CPUID_TO_PCPU(cpu) instead of directly derefing it.

Moving stoppcbs this way will break kgdb. Better to simply dynamically allocate the array once we know how many APs there are.

Objects in Mark's list that grew by over 1K (symbol, old size, new size, difference):

bdomain 267776 1054208 786432
group 49216 491680 442464
stoppcbs 81920 327680 245760
cc_cpu 81920 327680 245760
pmc_tf 49152 196608 147456
static_single_cpu_mask 8192 131072 122880
cam_doneqs 32768 131072 98304
vm_reserv_object_mtx 16384 65536 49152
ktls_domains 8512 33088 24576
vmspace0 2560 8800 6240
kernel_pmap_store 2256 8496 6240
sc_kts 2048 8192 6144
dpcpu_off 2048 8192 6144
cpuid_to_pcpu 2048 8192 6144
bootstacks 2048 8192 6144
__cpu_data 1536 6144 4608
nws_array 1024 4096 3072
cpu_apic_ids 1024 4096 3072
ktls_cpuid_lookup 512 2048 1536

urgh

kernel_pmap_store is of type struct pmap:

struct pmap {
        struct mtx              pm_mtx;
        pml4_entry_t            *pm_pmltop;     /* KVA of top level page table */
        pml4_entry_t            *pm_pmltopu;    /* KVA of user top page table */
        uint64_t                pm_cr3;
        uint64_t                pm_ucr3;
        TAILQ_HEAD(,pv_chunk)   pm_pvchunk;     /* list of mappings in pmap */
        cpuset_t                pm_active;      /* active on cpus */
        enum pmap_type          pm_type;        /* regular or nested tables */
        struct pmap_statistics  pm_stats;       /* pmap statistics */
        struct vm_radix         pm_root;        /* spare page table pages */
        long                    pm_eptgen;      /* EPT pmap generation id */
        smr_t                   pm_eptsmr;
        int                     pm_flags;
        struct pmap_pcids       pm_pcids[MAXCPU];
        struct rangeset         pm_pkru;
};

That is +6K for every running process. This needs to be patched (move the array to the end and allocate as needed? EDIT: after a cursory look, this would be best served with real per-CPU data, but I don't know if that's available at that stage)

Which prompts another note: instead of just checking object sizes, this clearly needs to check all structs for size changes, but I don't know a handy one-liner to do it. Maybe bloaty can?

I also found that struct pmap is embedded in struct vmspace, which has its own zone, making this cumbersome to fix. It also turns out the zone is NOFREE; what's up with that?

sys/sys/_cpuset.h:43 (On Diff #112038)

Why is this separate? And even if it has to be, there should be a CTASSERT that it is not less than MAXCPU.

Maybe ABI Compliance Checker (ABICC)? It's not really the intended use but it might serve the purpose.

Here are the size changes as found with ctfdump -t:

-_cpuset 32
+_cpuset 128
-bufdomain 33472
+bufdomain 131776
-cpu_group 64
+cpu_group 160
-cpu_offset 48
+cpu_offset 144
-cpuset 104
+cpuset 200
-hhook_head 176
+hhook_head 272
-hn_softc 1440
+hn_softc 1536
-hpet_softc 39112
-hpet_timer 1216
+hpet_softc 137416
+hpet_timer 4288
-hv_storvsc_sysctl 2072
+hv_storvsc_sysctl 8216
-iflib_ctx 1008
+iflib_ctx 1104
-ktls_domain_info 1064
+ktls_domain_info 4136
-nl_control 216
+nl_control 312
-osd_master 192
+osd_master 288
-pde_action 72
+pde_action 168
-pmap 2256
+pmap 8496
-radix_node_head 312
+radix_node_head 408
-rib_head 616
+rib_head 712
-rmlock 96
+rmlock 192
-rmslock_ipi 40
+rmslock_ipi 136
-smp_rendezvous_cpus_retry_arg 32
+smp_rendezvous_cpus_retry_arg 128
-storvsc_softc 5384
+storvsc_softc 17672
-taskqgroup 6192
+taskqgroup 24624
-topo_node 104
+topo_node 200
-unhop_ctl 112
+unhop_ctl 208
-vfs_op_barrier_ipi 40
+vfs_op_barrier_ipi 136
-vmbus_softc 33152
+vmbus_softc 131456
-vmspace 2560
+vmspace 8800

I ran into this handsome fella in sched_4bsd:

#ifdef SMP
/*
 * Per-CPU run queues
 */
static struct runq runq_pcpu[MAXCPU];
long runq_length[MAXCPU];

static cpuset_t idle_cpus_mask;
#endif

This should probably error out and suggest reducing MAXCPU.

sys/sys/_cpuset.h:43 (On Diff #112038)

This is the size used for cpuset_t in userspace. The kernel always uses MAXCPU via the above.

As it happens there is one vm stat emitted by Cirrus-CI on each build:

main (https://cirrus-ci.com/task/5111357406707712?logs=test#L779): vm.stats.vm.v_wire_count: 6433
MAXCPU=1024 (https://cirrus-ci.com/task/6123897498632192?logs=test#L779): vm.stats.vm.v_wire_count: 6571

That's over a 2% bump for nothing.

A sysctl vm dump from these would be prudent to find out who is eating that memory.

that aside, @markj:

#define VM_RESERV_OBJ_LOCK_COUNT        MAXCPU

struct mtx_padalign vm_reserv_object_mtx[VM_RESERV_OBJ_LOCK_COUNT];

Sounds like this should be allocated at boot instead? Bonus points if it becomes NUMA-aware.

rebase after increasing userland cpuset size independently (76887e84be975698b14699d7d0dfb157d73e9990)

This revision now requires review to proceed. May 8 2023, 6:29 PM

Userland cpuset_t increase has been reapplied.

Had another look at v_wire_count in QEMU smoke boot just now:

Some more low-hanging fruit: pmc_tf, static_single_cpu_mask (linuxkpi), cam_doneqs, ktls_domain_info, pending_locks (witness), vm_reserv_object_mtx. The differences account for 108 pages.

This revision was not accepted when it landed; it landed in state Needs Review. Aug 3 2023, 9:46 PM
This revision was automatically updated to reflect the committed changes.