Instead, dynamically allocate a page for the boot stack of each AP when
starting it up, as we do on x86. This shrinks the bss by
MAXCPU*KSTACK_PAGES pages, which corresponds to 4MB on arm64 and 256KB
on riscv. mpentry is slightly simplified as well.
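For reference, the quoted savings can be reproduced with a small
calculation. This is only a sketch: the inputs used in the note below
(MAXCPU of 256 on arm64 and 16 on riscv, KSTACK_PAGES of 4, 4KB pages)
are assumptions chosen to match the figures above, and
bootstack_bss_bytes() is a hypothetical helper, not a kernel function.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Bytes of bss consumed by a static array holding one boot stack per
 * possible CPU.  Hypothetical helper for illustration only.
 */
size_t
bootstack_bss_bytes(size_t maxcpu, size_t kstack_pages, size_t page_size)
{
	assert(page_size != 0);
	return (maxcpu * kstack_pages * page_size);
}
```

Under those assumptions, bootstack_bss_bytes(256, 4, 4096) is 4194304
bytes (4MB) and bootstack_bss_bytes(16, 4, 4096) is 262144 bytes (256KB).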
Duplicate the logic used on x86 to free the boot stacks, using a
sysinit that waits for the APs to switch to a thread.
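The wait-then-free idea can be modelled in plain userspace C roughly as
follows. This is a simplified sketch, not the kernel code: struct
ap_boot, its fields, and try_free_bootstacks() are hypothetical names,
and malloc/free stand in for the kernel allocator. The point it
illustrates is that no boot stack may be released while any AP might
still be running on one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical per-AP bookkeeping for this sketch. */
struct ap_boot {
	void	*bootstack;		/* dynamically allocated boot stack */
	bool	 on_thread_stack;	/* AP has switched to a thread */
};

/*
 * Free every boot stack once all APs have switched to a thread.
 * Returns false, freeing nothing, if any AP is still on its boot
 * stack; a waiting init step would call this again later.
 */
bool
try_free_bootstacks(struct ap_boot *aps, int naps)
{
	assert(aps != NULL);
	for (int i = 0; i < naps; i++) {
		if (!aps[i].on_thread_stack)
			return (false);
	}
	for (int i = 0; i < naps; i++) {
		free(aps[i].bootstack);
		aps[i].bootstack = NULL;
	}
	return (true);
}
```

In this model the late init step simply retries until the function
returns true, which mirrors waiting for the APs before freeing.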
While here, mark some static MD variables as such.