Panic if any APs fail to be released
Abandoned, Public

Authored by mhorne on Apr 19 2020, 2:29 AM.
Tags
None

Details

Reviewers
None
Group Reviewers
riscv
Summary

We require that all CPUs started by cpu_mp_start() come online during
release_aps(), or we will face a fatal page fault later in boot. When
this error is encountered, panic immediately rather than waiting.
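The intended behaviour can be sketched in user space roughly as follows. This is an illustrative model only, not the patch under review: `struct ap_state`, `wait_for_aps()`, `release_aps_or_die()` and the two `poll_*` helpers are hypothetical names, and `abort()` stands in for the kernel's panic(9).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for kernel state; not the real FreeBSD symbols. */
struct ap_state {
	int expected;	/* boot CPU plus every AP that was started */
	int online;	/* CPUs that have checked in so far */
};

/* Test helpers: one polling interval in which an AP does or does not arrive. */
static void poll_one_arrives(struct ap_state *st) { st->online++; }
static void poll_none_arrive(struct ap_state *st) { (void)st; }

/*
 * Poll until every expected CPU has checked in or the budget of polling
 * rounds is exhausted. Returns true only if all CPUs arrived.
 */
static bool
wait_for_aps(struct ap_state *st, int max_rounds,
    void (*poll)(struct ap_state *))
{
	assert(st != NULL && poll != NULL);
	for (int i = 0; i < max_rounds; i++) {
		if (st->online >= st->expected)
			return (true);
		poll(st);
	}
	return (st->online >= st->expected);
}

/* Fail immediately on timeout instead of faulting later in boot. */
static void
release_aps_or_die(struct ap_state *st, int max_rounds,
    void (*poll)(struct ap_state *))
{
	if (!wait_for_aps(st, max_rounds, poll)) {
		fprintf(stderr, "panic: %d of %d CPUs failed to come online\n",
		    st->expected - st->online, st->expected);
		abort();	/* user-space substitute for panic(9) */
	}
}
```

The point of the sketch is only the ordering: the timeout is checked once, at release time, and a missing CPU is reported there rather than being discovered by the scheduler later.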

Diff Detail

Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 30605
Build 28345: arc lint + arc unit

Event Timeline

What is the reason for a fatal page fault when some CPUs are not fired?
Normally it should start with a single CPU in that case.

In D24499#538813, @br wrote:

What is the reason for a fatal page fault when some CPUs are not fired?
Normally it should start with a single CPU in that case.

I observed the page fault when the scheduler tried to run a thread on an absent CPU. I wrote this patch a couple of weeks ago, and @markj has since changed the behaviour in rS359280. Now, if we time out, the system will hang in smp_after_idle_runnable().

I think we should fix rS359280 instead, because the absence of a secondary CPU was not an issue before. If new hardware has a bug in that area (a hardware or firmware bug), I think we should still boot on a single CPU with an SMP kernel.
@markj what do you think?

In D24499#538963, @br wrote:

I think we should fix rS359280 instead, because the absence of a secondary CPU was not an issue before.

It must have been an issue before, since cpu_init_fdt() updates all_cpus regardless of whether the AP started successfully.

If new hardware has a bug in that area (a hardware or firmware bug), I think we should still boot on a single CPU with an SMP kernel.

In this case the user can set kern.smp.enabled=0 somehow, or the kernel can handle a known erratum by setting this tunable before attempting to start APs. If it is important to handle this without setting a tunable, cpu_init_fdt() should bail after a timeout and free bootstacks[cpu], so smp_after_idle_runnable() skips the failed CPU.
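The last suggestion could look roughly like the sketch below. It is a user-space model under stated assumptions, not kernel code: `try_start_ap()`, `count_waited_cpus()`, `SKETCH_MAXCPU` and the `ap_*` callbacks are hypothetical, `bootstacks` and `all_cpus_mask` only mimic the kernel variables named in this thread, and the loop in `count_waited_cpus()` assumes that smp_after_idle_runnable() only visits CPUs whose `bootstacks` entry is non-NULL, as the comment above implies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

#define	SKETCH_MAXCPU	4	/* hypothetical limit for this model */

/* Model of per-AP boot stacks and of the all_cpus set (bit 0 = boot CPU). */
static void *bootstacks[SKETCH_MAXCPU];
static unsigned int all_cpus_mask = 1u;

/* Test helpers modelling an AP that responds and one that never does. */
static bool ap_responds(int cpu) { (void)cpu; return (true); }
static bool ap_silent(int cpu) { (void)cpu; return (false); }

/*
 * Try to start one AP. The CPU is added to the set only after it has
 * actually come up; on timeout its boot stack is freed and the slot is
 * left NULL so that later code can skip it.
 */
static bool
try_start_ap(int cpu, bool (*came_up)(int), int max_polls)
{
	assert(cpu > 0 && cpu < SKETCH_MAXCPU);
	bootstacks[cpu] = malloc(4096);
	if (bootstacks[cpu] == NULL)
		return (false);
	for (int i = 0; i < max_polls; i++) {
		if (came_up(cpu)) {
			all_cpus_mask |= 1u << cpu;
			return (true);
		}
	}
	free(bootstacks[cpu]);		/* timed out: release the stack */
	bootstacks[cpu] = NULL;
	return (false);
}

/* Counts the CPUs a later pass would wait on; failed CPUs are skipped. */
static int
count_waited_cpus(void)
{
	int n = 0;

	for (int cpu = 1; cpu < SKETCH_MAXCPU; cpu++) {
		if (bootstacks[cpu] == NULL)
			continue;
		n++;
	}
	return (n);
}
```

In this model a CPU that never responds leaves no trace in either the CPU set or the boot-stack array, so nothing later in boot waits on it or schedules onto it.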

In D24499#538963, @br wrote:

I think we should fix rS359280 instead, because the absence of a secondary CPU was not an issue before.

It must have been an issue before, since cpu_init_fdt() updates all_cpus regardless of whether the AP started successfully.

Definitely. I encountered the panic months ago, so who knows when or where that regression crept in.

If new hardware has a bug in that area (a hardware or firmware bug), I think we should still boot on a single CPU with an SMP kernel.

In this case the user can set kern.smp.enabled=0 somehow, or the kernel can handle a known erratum by setting this tunable before attempting to start APs. If it is important to handle this without setting a tunable, cpu_init_fdt() should bail after a timeout and free bootstacks[cpu], so smp_after_idle_runnable() skips the failed CPU.

How do other archs handle this issue, with the tunable? IMO the timeout in cpu_init_fdt() isn't a bad idea: it allows us to exclude CPUs that don't start but still proceed with booting the system, similar to what I have in D24497. I think it's a reasonable assumption that all CPUs that were properly started should come up during release_aps(), and adding this panic ensures that.

Most arches don't handle it at all as far as I know. For example amd64 will panic if init_secondary() doesn't get executed after 5s. arm64 has some handling for the case where psci_cpu_on() fails but otherwise assumes that the AP will come up.

At present, we can do without this change.