Page MenuHomeFreeBSD

defer acpi_ged() interrupt setup to late in boot on some platforms
AbandonedPublic

Authored by gallatin on Oct 9 2023, 9:45 PM.
Tags
None
Referenced Files
Unknown Object (File)
Oct 3 2024, 2:40 PM
Unknown Object (File)
Oct 2 2024, 8:28 PM
Unknown Object (File)
Oct 2 2024, 8:28 PM
Unknown Object (File)
Oct 2 2024, 8:13 PM
Unknown Object (File)
Oct 2 2024, 6:54 AM
Unknown Object (File)
Oct 1 2024, 12:17 AM
Unknown Object (File)
Sep 30 2024, 10:45 PM
Unknown Object (File)
Sep 30 2024, 10:11 AM
Subscribers

Details

Summary

We have an arm64 server that encounters an ACPI GED error at boot. This causes the system to livelock and hang at boot, spewing "AcpiOsExecute: failed to enqueue task, consider increasing the debug.acpi.max_tasks tunable".

This is happening because on arm64, we do not have EARLY_AP_STARTUP, so we are running on a single core when the driver attaches. The interrupt is level triggered, and is unmasked in the interrupt controller when the interrupt is setup. After the interrupt is setup, we live-lock in an interrupt storm because the handler (which is run as part of AcpiEvaluateObject()) is in charge of ack'ing the interrupt, but is not run until after the taskqueue is scheduled. So it is never run.

The ideal solution would be to simply call AcpiEvaluateObject() directly. However, @jhb tells me that this is not safe in an interrupt context, as some GED events may sleep.

The workaround he helped me come up with is to defer the interrupt setup until after we have setup SMP.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Skipped
Unit
Tests Skipped
Build Status
Buildable 53908

Event Timeline

gallatin created this revision.

Do we know what types of GED events might sleep? I'm not sure this will work when there's only a single CPU as it will be identical to the pre-AP startup case.

Do we know what types of GED events might sleep? I'm not sure this will work when there's only a single CPU as it will be identical to the pre-AP startup case.

Yes, this is my fear as well.

In fact, not doing it directly leads to duplication of events. Eg, we see 3 APEI corrected errors with the same uuid when doing it this way, vs a single one with my smaller patch that just directly calls AcpiEvaluateObject() from acpi_ged_intr()

The duplicate events would be fixed by my other suggestion of using a dedicated struct task instead of calling AcpiOsExecute which allocates and schedules a new struct task each time.

My worry is that GED is too general a thing. It means "go run some random firmware-provided bit of AML that can do God knows what" when an interrupt occurs.

After digging in the spec for a bit, the description there for GED (5.6.9) is not very clear. There is one requirement for event handling in general for interrupts by the OS (OSPM) is to leave the interrupt source disabled (e.g. the GIC pin masked) until the ACPI control method has been executed (section 5.6.4 that talks about Generic Event Handling). The current acpi_ged driver definitely doesn't do that, and our interrupt model doesn't have a good way to cope with that (we re-enable the GIC pin after the ithread handler completes). We might just have to run the method synchronously and hope for the best. The spec doesn't mandate that these handlers are safe, but it does suggest that they should invoke Notify() from the AML for non-trivial event reporting, so it may be that these are safe. The _EVT handler is required by the spec to clear the interrupt so it doesn't keep firing.

In D42141#961666, @jhb wrote:

The duplicate events would be fixed by my other suggestion of using a dedicated struct task instead of calling AcpiOsExecute which allocates and schedules a new struct task each time.

My worry is that GED is too general a thing. It means "go run some random firmware-provided bit of AML that can do God knows what" when an interrupt occurs.

After digging in the spec for a bit, the description there for GED (5.6.9) is not very clear. There is one requirement for event handling in general for interrupts by the OS (OSPM) is to leave the interrupt source disabled (e.g. the GIC pin masked) until the ACPI control method has been executed (section 5.6.4 that talks about Generic Event Handling). The current acpi_ged driver definitely doesn't do that, and our interrupt model doesn't have a good way to cope with that (we re-enable the GIC pin after the ithread handler completes). We might just have to run the method synchronously and hope for the best. The spec doesn't mandate that these handlers are safe, but it does suggest that they should invoke Notify() from the AML for non-trivial event reporting, so it may be that these are safe. The _EVT handler is required by the spec to clear the interrupt so it doesn't keep firing.

I created https://reviews.freebsd.org/D42158 for handling the interrupt directly. If you agree with that approach, I'll abandon this change. Thank you!

I'm abandoning this in favor of handling events directly (https://reviews.freebsd.org/D42158)