Page MenuHomeFreeBSD

hdac: Handle interrupts racing with device suspend
ClosedPublic

Authored by markj on Jan 31 2022, 9:00 PM.
Tags
None
Referenced Files
Unknown Object (File)
Fri, Nov 29, 10:16 PM
Unknown Object (File)
Tue, Nov 19, 2:43 AM
Unknown Object (File)
Mon, Nov 18, 11:33 PM
Unknown Object (File)
Wed, Nov 13, 6:30 PM
Unknown Object (File)
Oct 25 2024, 7:11 PM
Unknown Object (File)
Oct 12 2024, 2:36 PM
Unknown Object (File)
Oct 12 2024, 2:36 PM
Unknown Object (File)
Oct 1 2024, 3:42 AM
Subscribers

Details

Summary

A couple of problems were causing occasional hangs during suspend to S3
on my laptop. The hang is caused by the while loop in hdac_one_intr(),
which loops forever if the RIRBSTS read returns all ones following entry
to the D3 state.

First, hdac_intr_handler() loads the interrupts status register before
locking the softc. This can race with hdac_suspend()->hdac_reset() such
that the reset occurs between a load of the status register and the call
to hdac_one_intr(). In this case, the load from RIRBSTS returns 0xff
and the following loop never terminates.

Second, if the interrupt handler is running after BUS_SUSPEND_CHILD()
powers down the device (e.g., because it was blocking on the softc
mutex), reading INTSTS can return all ones. Explicitly check for that
and simply bail in that case. I note that the Linux HDA driver does
this as well.

I'm not sure if this is the right solution, but it's simple and seems to
be robust in my testing. In particular, I don't see any interfaces that
allow one to "drain" a suspended interrupt handler.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

markj requested review of this revision.Jan 31 2022, 9:00 PM

@markj , hmm, maybe 82a5a2752724 should be extended to non-shared interrupts to prevent races between an interrupt handler and a suspend method?

In D34117#771550, @avg wrote:

@markj , hmm, maybe 82a5a2752724 should be extended to non-shared interrupts to prevent races between an interrupt handler and a suspend method?

Ahh, I didn't realize that intr_handler_barrier() blocks until the handler finishes running, so that might be the primitive I want instead.

Would the idea then be to call BUS_SUSPEND_INTR() for all resources of type SYS_RES_IRQ? Does RID 0 correspond to a legacy interrupt line for PCI devices?

Yes, rid 0 is for legacy PCI interrupts and rids >= 1 are for MSI / MSI-X.
So, indeed, a proper barrier would require iterating over all active interrupt resources.

Not sure if that's the best approach, but discovering / fixing drivers one by one does not sound better.

In D34117#771564, @avg wrote:

Yes, rid 0 is for legacy PCI interrupts and rids >= 1 are for MSI / MSI-X.
So, indeed, a proper barrier would require iterating over all active interrupt resources.

Ok, I'll try this.

Not sure if that's the best approach, but discovering / fixing drivers one by one does not sound better.

I still tend to think it's better to avoid the unlocked intsts read in hdac_intr_handler().

I still tend to think it's better to avoid the unlocked intsts read in hdac_intr_handler().

Probably.
Perhaps the thinking was that nothing else modifies HDAC_INTSTS, so it should be safe to read it outside the lock and, thus, reduce the lock hold time.

In D34117#771564, @avg wrote:

Yes, rid 0 is for legacy PCI interrupts and rids >= 1 are for MSI / MSI-X.
So, indeed, a proper barrier would require iterating over all active interrupt resources.

So this fails for at least nvme: pci_resume_child() un-suspends interrupts after resuming devices, but nvme_pci_resume() depends on having working admin interrupts, in nvme_ctrlr_identify(). It's not obvious to me exactly why we need to resume the driver before enabling interrupts. Reording fixes the problem on my laptop at least.

In D34117#771564, @avg wrote:

Yes, rid 0 is for legacy PCI interrupts and rids >= 1 are for MSI / MSI-X.
So, indeed, a proper barrier would require iterating over all active interrupt resources.

So this fails for at least nvme: pci_resume_child() un-suspends interrupts after resuming devices, but nvme_pci_resume() depends on having working admin interrupts, in nvme_ctrlr_identify(). It's not obvious to me exactly why we need to resume the driver before enabling interrupts. Reording fixes the problem on my laptop at least.

For reference, the change I'm testing is here: https://reviews.freebsd.org/D34121

So this fails for at least nvme: pci_resume_child() un-suspends interrupts after resuming devices, but nvme_pci_resume() depends on having working admin interrupts, in nvme_ctrlr_identify(). It's not obvious to me exactly why we need to resume the driver before enabling interrupts. Reording fixes the problem on my laptop at least.

My thinking was that we should have a symmetry with suspend: if somehow an interrupt happens before a device is fully resumed, it should not be delivered.
Thinking about it now, that actually seems to make sense for shared PCI interrupts: the interrupt could be for an already resumed device and a resuming device might not be ready to handle it just yet.
But it does not make sense for exclusive interrupts: how the hardware can generate an interrupt if it's not sufficiently resumed yet?

Looks like this is more complicated than I hoped.
Some finer grained control is needed.

But it does not make sense for exclusive interrupts: how the hardware can generate an interrupt if it's not sufficiently resumed yet?

You assume a binary condition in the resume function. Maybe the hardware needs interrupts to properly resume the device after restoring a few registers?
We already restore the power state before calling resume (though we do remove the power after suspend returns, if any).

If you get interrupts during suspend, you'll have to sort those out (just like you would a network driver getting a packet). I don't think that turning off interrupts is going to be a good time for many drivers that may want to use the interrupts during suspend (to maybe send a command or two to flush the state, etc). I think it's a really bad idea to have the pci bus.

In D34117#771775, @imp wrote:

But it does not make sense for exclusive interrupts: how the hardware can generate an interrupt if it's not sufficiently resumed yet?

You assume a binary condition in the resume function. Maybe the hardware needs interrupts to properly resume the device after restoring a few registers?

I do not assume that, no.
But the binary model would be easier to handle at a central or a bus level.
Otherwise we need a way for a driver to explicitly declare that it's ready for interrupts.

In D34117#771776, @avg wrote:
In D34117#771775, @imp wrote:

But it does not make sense for exclusive interrupts: how the hardware can generate an interrupt if it's not sufficiently resumed yet?

You assume a binary condition in the resume function. Maybe the hardware needs interrupts to properly resume the device after restoring a few registers?

I do not assume that, no.
But the binary model would be easier to handle at a central or a bus level.
Otherwise we need a way for a driver to explicitly declare that it's ready for interrupts.

The nvme driver is always ready for interrupts, and any suspend/resume races with interrupts are a bug in the nvme driver.
Why isn't that a viable model for other drivers?

For shared pci drivers, they are level interrupts: they are asserted until the condition goes away. For drivers that are sharing those interrupts, it may make sense to suppress the interrupts until all the devices that have drivers attached are resumed. For PCIe, though, they are a one shot that's easily cleared. The interrupts shouldn't start firing after resume until a mask register is properly set , which should be done with proper locking in the resume routine.

For suspend, the usual driver is going to want to do things to the device and then shut it down. Interrupts are possible here to clock along commands that may need to be sent to the drive (nvme does this to properly shutdown the nvme device by executing commands that delete all the I/O queues, resume does a full bring up of the device, including creating those queues). If interrupts were disallowed, we'd need two different code paths for attach and suspend/resume which is, imho, even worse.

@imp , I am not sure if we are disagreeing about anything.
I just explained why the current code the way it is (for shared interrupts).
And stated that the non-shared interrupts case is different.

I am not sure what to do here. Yes, this change is a bit inelegant, but it fixes a known bug affecting multiple users and laptops, and would be nice to have in 13.1. Some general infrastructure to suspend/resume interrupts at the appropriate per-driver point might be useful but I'm not aware of any other instances of the same problem, and I do not want to generalize without one or two more examples of the same problem in other drivers.

I don't see a problem from this patch, if it helps (assuming HDAC_INTSTS still returns sane zero when HDAC_RIRBSTS turns mentioned 0xff, otherwise I am not sure what it fixes). May be I am missing some other motivation Andriy had when written this part, but if we assume that all/most modern hardware use non-shared MSI interrupts, there should be almost no cases when interrupt got delivered but HDAC_INTSTS is still zero, and so taking lock before the read should only slightly increase lock hold time for two register reads. We pay that price in much more busy drivers than sound.

This revision is now accepted and ready to land.Mar 15 2022, 8:21 PM

This change looks good to me as well. My comments about bus level stuff lead me off into the weeds a bit.
We likely need to enhance newbus, however, to cope with the card being missing (the 0xfffffff check is for
that) or otherwise in a state that the driver can't interact with the card (like with a DPC initiated renegotiation)
but those are big ticket items that are hard. This looks like a good way to not loop forever and to not race
the rest of the driver (and to cope with things going away with an interrupt service routine still active and
an interrupt coming in).

the card being missing (the 0xfffffff check is for that)

If/when newbus is enhanced to handle devices being missing would the 0xfffffff here then be redundant / useless? If so is it worth a comment here?

I've somehow missed intsts != 0xffffffff part when I looked on this patch, I though it is about locking. Is definitely explains why it helps. On downside I can only think that in theory there can be a controller with all 30 streams supported and all requiring attention same time. Though that may be quite unrealistic.

the card being missing (the 0xfffffff check is for that)

If/when newbus is enhanced to handle devices being missing would the 0xfffffff here then be redundant / useless? If so is it worth a comment here?

That's unclear. This check would become something else since you have to handle the case where the card becomes unavailable while in the ISR.
I think that we have a full tree audit ahead of us anyway, so I'm not sure a comment would help for sure.
The locking stuff likely doesn't matter for making this case better, though it might.
I agree with mav@ that even on a loaded system, chances are good it wouldn't matter (or at least to the degree that we'd want to see data from a loaded system before we'd optimize things).

In D34117#783461, @imp wrote:

the card being missing (the 0xfffffff check is for that)

If/when newbus is enhanced to handle devices being missing would the 0xfffffff here then be redundant / useless? If so is it worth a comment here?

That's unclear. This check would become something else since you have to handle the case where the card becomes unavailable while in the ISR.
I think that we have a full tree audit ahead of us anyway, so I'm not sure a comment would help for sure.
The locking stuff likely doesn't matter for making this case better, though it might.

Without the locking change, we would also have to check for an all-ones read from the RIRB status, in hdac_one_intr().

I agree with mav@ that even on a loaded system, chances are good it wouldn't matter (or at least to the degree that we'd want to see data from a loaded system before we'd optimize things).