Calling the completion routine from the timeout was a good idea for the first
generation of missing interrupt problems that we had. However, since we've
solved the real root cause for them, its need is much lessened. In addition,
it's always been racy, but as a 'last resort' it wasn't too bad.

However, under heavy load on some cloud platforms, we see missing interrupts
for reasons unknown. In that environment, something the timeout was doing
caused the interrupt to fire, setting up a race between the timeout and
interrupt threads. Fixing this race would require an additional lock / unlock
pair in the hot path.

To avoid that, we revamp how we do recovery. If the controller has failed, we
still do what we did before: we reset the controller. This also picks up the
hotplug case.

However, for the first timeout, we now send an innocuous command to the card (a
GET_FEATURES for a required feature) and start a new timeout on the original
request, but do nothing further. If the GET_FEATURES completes, we take no
further action, as the card is unwedged. If sending that command causes a
timeout, we reset the card. If the original command doesn't get a completion
before its new timeout expires, we either abort the command (if
enable_abort = 1) or reset the card. If sending the abort times out, we reset
the card.
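
The recovery sequence described above can be sketched as a small decision
function. This is only an illustration of the policy, not the driver code:
every identifier here (recovery_ctx, timeout_event, recovery_action,
recovery_next_action) is hypothetical, and the real change is driven from the
driver's timeout and completion paths rather than from a single function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* All names below are hypothetical; they only model the policy in the text. */
enum recovery_action {
	ACT_NONE,        /* card is unwedged, nothing further to do */
	ACT_SEND_PROBE,  /* send GET_FEATURES, re-arm timeout on the request */
	ACT_SEND_ABORT,  /* abort the original command */
	ACT_RESET        /* reset the controller */
};

enum timeout_event {
	EV_FIRST_TIMEOUT,   /* original request timed out for the first time */
	EV_PROBE_DONE,      /* the GET_FEATURES probe completed */
	EV_PROBE_TIMEOUT,   /* the GET_FEATURES probe itself timed out */
	EV_SECOND_TIMEOUT,  /* original request timed out again after re-arm */
	EV_ABORT_TIMEOUT    /* the abort command timed out */
};

struct recovery_ctx {
	bool controller_failed;  /* controller reports failure (or was unplugged) */
	bool enable_abort;       /* models the enable_abort = 1 setting */
};

enum recovery_action
recovery_next_action(const struct recovery_ctx *ctx, enum timeout_event ev)
{
	assert(ctx != NULL);

	/* A failed controller is always reset, as before the change. */
	if (ctx->controller_failed)
		return (ACT_RESET);

	switch (ev) {
	case EV_FIRST_TIMEOUT:
		return (ACT_SEND_PROBE);
	case EV_PROBE_DONE:
		return (ACT_NONE);
	case EV_PROBE_TIMEOUT:
		return (ACT_RESET);
	case EV_SECOND_TIMEOUT:
		return (ctx->enable_abort ? ACT_SEND_ABORT : ACT_RESET);
	case EV_ABORT_TIMEOUT:
		return (ACT_RESET);
	}
	return (ACT_RESET);  /* unknown event: fall back to the safest action */
}
```

Note that no branch calls the completion routine from the timeout, which is
what lets the hot path avoid the extra lock / unlock pair.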

This should be more robust and free of races. It remains to be seen if it
actually works.