Optimize timeout code based on three observations.
(1) The tr queues are sorted in order of submission, so the first one
that could time out is the first "real" one on the list.
(2) Timeouts for a given queue are all the same length (well, except
at startup, where timeout doesn't matter, and when you change it
at runtime, where timeouts will still happen eventually and the
difference isn't worth optimizing for).
(3) Calling the ISR races the real ISR and we should avoid that better.
So now, after checking to see if the card is there and working, the
timeout routine scans the pending tracker list until it finds a non-AER
tracker. If that transaction hasn't been marked, it will mark it and
return. If it is marked, it knows that at least 1/2s has passed, which is
"a really long time" in the NVMe world, so it will run the ISR to see if
we've missed an interrupt (which happens sometimes due to bugs / new
configurations). We'll then scan the list for any transactions whose
deadline has passed. If we find them, we'll either abort the transaction
(not enabled by default) or we'll reset the controller (the default). We
stop scanning when we find a pending transaction whose deadline hasn't
passed yet, which we didn't use to do.
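As a rough illustration only (not the actual driver code), the flow
described above can be sketched in C as follows. The struct, its fields,
the enum and the function name are all hypothetical, and the completion
scan is modeled as a simple counter rather than a real hardware poll.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified tracker; not the driver's real structure. */
struct tracker {
	struct tracker *next;	/* next tracker, in submission order */
	bool is_aer;		/* async event request: has no deadline */
	bool marked;		/* already seen by an earlier timeout tick */
	uint64_t deadline;	/* absolute time after which it timed out */
};

enum timeout_action { TO_NOTHING, TO_ABORT, TO_RESET };

/*
 * One timeout tick over a queue's pending list (oldest first).
 * *poll_calls counts how often the completion scanner would have run;
 * the sketch assumes the list itself is not changed by that scan.
 */
static enum timeout_action
timeout_tick(struct tracker *head, uint64_t now, bool enable_aborts,
    int *poll_calls)
{
	struct tracker *tr;
	enum timeout_action action = TO_NOTHING;

	assert(poll_calls != NULL);

	/* Find the first "real" (non-AER) tracker on the list. */
	for (tr = head; tr != NULL && tr->is_aer; tr = tr->next)
		continue;
	if (tr == NULL)
		return (TO_NOTHING);

	/* First sighting: mark it and return without touching hardware. */
	if (!tr->marked) {
		tr->marked = true;
		return (TO_NOTHING);
	}

	/* Still pending a tick later: look for a missed interrupt. */
	(*poll_calls)++;

	/* Act on expired trackers; stop at the first unexpired one. */
	for (tr = head; tr != NULL; tr = tr->next) {
		if (tr->is_aer)
			continue;
		if (now <= tr->deadline)
			break;
		action = enable_aborts ? TO_ABORT : TO_RESET;
	}
	return (action);
}
```

In this model the first tick that sees an unmarked tracker does no
polling at all, so the completion scanner only runs for a transaction
that has been pending across two consecutive ticks.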
This should move the timeout routine to touching hardware only when it's
really necessary (the card-is-there status check can't be avoided, and now we
avoid racing the ISR all the time except when there's trouble). This
eliminates extra completion scanning except for real problems on almost
all the hardware we have, and does occasionally race hardware on one
controller we have that sometimes has workload induced latency issues
(but even there we go from an extra call to the scanner twice a second
to an extra couple of calls a day). This also avoids weird latency spikes
on at least one controller where the timeout would race the interrupt,
the interrupt handler couldn't get the lock and so we'd wind up not
running the completion scanner for that interrupt (nor in the other
routine that had the lock). A prior commit fixed that by always locking
in the ISR, and this change makes that lock much less contended. The
timeout routine thus avoids racing the normal ISR, while still timing out
stuck transactions quickly enough.
There was also some minor code motion to make all of the above flow more
nicely for the reader.
When interrupts aren't working at all, this will increase latency
somewhat. But when interrupts aren't working at all, there are bigger
problems and we should poll quite often in that case. That will be
handled in future commits.
Sponsored by: Netflix