Change Details

Optimize timeout code based on three observations. (1) The tr queues are sorted in order of submission, so the first one that could time out is the first "real" one on the list. (2) Timeouts for a given queue are all the same length (except at startup, where the timeout doesn't matter, and when you change it at runtime, where timeouts will still happen eventually and the difference isn't worth optimizing for). (3) Calling the ISR races the real ISR and we should do a better job of avoiding that.

So now, after checking to see if the card is there and working, the timeout routine scans the pending tracker list until it finds a non-AER tracker. If that transaction isn't marked, it will mark it and return. If it is marked, it knows that at least 1/2s has passed, which is "a really long time" in the NVMe world, so it will run the ISR to see if we've missed an interrupt (which happens sometimes due to bugs / new configurations). We'll then scan the list for any transactions whose deadline has passed. If we find any, we'll either abort the transaction (not enabled by default) or reset the controller (the default). We stop scanning when we find a pending transaction whose deadline hasn't passed yet, which we didn't do before.

This should move the timeout routine to touching hardware only when it's really necessary (the check that the card is present can't be avoided, and now we avoid racing the ISR except when there's trouble). This eliminates extra completion scanning except for real problems on almost all the hardware we have. It does occasionally race the hardware on one controller we have that sometimes has workload-induced latency issues (but even there we go from an extra call to the scanner twice a second to an extra couple of calls a day).

This also avoids weird latency spikes on at least one controller where the timeout would race the interrupt: the interrupt handler couldn't get the lock, so we'd wind up not running the completion scanner for that interrupt (nor in the other routine that had the lock). A prior commit fixed that by always locking in the ISR, and this change makes that lock much less contended. There was also some minor code motion to make all of the above flow more nicely for the reader.

Sponsored by: Netflix
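
The mark-then-check pass described in this summary can be sketched roughly as below. This is only an illustration under stated assumptions, not the driver's code: `struct tracker`, `struct queue`, `poll_completions()` and `timeout_pass()` are hypothetical names, the list is a plain singly linked list in submission order, and the completion scan is a stub that only counts calls (a real scan would also retire finished trackers).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the driver's structures. */
struct tracker {
	struct tracker *next;     /* next tracker, in submission order */
	bool            is_aer;   /* async event requests never time out */
	bool            marked;   /* set by the previous timeout pass */
	uint64_t        deadline; /* tick after which the command is late */
};

struct queue {
	struct tracker *head;     /* oldest outstanding tracker */
	unsigned        polls;    /* how often the completion scan ran */
};

enum timeout_action { TO_NONE, TO_MARKED, TO_RECOVER };

/* Stub for the completion scan; it only counts invocations here. */
static void
poll_completions(struct queue *q)
{
	q->polls++;
}

/* One pass of the periodic timeout routine for a single queue. */
static enum timeout_action
timeout_pass(struct queue *q, uint64_t now)
{
	struct tracker *tr;

	assert(q != NULL);

	/* Skip AER trackers to find the first "real" transaction. */
	for (tr = q->head; tr != NULL && tr->is_aer; tr = tr->next)
		;
	if (tr == NULL)
		return (TO_NONE);

	/* First sighting: mark it and leave the hardware alone. */
	if (!tr->marked) {
		tr->marked = true;
		return (TO_MARKED);
	}

	/* Seen on two passes in a row: check for a missed interrupt. */
	poll_completions(q);

	/* Submission order lets us stop at the first unexpired tracker. */
	for (; tr != NULL; tr = tr->next) {
		if (tr->is_aer)
			continue;
		if (now <= tr->deadline)
			break;
		return (TO_RECOVER);	/* abort or reset, per policy */
	}
	return (TO_NONE);
}
```

In this sketch the first pass over a fresh tracker returns `TO_MARKED` without calling the scan, so the extra completion scan runs only when the same tracker is still at the head of the list on the following pass.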

Optimize timeout code based on three observations. (1) The tr queues are sorted in order of submission, so the first one that could time out is the first "real" one on the list. (2) Timeouts for a given queue are all the same length (well, except at startup, where timeout doesn't matter, and when you change it at runtime, where timeouts will still happen eventually and the difference isn't worth optimizing for). (3) Calling the ISR races the real ISR and we should avoid that better.

So now, after checking to see if the card is there and working, the timeout routine scans the pending tracker list until it finds a non-AER tracker. If its deadline hasn't passed, we return, doing nothing further. Otherwise, we call poll completions and then process the list looking for timed out items.

This should move the timeout routine to touching hardware only when it's really necessary. It thus avoids racing the normal ISR, while still timing out stuck transactions quickly enough. There was also some minor code motion to make all of the above flow more nicely for the reader.

Sponsored by: Netflix

When interrupts aren't working at all, then this will increase latency somewhat. But when interrupts aren't working at all, there are bigger problems and we should poll quite often in that case. That will be handled in future commits.
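
A minimal C sketch of the deadline-based fast path in this version follows. It is an illustration only: `struct pending_tr`, `struct pending_queue`, `poll_queue()` and `check_timeouts()` are hypothetical names, the completion poll is a stub that only counts calls, and the handling of expired trackers is reduced to counting them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the driver's structures. */
struct pending_tr {
	struct pending_tr *next;     /* next tracker, in submission order */
	bool               is_aer;   /* async event requests never time out */
	uint64_t           deadline; /* tick after which the command is late */
};

struct pending_queue {
	struct pending_tr *head;     /* oldest outstanding tracker */
	unsigned           polls;    /* how often completions were polled */
};

/* Stub for the completion poll; it only counts invocations here. */
static void
poll_queue(struct pending_queue *q)
{
	q->polls++;
}

/* Returns the number of non-AER trackers whose deadline has passed. */
static unsigned
check_timeouts(struct pending_queue *q, uint64_t now)
{
	struct pending_tr *tr;
	unsigned expired = 0;

	assert(q != NULL);

	/* The first non-AER tracker is the oldest one that can time out. */
	for (tr = q->head; tr != NULL && tr->is_aer; tr = tr->next)
		;

	/* Fast path: nothing is late, so do not touch the hardware. */
	if (tr == NULL || now <= tr->deadline)
		return (0);

	/* Something looks late: poll first in case an interrupt was missed. */
	poll_queue(q);

	/* Then walk the rest of the list looking for timed out items. */
	for (; tr != NULL; tr = tr->next)
		if (!tr->is_aer && now > tr->deadline)
			expired++;
	return (expired);
}
```

Because the list is in submission order and timeouts on a queue share one length, checking only the first non-AER deadline is enough to decide whether the poll is needed at all.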