
nvme: Improve timeout action
Accepted · Public

Authored by imp on Mon, May 13, 11:42 PM.

Details

Reviewers
chs
chuck
mav
jhb
Summary

Today, we call the ISR's processing routine every time the timeout handler
runs. This is wasteful and races the real ISR. It can also delay the real
ISR, adding more noise to request latency than the underlying hardware
actually delivers.

Instead, introduce a soft deadline at 1% of the actual timeout for I/O and
admin transactions: 300ms and 600ms, respectively. The oldest transaction is
at the head of the queue, so if it has been outstanding longer than that, we
start calling _nvme_qpair_process_completions to see whether we might be
missing interrupts. This almost always avoids racing the ISR, except when
something is actually getting stuck.

Sponsored by: Netflix
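
For concreteness, here is a minimal, self-contained C sketch of the soft-deadline check described above. The structure and function names (struct qpair, struct tracker, should_poll_completions) are illustrative assumptions, not the driver's actual code; only the 1%-of-timeout rule comes from the change itself.

/*
 * Hypothetical, self-contained model of the soft-deadline idea; names and
 * fields are illustrative, not the driver's.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NVME_IO_TIMEOUT_MS    30000   /* 30 s I/O timeout (default) */
#define NVME_ADMIN_TIMEOUT_MS 60000   /* 60 s admin timeout (default) */

struct tracker {
	uint64_t submit_ms;           /* when the oldest request was queued */
};

struct qpair {
	bool is_admin;
	struct tracker *oldest;       /* head of the outstanding-request queue */
};

/*
 * Called from the periodic timeout handler.  Only poll for completions (the
 * "call the ISR" fallback) once the oldest outstanding request has aged past
 * 1% of its hard timeout: 300 ms for I/O, 600 ms for admin.  Until then,
 * trust the interrupt path and stay out of its way.
 */
static bool
should_poll_completions(const struct qpair *qp, uint64_t now_ms)
{
	uint64_t hard_ms, soft_ms;

	if (qp->oldest == NULL)
		return (false);       /* nothing outstanding */

	hard_ms = qp->is_admin ? NVME_ADMIN_TIMEOUT_MS : NVME_IO_TIMEOUT_MS;
	soft_ms = hard_ms / 100;      /* 1% soft deadline */

	return (now_ms - qp->oldest->submit_ms >= soft_ms);
}

int
main(void)
{
	struct tracker t = { .submit_ms = 0 };
	struct qpair io_qp = { .is_admin = false, .oldest = &t };

	printf("poll at 100 ms: %d\n", should_poll_completions(&io_qp, 100)); /* 0 */
	printf("poll at 400 ms: %d\n", should_poll_completions(&io_qp, 400)); /* 1 */
	return (0);
}

Built as a plain userland program, this prints 0 at 100ms and 1 at 400ms for an I/O queue, matching the 300ms soft deadline above.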

Test Plan

This is lightly tested on a bunch of good drives and one cranky drive.

Sometimes I wonder why the I/O timeout is 30s and not something like 3s.

We were bringing up PCIe Gen 4 NVMe drives and saw annoying latency issues with the hardware and (alas) the software. This seems to fix it, but I have only a few miles on this code... enough, I think, in the normal path to open it up for review.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped
Build Status
Buildable 57698
Build 54586: arc lint + arc unit

Event Timeline

imp requested review of this revision. Mon, May 13, 11:42 PM
imp added reviewers: chs, chuck, mav, jhb.

I see no problems, but I have difficulty believing that timeout handlers running 1-2 times per second per queue pair can have any visible effect. I'm also not happy to see a second place where the timeouts are calculated. And 99/100 also looks quite arbitrary.

This revision is now accepted and ready to land. Tue, May 14, 12:44 AM

Yeah, I didn't like it either.
I could move the timeouts into the qpair; then there'd be no more computing the values in a second place... I'd have to plumb the sysctl through too... but it would be cleaner. I could also put the soft timeout in there, so it wouldn't be so arbitrary.
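
For illustration, a minimal sketch of what caching the timeouts in the queue pair might look like, so the hard timeout and the soft deadline are derived in one place. The field and function names here (qpair_init_timeouts, soft_deadline_ms) are hypothetical, not the driver's.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct qpair {
	bool is_admin;
	uint32_t timeout_ms;        /* hard timeout for this queue's commands */
	uint32_t soft_deadline_ms;  /* 1% of the hard timeout, derived once here */
};

/* Compute both values in one place when the queue pair is set up. */
static void
qpair_init_timeouts(struct qpair *qp, uint32_t io_timeout_ms,
    uint32_t admin_timeout_ms)
{
	qp->timeout_ms = qp->is_admin ? admin_timeout_ms : io_timeout_ms;
	qp->soft_deadline_ms = qp->timeout_ms / 100;
}

int
main(void)
{
	struct qpair adminq = { .is_admin = true };
	struct qpair ioq = { .is_admin = false };

	/* 30 s and 60 s defaults, as if read from the existing sysctls. */
	qpair_init_timeouts(&adminq, 30000, 60000);
	qpair_init_timeouts(&ioq, 30000, 60000);
	printf("admin soft deadline: %u ms\n", adminq.soft_deadline_ms); /* 600 */
	printf("io soft deadline:    %u ms\n", ioq.soft_deadline_ms);    /* 300 */
	return (0);
}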

I had counters showing that we hit the race a few times an hour on each queue when the system was under load. That surprised me too; it seems like it should not have. So far this seems better, since we now almost never call the ISR as a fallback. I still need to run the detailed before/after latency tests to see the final effects, but preliminary data suggests it's quite good. I do agree, though, that the next layer of detail would help explain why I'm seeing improved results.