
nvme: Improve timeout action

Authored by imp on May 13 2024, 11:42 PM.



Today, we call the ISR process routine every time we hit the
timeout. This is wasteful and races the real ISR. It also can delay the
real ISR, resulting in more noise in the latency of requests than the
underlying hardware is delivering.

Instead, invent a soft deadline that starts at 1% into the actual
timeout: 300ms for I/Os and 600ms for admin transactions. The oldest
transaction is at the head of the queue, so if the first one has been
running longer than that, we start to call
_nvme_qpair_process_completions to see if we might be missing
interrupts. This almost always avoids racing the ISR, except when
something is getting stuck.

Sponsored by: Netflix

Test Plan

This is lightly tested on a bunch of good drives, and one cranky drive.

Sometimes I wonder why the I/O timeout is 30s and not like 3s.

We were trying to qualify PCIe Gen 4 NVMe drives and saw annoying latency issues with the hardware and (alas) the software. This seems to fix it, but I have only a few miles on this code. Enough in the normal path, I think, to open it up to review.

Diff Detail

rG FreeBSD src repository

Event Timeline

imp requested review of this revision.May 13 2024, 11:42 PM
imp added reviewers: chs, chuck, mav, jhb.

I see no problems, but I find it difficult to believe that a timeout handler running 1-2 times per second per queue pair can have any visible effect. I am also not happy to see a second place where timeouts are calculated. And 99/100 also looks quite arbitrary.

This revision is now accepted and ready to land.May 14 2024, 12:44 AM

Yeah, I didn't like it either.
I could move the timeouts into qpair; then there'd be no recomputing the values. I'd have to plumb the sysctl through too, but it would be cleaner. I could also put the soft timeout in there. Then it wouldn't be so arbitrary.

I had counters showing that we hit the race a few times an hour on each queue when the system was under load. It surprised me too; it seems like it should not have. So far this seems better, since we now call the ISR as a fallback almost never. I still need to run the detailed before/after latency tests to see the final effects, but preliminary data suggests it's quite good. I do agree the next layer of detail would help explain why I'm seeing improved results.

There's a missing bit here.
We need to also lock the ISR rather than try-lock it. Since we hold the lock only for a brief period of time, this should be OK.
There's a rare case where we decide to call the timeout ISR just as we get an interrupt (which is extremely unlikely), and it's OK
if the ISR is blocked for the time it takes to finish the completion processing.

I'll update once I've had a chance to test more.


I think I'll move this into a function to abstract it.
The NVMe spec doesn't have a 'typical I/O time' or a 'max I/O time' except when the predictable latency feature is enabled (not widely implemented, at least in the drives we have).
This is likely going to be arbitrary. We could make it less arbitrary by keeping a maximum latency we observe, but I'm hesitant to do that for all I/Os. I need to think about that aspect.

After some reconsideration, I think there is a better approach to this same issue.