Page MenuHomeFreeBSD

nvme: Use shared timeout rather than timeout per transaction
ClosedPublic

Authored by imp on Feb 11 2021, 1:02 AM.
Tags
None
Referenced Files
Unknown Object (File)
Fri, Apr 12, 1:24 PM
Unknown Object (File)
Wed, Apr 10, 10:37 PM
Unknown Object (File)
Wed, Apr 10, 10:37 PM
Unknown Object (File)
Wed, Apr 10, 10:37 PM
Unknown Object (File)
Wed, Apr 10, 10:37 PM
Unknown Object (File)
Wed, Apr 10, 10:21 PM
Unknown Object (File)
Wed, Apr 10, 10:21 PM
Unknown Object (File)
Wed, Apr 10, 10:21 PM
Subscribers

Details

Summary

Keep track of the approximate time commands are 'due' and the next
deadline for a command. twice a second, wake up to see if any commands
have entered timeout. If so, quiessce and then enter a recovery mode
half the timeout further in the future to allow the ISR to
complete. Once we exit recovery mode, send a command with no negative
effect to the drive to ensure that any commands that completed after
we last poll are handled in a timely manner.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

imp requested review of this revision.Feb 11 2021, 1:02 AM
sys/dev/nvme/nvme_private.h
317

"dying" makes sense, but perhaps the intent would be clearer if the variable name more close matched the function name. is_destruct would be the most on point, but is_destructing or is_destroying might work as well.

sys/dev/nvme/nvme_qpair.c
542

Perhaps update comment to "qpair is recovering, likely because a controller reset is in"

1035

Is there a scenario in which deadline isn't initialized? For example, if the timer is armed and the current request is an AER (i.e. req->timeout == false). Will a second AER set qpair->deadline to whatever value is in the previous AER's timeout? If so, does it matter?

sys/dev/nvme/nvme_private.h
317

Dying is used elsewhere in the kernel for this purpose, but I'd be interested to hear what others have to say.

sys/dev/nvme/nvme_qpair.c
542

Good point

1035

in practice, this hasn't been an issues, but the intent here is that we want to set the deadline to the oldest item still in the outstanding request list with a timeout.

In debug writes I've seen this scenario printed, but didn't understand it. I'd see the timeout flip between SBT_MAX and sane values on a time scale of a few hours. And this scenario (or a similar one) would explain it. We recover fairly quickly, but that's in a Netflix workload where we have ~1crapton of outstanding I/O at any time and we seem to rotate through them quickly...

So this is a very good question.

I like the general direction. I also was thinking about something like this to avoid callout operations on every I/O.

I haven't looked very close, but have some comments inline. Plus I am not sure we really need deadline variable. I think we may just traverse through the outstanding_tr queue and stop when timeout of some command is still ahead. In most cases it would mean just looking on the first tracker. I don't think we should really look deeper, since I am not sure it is specified anywhere that controller should fetch and start processing all command in parallel. And if it doesn't and commands are dependent, then we probably should not time out following command before the earlier one(s). Plus it makes the code more simple. ;)

sys/dev/nvme/nvme_qpair.c
1014

There is callout_schedule() for rescheduling with less arguments.

1036

Why to have variable if it is always the same?

1042

Should we really postulate that we have no different timeouts? We already have two actually: normal and 1s for polling.

Update with comments

  • remove trying to cache the right deadline (just walk through)
  • allow reset to confirm if there's really hotplug event

Include more feedback from mav@

imp marked 2 inline comments as done.Mar 12 2021, 2:08 AM
imp added inline comments.
sys/dev/nvme/nvme_qpair.c
1035

btw, good eye. this was the source of the infinite reset bug we saw in testing as soon as one of the AER's fired.

1042

Yea, the polling is set to he 1/2 of the shortest reasonable timeout. We have user specified, 1s and 30s timeouts. So I selected hz / 2...

imp marked an inline comment as done.Mar 13 2021, 5:16 PM
imp added inline comments.
sys/dev/nvme/nvme_qpair.c
1042

Originally, I'd wanted to have a single timeout that was half of the longest scheduled timeout. But having it fixed at 1/2s was easier to code. I'm not entirely sure the benefit from longer polling periods, but I'll keep that in mind if I start to see contention introduced by the 1/2s timeout.

1255

Now that we are manipulating the tr entries here w/o a lock and when we arrive here via timeout, we could be racing the submission process which adjusts the tr w/o the qpair lock and other threads completing other trs which remove things from the list (while we won't crash, we may stop prematurely and/or try to concurrently fail an item that's otherwise completing). Solving these races requires some careful thought.

I've had one crash in the three weeks we've been running these patches fleet wide at netflix that points to this issue.

sys/dev/nvme/nvme_private.h
180

This seems to be unused now.

sys/dev/nvme/nvme_qpair.c
978

If noting is complete, why is it a missed interrupt?

1144

What's the point of this condition? I suppose the goal of below assignment is to avoid false reaction on tracker that still inside busdma, that may take some time, but then the assignment should also be done.

sys/dev/nvme/nvme_private.h
180

It is

sys/dev/nvme/nvme_qpair.c
978

If we find work done in the timeout that we should have found in the ISR, then that's a classic symptom of a missed interrupt. We should have had an interrupt to notify us. Instead we timed out and found work.... normally we'd expect to find things are stuck and have to cancel / reset to fix. Though to be pedantic, it's only a missed interrupt if the completions we find are past their deadline and not some other work.

1144

I think you may be right this is unwise here. I need to walk through the race I found in this code to see if this even makes sense anymore

Rebase and update to latest netflix code

imp marked 3 inline comments as done.Sep 17 2021, 10:20 PM

AH, one more updated need that I overlooked.

sys/dev/nvme/nvme_private.h
180

OK. This has been updated to use it.

sys/dev/nvme/nvme_qpair.c
978

I'll reword. I don't think what I said here is quite right in hindsight.

Update printfs to be better (per mav@'s critique)
Remove comment that I don't think is true anymore (but was true in earlier code)

This revision was not accepted when it landed; it landed in state Needs Review.Sep 24 2021, 12:15 AM
This revision was automatically updated to reflect the committed changes.