nvme: Give reset a chance to undo failure
Needs ReviewPublic

Authored by imp on Mar 1 2024, 9:49 PM.
Tags
None
Referenced Files
F105579199: D44180.diff
Tue, Dec 17, 9:42 PM
Details

Reviewers
mav
chuck
chs
Summary

There are times when we may fail a drive because it stops responding,
but are nevertheless able to reset the controller and bring it back
online. While a reset won't always fix the problem, certain controllers
have been observed to enter a state where they stop replying so badly
that we fail them, only to have them recover later with a reset
(sometimes with manual intervention prior to the reset to send a
vendor-specific FTL reset command). Allowing reset to be tried in these
cases allows us to avoid a reboot.

Sponsored by: Netflix

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped
Build Status
Buildable 56389
Build 53277: arc lint + arc unit

Event Timeline

On failure we've already notified consumers that the controller has failed. What will report that it is back? And is there even a device to send the request IOCTL to?

In D44180#1008994, @mav wrote:

On failure we've already notified consumers that the controller has failed. What will report that it is back? And is there even a device to send the request IOCTL to?

Yea. I also hit this... And I'll need to rework more. There is a device passed into the ioctl, implicitly, otherwise we wouldn't have ctrlr...

mav@ is correct.

We need to do more here. If we were failed, we need to try the reset to see if that gets us out of the failed state. And if it does, we need to call the new controller notification to build back up all the downstream consumers that we've torn down. I'll do that as a separate review though once I get it tested out. The drives that are going wonky take between 5 minutes and 2 weeks to trigger...
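
For illustration only, the flow described in the comment above could be modeled roughly as follows. This is a user-space sketch, not the driver code and not the content of this diff: `struct ctrlr_model`, `model_hw_reset()`, `model_fail()`, `model_notify_new_controller()` and `model_reset()` are all hypothetical names invented here, and the "hardware" is a boolean flag standing in for whether the device answers after a reset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the driver's controller state. */
struct ctrlr_model {
	bool is_failed;          /* controller has been declared failed */
	bool hw_responds;        /* simulated: does the device answer after a reset? */
	int  consumers_attached; /* simulated count of downstream consumers */
};

/* Simulated hardware reset: succeeds only if the device responds again. */
static bool
model_hw_reset(struct ctrlr_model *c)
{
	return (c->hw_responds);
}

/* Declare the controller failed and tear down its consumers. */
static void
model_fail(struct ctrlr_model *c)
{
	c->is_failed = true;
	c->consumers_attached = 0;
}

/* Assumed "new controller" notification: rebuild the consumers. */
static void
model_notify_new_controller(struct ctrlr_model *c)
{
	c->consumers_attached = 1;
}

/*
 * Attempt a reset even when the controller is already failed.
 * Returns 0 if the controller is usable afterwards, -1 otherwise.
 */
static int
model_reset(struct ctrlr_model *c)
{
	bool was_failed;

	assert(c != NULL);
	was_failed = c->is_failed;

	if (!model_hw_reset(c)) {
		/* Reset did not help; make sure we end up in the failed state. */
		if (!was_failed)
			model_fail(c);
		return (-1);
	}

	if (was_failed) {
		/* Reset undid the failure: clear it and re-announce the controller. */
		c->is_failed = false;
		model_notify_new_controller(c);
	}
	return (0);
}
```

In this model, calling `model_reset()` on a failed controller whose simulated hardware responds again clears `is_failed` and re-attaches a consumer, while a controller that still does not respond stays failed with no consumers. How the real driver would handle locking, in-flight requests, and the actual notification path is left open here.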