Page MenuHomeFreeBSD

Before issing the REMOVE_DEVICE command to the firmware, make sure that all commands have completed.
ClosedPublic

Authored by imp on Feb 19 2020, 10:12 PM.
Tags
None
Referenced Files
Unknown Object (File)
Nov 8 2024, 5:46 AM
Unknown Object (File)
Sep 28 2024, 4:09 AM
Unknown Object (File)
Sep 18 2024, 12:09 AM
Unknown Object (File)
Sep 16 2024, 10:06 PM
Unknown Object (File)
Sep 16 2024, 10:06 PM
Unknown Object (File)
Sep 16 2024, 10:06 PM
Unknown Object (File)
Sep 16 2024, 10:05 PM
Unknown Object (File)
Sep 16 2024, 9:53 PM
Subscribers
None

Details

Summary

Before issing the REMOVE_DEVICE command to the firmware, make sure that all commands have completed.

When we're removing a target that's not a volume, advertise up the stack that it's actually gone, as opposed to having a transient selection error we should retry. Do this both in the action routine, and when we get a notification of an aborted command.

Add assert that all commands are complete to REMOVE_DEVICE completion routine. We shouldn't have and commands in the queue because we've waited for them all. Any commands that make it into our action routine after we mark the target in removal will complete immediately with an error.

It's not OK to complete commands before we send the REMOVE_DEVICE. Instead, make sure that all pending commands are complete before sending that. By trying to second guess the firmware here, we run the risk of completing commands twice, which leads to corruption.

This removes the forced completion of commands introduced in r218811. So it's a partial backout of that commit, but replaces it with a more robust mechanism.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

sys/dev/mpr/mpr_sas.c
614 ↗(On Diff #68566)

This comment should likely explain that we can get completion records for cancelled I/Os after we get the completion record for the reset target. The Firmware doesn't make any guarantees. The docs are suggestive this shouldn't happen, but not air-tight.

sys/dev/mpr/mpr_sas.c
595 ↗(On Diff #68566)

This is a debug that should be reverted.

616 ↗(On Diff #68566)

what happens if one of these commands times out? There's no changes to the timeout code. Nor do we deal with 'timed out commands' in any special way.

617 ↗(On Diff #68566)

This message should be improved.

666 ↗(On Diff #68566)

This is a debug cleanup. Shouldn't be in the patch.

1973 ↗(On Diff #68566)

This change is because while we're in removal, we've committed to tearing down the device. If we return CAM_SEL_TIMEOUT, it will retry the command, which we don't really want.

2857 ↗(On Diff #68566)

This message is terrible. It should be better.

2859 ↗(On Diff #68566)

This message needs help and more information...

Use a slightly more conservative approach that doesn't do fast failures (yet).

So some notes.

  1. This doesn't solve the entire problem. It does, however, solve the device departure problem. I've audited the code paths, and this effectively shuts the door to new commands. It keeps us from scheduling any and it fails the ones that timeout and/or abort.
  2. We're effectively single stepping here, but there's other cases (where we lose and then regain PHY) that are transient and our recovery isn't great. We muddle through it decently well, but we should sideline all new I/O better and park cancelled I/O better when we are in recovery. This patch doesn't address that.
  3. Scott and I talked about doing a largish rewrite of the recovery code to defer it to a thread that could then single step things as appropriate and do the right things for edge triggered events.
  4. I didn't mark the commands as doomed. It didn't seem to be necessary to solve this part of the problem as they would all get a completion (or a timeout which would trigger a cancellation) eventually. We may need to mark the commands as doomed, but that needs to be part of the larger error recovery rework.
  5. The recovery from error path needs better logging, and although I tweaked logging a tiny bit, it needs a top-to-bottom audit to allow one to trace through all the recovery paths.

So this solves one problem (the drive departing completely), and it solves it in a way that I think is robust (I've traced through the different code paths and convinced myself that it is robust from the moment we get an event from the firmware saying the drive is gone). There's other issues it may not solve, but I'm actively hitting this issue in production and these patches do seem to solve the issue.

I have a second commit that's almost identical to this one that does the exact same thing to mps. I've only lightly tested it, but I have every reason to believe that it will be good.

Finally, sometimes I think we'd do well to have a mpi library that acts as a common set of code for these two drivers since their structure is quite similar, even if LSI/Avago/etc did some sketch things between MPI 2.5 and MPI 2.6. Such a rework would take a lot and require much testing.

So in spite of the other problems still lurking, I think this is good to go.

sys/dev/mpr/mpr_sas.c
624 ↗(On Diff #68646)

this is stray and will be deleted.

imp retitled this revision from Fail devices faster when we're removing them to Before issing the REMOVE_DEVICE command to the firmware, make sure that all commands have completed..Feb 22 2020, 4:47 AM
imp edited the summary of this revision. (Show Details)
imp added reviewers: scottl, ken, mav.
This revision was not accepted when it landed; it landed in state Needs Review.Feb 25 2020, 4:27 AM
This revision was automatically updated to reflect the committed changes.