Before issing the REMOVE_DEVICE command to the firmware, make sure that all commands have completed.
ClosedPublic
Actions

Authored by imp on Feb 19 2020, 10:12 PM.

Details

Reviewers

slm
scottl
ken
mav

Commits

rS358308: Before issing the REMOVE_DEVICE command to the firmware, make sure that all

Summary

Before issing the REMOVE_DEVICE command to the firmware, make sure that all commands have completed.

When we're removing a target that's not a volume, advertise up the stack that it's actually gone, as opposed to having a transient selection error we should retry. Do this both in the action routine, and when we get a notification of an aborted command.

Add assert that all commands are complete to REMOVE_DEVICE completion routine. We shouldn't have and commands in the queue because we've waited for them all. Any commands that make it into our action routine after we mark the target in removal will complete immediately with an error.

It's not OK to complete commands before we send the REMOVE_DEVICE. Instead, make sure that all pending commands are complete before sending that. By trying to second guess the firmware here, we run the risk of completing commands twice, which leads to corruption.

This removes the forced completion of commands introduced in r218811. So it's a partial backout of that commit, but replaces it with a more robust mechanism.

Diff Detail

Lint

Lint Passed

Unit

No Test Coverage

Build Status

Buildable 29523
Build 27390: arc lint + arc unit

Event Timeline

imp created this revision.Feb 19 2020, 10:12 PM

Herald added a reviewer: slm. · View Herald TranscriptFeb 19 2020, 10:12 PM

Harbormaster completed remote builds in B29504: Diff 68566.Feb 19 2020, 10:12 PM

imp added inline comments.Feb 19 2020, 11:29 PM

sys/dev/mpr/mpr_sas.c
628	This comment should likely explain that we can get completion records for cancelled I/Os after we get the completion record for the reset target. The Firmware doesn't make any guarantees. The docs are suggestive this shouldn't happen, but not air-tight.

imp added inline comments.Feb 19 2020, 11:49 PM

sys/dev/mpr/mpr_sas.c
595	This is a debug that should be reverted.
630	what happens if one of these commands times out? There's no changes to the timeout code. Nor do we deal with 'timed out commands' in any special way.
631	This message should be improved.
678	This is a debug cleanup. Shouldn't be in the patch.
1988	This change is because while we're in removal, we've committed to tearing down the device. If we return CAM_SEL_TIMEOUT, it will retry the command, which we don't really want.
2882	This message is terrible. It should be better.
2884	This message needs help and more information...

Use a slightly more conservative approach that doesn't do fast failures (yet).

Harbormaster completed remote builds in B29523: Diff 68646.Feb 21 2020, 6:32 PM

So some notes.

This doesn't solve the entire problem. It does, however, solve the device departure problem. I've audited the code paths, and this effectively shuts the door to new commands. It keeps us from scheduling any and it fails the ones that timeout and/or abort.
We're effectively single stepping here, but there's other cases (where we lose and then regain PHY) that are transient and our recovery isn't great. We muddle through it decently well, but we should sideline all new I/O better and park cancelled I/O better when we are in recovery. This patch doesn't address that.
Scott and I talked about doing a largish rewrite of the recovery code to defer it to a thread that could then single step things as appropriate and do the right things for edge triggered events.
I didn't mark the commands as doomed. It didn't seem to be necessary to solve this part of the problem as they would all get a completion (or a timeout which would trigger a cancellation) eventually. We may need to mark the commands as doomed, but that needs to be part of the larger error recovery rework.
The recovery from error path needs better logging, and although I tweaked logging a tiny bit, it needs a top-to-bottom audit to allow one to trace through all the recovery paths.

So this solves one problem (the drive departing completely), and it solves it in a way that I think is robust (I've traced through the different code paths and convinced myself that it is robust from the moment we get an event from the firmware saying the drive is gone). There's other issues it may not solve, but I'm actively hitting this issue in production and these patches do seem to solve the issue.

I have a second commit that's almost identical to this one that does the exact same thing to mps. I've only lightly tested it, but I have every reason to believe that it will be good.

Finally, sometimes I think we'd do well to have a mpi library that acts as a common set of code for these two drivers since their structure is quite similar, even if LSI/Avago/etc did some sketch things between MPI 2.5 and MPI 2.6. Such a rework would take a lot and require much testing.

So in spite of the other problems still lurking, I think this is good to go.

sys/dev/mpr/mpr_sas.c
624	this is stray and will be deleted.

imp retitled this revision from Fail devices faster when we're removing them to Before issing the REMOVE_DEVICE command to the firmware, make sure that all commands have completed..Feb 22 2020, 4:47 AM

imp edited the summary of this revision. (Show Details)

imp added reviewers: scottl, ken, mav.

This revision was not accepted when it landed; it landed in state Needs Review.Feb 25 2020, 4:27 AM

Closed by commit rS358308: Before issing the REMOVE_DEVICE command to the firmware, make sure that all (authored by imp). · Explain Why

This revision was automatically updated to reflect the committed changes.

imp added a commit: rS358308: Before issing the REMOVE_DEVICE command to the firmware, make sure that all.