Differential D34977

ada: Retry commands with retries left on CAM_SEL_TIMEOUT
ClosedPublic

Authored by imp on Apr 20 2022, 4:16 AM.

Details

Reviewers

mav

Group Reviewers

cam

Commits

rG6c8ab086fed3: ada: Retry commands with retries left on CAM_SEL_TIMEOUT

Summary

The ahci driver will return CAM_SEL_TIMEOUT when we have a command that
times out. When this command is a regular I/O command, we should retry
the command rather than failing it and invalidating the device. If the
device really is gone for good, the retries will fail and we'll do that
eventually. If this is a transient error, then we can recover from it
gracefully.
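
The retry decision described above can be sketched as a small standalone function. This is only an illustration of the policy, not the code in ata_da.c or in the CAM error path: every name here (the `sketch_` types, `SKETCH_RETRY_SELTO`, `sketch_on_completion`) is hypothetical, and the real kernel code has more status codes and more state than this model.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration; these are NOT the CAM definitions. */
enum sketch_status { SKETCH_OK, SKETCH_SEL_TIMEOUT, SKETCH_OTHER_ERROR };
enum sketch_action {
	SKETCH_DONE,			/* command completed normally */
	SKETCH_RETRY,			/* requeue the command */
	SKETCH_FAIL_AND_INVALIDATE,	/* fail the I/O and drop the device */
	SKETCH_FAIL			/* fail the I/O only */
};

/* Assumed opt-in flag: the caller asks for selection timeouts to be retried. */
#define SKETCH_RETRY_SELTO	0x1u

/*
 * Decide what to do with a completed command.  A selection timeout is
 * retried only when the caller opted in and retries remain; once the
 * retries are used up the device is treated as gone.
 */
enum sketch_action
sketch_on_completion(enum sketch_status status, unsigned int flags,
    int *retries_left)
{
	assert(retries_left != NULL && *retries_left >= 0);

	switch (status) {
	case SKETCH_OK:
		return (SKETCH_DONE);
	case SKETCH_SEL_TIMEOUT:
		if ((flags & SKETCH_RETRY_SELTO) != 0 && *retries_left > 0) {
			(*retries_left)--;
			return (SKETCH_RETRY);
		}
		return (SKETCH_FAIL_AND_INVALIDATE);
	default:
		return (SKETCH_FAIL);
	}
}
```

Without the opt-in flag, the first selection timeout goes straight to invalidation. With it, invalidation still happens, but only after the retry budget is exhausted, which is the "we'll do that eventually" case in the summary.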

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

imp created this revision. Apr 20 2022, 4:16 AM

Herald added a reviewer: cam. Apr 20 2022, 4:16 AM

imp requested review of this revision. Apr 20 2022, 4:16 AM

Harbormaster completed remote builds in B45247: Diff 105196. Apr 20 2022, 4:16 AM

imp edited the test plan for this revision. Apr 20 2022, 4:17 AM

imp added a reviewer: mav.

This flag was originally in 52c9ce25d8339 when Scott committed the great ATA/SCSI split in July 2009.
A few months later, mav removed it in 46f118fe3f73a when he merged in the code to support PIO only devices in October 2009.
With the p4 repo gone, it's quite hard to know whether this was an intentional cleanup, a cut-and-paste error, or something else.
The corresponding call in scsi_da.c has this flag, and it seems to make sense to me that we'd want it.
I think this is just an old mistake that makes recovering on a SATA drive less robust than on a SAS drive for no good reason, but I'd like to know if there is a reason or not...

jrtc27 added a subscriber: jrtc27. Apr 20 2022, 3:49 PM

This revision was not accepted when it landed; it landed in state Needs Review. May 1 2022, 5:11 PM

Closed by commit rG6c8ab086fed3: ada: Retry commands with retries left on CAM_SEL_TIMEOUT (authored by imp).

This revision was automatically updated to reflect the committed changes.

imp added a commit: rG6c8ab086fed3: ada: Retry commands with retries left on CAM_SEL_TIMEOUT.

The only case where I see AHCI return CAM_SEL_TIMEOUT is:

if (ch->devices == 0 ||
    (ch->pm_present == 0 &&
     ccb->ccb_h.target_id > 0 && ccb->ccb_h.target_id < 15)) {
        ccb->ccb_h.status = CAM_SEL_TIMEOUT;
        break;
}

This means the device is not detected. I don't see how that can be transient or why it should be retried.

In D34977#795715, @mav wrote:
The only case where I see AHCI return CAM_SEL_TIMEOUT is:
if (ch->devices == 0 ||
    (ch->pm_present == 0 &&
     ccb->ccb_h.target_id > 0 && ccb->ccb_h.target_id < 15)) {
        ccb->ccb_h.status = CAM_SEL_TIMEOUT;
        break;
}
This means the device is not detected. I don't see how that can be transient or why it should be retried.

It's not present, at the moment. However, I've observed that the device can be present a few tens or hundreds
of milliseconds later. We have several systems that exhibit this behavior. The device drops off the bus, we don't
detect it on the reset, and then some very short time later the SATA link is re-established and we detect it
again. We've noticed that the device is often perfectly fine after this event. These transients are caused by
a hardware issue (it shouldn't brown out like that), but it's a hardware issue that we have to live with.
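
To make the argument concrete, here is a toy model of such a dropout. It is a sketch under stated assumptions, not a trace of the real driver: the `sketch_link` structure and both functions are hypothetical, and the dropout is counted in submission attempts rather than in milliseconds, since the real timing between retries is not modeled here.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical model of a link that briefly loses its device.
 * absent_attempts is the number of upcoming submissions during which
 * the device is still undetected (an assumption standing in for the
 * tens or hundreds of milliseconds mentioned above).
 */
struct sketch_link {
	int absent_attempts;
};

/* One submission: fails while the device is absent, succeeds afterwards. */
static bool
sketch_submit(struct sketch_link *link)
{
	if (link->absent_attempts > 0) {
		link->absent_attempts--;
		return (false);		/* selection-timeout analogue */
	}
	return (true);
}

/*
 * Issue one I/O with a retry budget.  Returns true if the command
 * eventually succeeds, false if the budget runs out first.
 */
bool
sketch_io_with_retries(struct sketch_link *link, int retries)
{
	assert(retries >= 0);

	for (int attempt = 0; attempt <= retries; attempt++) {
		if (sketch_submit(link))
			return (true);
	}
	return (false);
}
```

In this model a dropout shorter than the retry budget is invisible to the caller, while a device that never comes back still fails once the budget is spent. With a budget of zero, even a single missed attempt fails the I/O.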

I've seen this race against the device teardown, such that the device is detected and back before we can
complete the teardown of the periph. There's also a race in the teardown code such that transactions are still
scheduled, but references aren't held, so the periph goes away. By retrying here, we avoid the race
(though not completely, so I'm also working on fixing the underlying race).

I should ask: are there any traces I can provide to help you understand the exact sequence of events?
I have it half instrumented locally, which is how I reached this conclusion, but if there are specific
events you'd like, I can add them and see if they trigger in our fleet (or, if we're really lucky, in my test
harness, which is basically just a switch that I can use to interrupt power to the drive momentarily).

Revision Contents
Changeset List

Path: sys/cam/ata/ata_da.c
Size: 2 lines

Diff 105619
