Page MenuHomeFreeBSD

ada: Retry commands with retries left on CAM_SEL_TIMEOUT
ClosedPublic

Authored by imp on Apr 20 2022, 4:16 AM.
Tags
None
Referenced Files
F83121103: D34977.id105619.diff
Mon, May 6, 2:27 PM
Unknown Object (File)
Fri, Apr 19, 3:25 PM
Unknown Object (File)
Thu, Apr 18, 5:33 AM
Unknown Object (File)
Tue, Apr 16, 1:46 PM
Unknown Object (File)
Thu, Apr 11, 11:45 PM
Unknown Object (File)
Thu, Apr 11, 7:54 PM
Unknown Object (File)
Mar 7 2024, 3:52 PM
Unknown Object (File)
Feb 24 2024, 11:11 AM
Subscribers

Details

Summary

The ahci driver will retrun CAM_SEL_TIMEOUT when we have a command that
times out. When this command is a regular I/O command, we should retry
the command rather than failing it and invalidating the device. If the
device really is gone for good, the retries will fail and we'll do that
eventually. If this is a transient error, then we can recover from it
gracefully.

Sponsored by: Netflix

Test Plan

This is related to a bug that the CHERI folks are seeing. They have a AHCI controller that is misbehaving and timing out. We don't retry the I/O, but instead fail the device. We should retry the I/O and this allows that (I think, I'm still waiting for confirmation from them).

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

imp requested review of this revision.Apr 20 2022, 4:16 AM
imp added a reviewer: mav.

This flag was originally in 52c9ce25d8339 when Scott committed the great ATA/SCSI split in July 2009.
A few months later, mav removed it in 46f118fe3f73a when he merged in the code to support PIO only devices in October 2009.
With the p4 repo gone, it's quite hard to know if this was an intentional cleanup, a cut and paste error or ?????
The corresponding call in scsi_da.c has this flag, and it seems to make sense to me that we 'd want it.
I think this is just an old mistake that makes recovering on a SATA drive less robust than on a SAS for no good reason, but I'd like to know if there is a reason or not...

This revision was not accepted when it landed; it landed in state Needs Review.May 1 2022, 5:11 PM
This revision was automatically updated to reflect the committed changes.

The only case I see AHCI to return CAM_SEL_TIMEOUT is:

 if (ch->devices == 0 ||
    (ch->pm_present == 0 &&
     ccb->ccb_h.target_id > 0 && ccb->ccb_h.target_id < 15)) {
        ccb->ccb_h.status = CAM_SEL_TIMEOUT;
        break;
}

, which means device is not detected. I don't see how it can be transient and why it should be retried.

In D34977#795715, @mav wrote:

The only case I see AHCI to return CAM_SEL_TIMEOUT is:

 if (ch->devices == 0 ||
    (ch->pm_present == 0 &&
     ccb->ccb_h.target_id > 0 && ccb->ccb_h.target_id < 15)) {
        ccb->ccb_h.status = CAM_SEL_TIMEOUT;
        break;
}

, which means device is not detected. I don't see how it can be transient and why it should be retried.

It's not present, at the moment. However, I've observed that the device can be present a few tens or hundreds
of milliseconds later. We have several systems that exhibit this behavior. The device drops off the bus, we don't
detect it on the reset, and then some very short time later the SATA link is re-established and we detect it
again. We've noticed that often times it's perfectly fine after this event. These transients are caused by
a hardware issue (it shouldn't brown out like that), but it's a hardware issue that we have to live with.

I've seen this 'race' the device tear down such that the device is detected and back before we can complete
the tear down of the periph. There's also a race in the teardown code such that transactions are still
schedule, but references aren't held so the periph goes away. By retrying here, we avoid the race
(though not completely, so I'm also working on fixing the underlying race).

I should ask if there's some traces I can provide to help you understand the exact sequence of events?
I have it half instrumented locally, which is how I reached this conclusion, but if there are specific
events you'd like, I can add that and see if they trigger in our fleet (or if we're really lucky my test
harness that's basically just a switch that I can use to interrupt power to the drive momentarily).