cam: Add 3e/3 as a fatal code
Closed, Public

Authored by imp on Fri, Jan 17, 8:17 PM.
Referenced Files
F108964510: D48505.diff
Thu, Jan 30, 1:40 AM

Details

Summary

We see this error:

(da4:mps0:0:3:0): SCSI sense: HARDWARE FAILURE asc:3e,3 (Logical unit failed self-test)

for drives that have failed. Our vendor tells us there is no recovery
from that state, though we can still grab logs from the drives and run
their diagnostics. Drives in this state basically need to be
remanufactured because some part of them has failed. The prior default
behavior was to retry, and the retries take a long time to run their
course. Instead, short-circuit the retries and fail right away. I selected
ENXIO because no I/O to LBAs is possible for drives in this state (both
in my experience and per the vendor). Some googling suggests that other
vendors behave identically, but it was inconclusive. Should this prove
too pessimistic, we can adjust it in the future. We hit this with some
aging drives in our fleet: if more than one drive is in this state, our
systems take so long to get to mountroot that the watchdog sometimes
fires. Adding this patch makes them boot reliably again.
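
To illustrate the idea (this is not the diff itself), here is a minimal,
self-contained sketch of how a sense-code table can short-circuit
retries. All names in it (sense_entry, sense_table, sense_lookup,
sense_disposition, ACT_RETRY, ACT_FATAL) are hypothetical and exist only
for this example; the real CAM sense tables and action flags live in the
kernel's SCSI code and are structured differently. The fallback of
"retry while retries remain, else EIO" is also an assumption made for
the sketch.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical actions; a simplified stand-in for CAM's sense actions. */
enum sense_action { ACT_RETRY, ACT_FATAL };

struct sense_entry {
	uint8_t asc;			/* additional sense code */
	uint8_t ascq;			/* additional sense code qualifier */
	enum sense_action action;	/* what to do when this code is seen */
	int error;			/* errno reported when action is ACT_FATAL */
	const char *desc;
};

/* Example table with the single entry discussed in the summary. */
static const struct sense_entry sense_table[] = {
	{ 0x3e, 0x03, ACT_FATAL, ENXIO, "Logical unit failed self-test" },
};

/* Return the table entry for asc/ascq, or NULL if there is none. */
static const struct sense_entry *
sense_lookup(uint8_t asc, uint8_t ascq)
{
	size_t i;

	for (i = 0; i < sizeof(sense_table) / sizeof(sense_table[0]); i++) {
		if (sense_table[i].asc == asc && sense_table[i].ascq == ascq)
			return (&sense_table[i]);
	}
	return (NULL);
}

/*
 * Decide what to do with a failed command.
 * Returns 0 to retry, or an errno value to fail the I/O immediately.
 */
static int
sense_disposition(uint8_t asc, uint8_t ascq, int retries_left)
{
	const struct sense_entry *e = sense_lookup(asc, ascq);

	/* Fatal codes skip the retry budget entirely. */
	if (e != NULL && e->action == ACT_FATAL)
		return (e->error);
	/* Assumed default for this sketch: retry, then give up with EIO. */
	if (retries_left > 0)
		return (0);
	return (EIO);
}
```

With this table, sense_disposition(0x3e, 0x03, 4) returns ENXIO at once
even though retries remain, while a code with no fatal entry is still
retried until the budget runs out. That is the behavior change the
summary describes: the failure is reported right away instead of after
a long retry sequence.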

Sponsored by: Netflix

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Not Applicable
Unit
Tests Not Applicable