ascq
ClosedPublic

Authored by imp on Jul 8 2025, 11:17 PM.

Details

Reviewers

mav
ken
jhb

Group Reviewers

cam

Commits

rGd78d04b17cb2: cam: Fail the disk if READ CAPACITY returns 4/2 asc/ascq

Summary

HGST disks that are sick are returning 44/0 for START UNIT (which we
ignore) and then 4/2 on READ CAPACITY. START UNIT should be enough for
READ CAPACITY to succeed or return UNIT ATTENTION. However, we get
NOT_READY + 4/2 back. I've seen this on several models of HGST drives.
Although the
timeout is 5s for READ_CAPACITY, we wait the full 30s for
READ_CAPACITY_16. This causes us to stall booting as we start to taste
as soon as we release the final hold... but the tasting means
g_wait_idle() now takes over 5 minutes to clear since we do this
for all the opens.

Perhaps both should use 5s. The READ_CAPACITY_16 code has used either
60s or 30s since it was originally committed in 2003, but that original
commit does not explain why (is there a reason, or was it just something
arbitrary?). Perhaps both should be more like 3s. This would also be less
bothersome and would reduce the tasting failure time to 30s or so. But
there's no sense in repeated failures, especially since there's no way
to re-taste a failure that was due to this. It's better not to adjust
the timeouts here (though that might be warranted) and instead fail the
periph.
Changing the timeouts is orthogonal to this problem.
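
As a rough illustration only (this is not the code from the commit), the
condition described above can be sketched as a small predicate. The
struct and function names are hypothetical; the numeric values are the
standard SCSI ones: sense key NOT READY is 0x02, and asc/ascq 04h/02h
means LOGICAL UNIT NOT READY, INITIALIZING COMMAND REQUIRED.

```c
#include <stdbool.h>
#include <stdint.h>

/* Standard SCSI sense key for NOT READY. */
#define SENSE_KEY_NOT_READY	0x02
/* asc/ascq 04h/02h: LOGICAL UNIT NOT READY, INITIALIZING COMMAND REQUIRED. */
#define ASC_LUN_NOT_READY	0x04
#define ASCQ_INIT_CMD_REQ	0x02

/* Hypothetical container for sense data that has already been decoded. */
struct sense_info {
	uint8_t sense_key;
	uint8_t asc;
	uint8_t ascq;
};

/*
 * Hypothetical helper: returns true when READ CAPACITY came back with
 * NOT READY + 4/2, the case in which the summary proposes failing the
 * periph instead of retrying until the timeout expires.
 */
static bool
readcap_should_fail_periph(const struct sense_info *si)
{
	return (si->sense_key == SENSE_KEY_NOT_READY &&
	    si->asc == ASC_LUN_NOT_READY &&
	    si->ascq == ASCQ_INIT_CMD_REQ);
}
```

A caller in the read capacity completion path would consult such a
predicate before scheduling another retry. Other NOT READY conditions,
such as 4/1 (in the process of becoming ready), are deliberately not
matched by this sketch.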

Perhaps we should fail the periph when START UNIT fails with the same
codes we check in the read capacity path. I'm reluctant to do such a
global change since it's in cam_periph, and there seems to be no good
way to flag that we want this behavior. It's also a bit magical when it
runs.