cam/da: Call cam_periph_invalidate on ENXIO in dadone

Description

Use cam_periph_invalidate() instead of just setting the PACK_INVALID
flag in the da softc. It's a more appropriate and bigger hammer for this
case. PACK_INVALID is set as part of that, so remove the now-redundant
setting. This also has the side effect of short-circuiting errors for
other I/O still in the drive that is just about to fail (sometimes with
different error codes than the one that triggered this ENXIO).
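
As a rough illustration of the difference, here is a small user-space
model. It is not the kernel code: every model_* and MODEL_* name is a
hypothetical stand-in, and the rule that an open is refused once the
periph is invalid is an assumption of the model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; these are NOT the real CAM definitions. */
#define MODEL_DA_FLAG_PACK_INVALID	0x01u
#define MODEL_PERIPH_INVALID		0x01u

struct model_softc {
	unsigned flags;			/* models the da softc flags */
};

struct model_periph {
	unsigned flags;			/* models the periph-level flags */
	struct model_softc *softc;
};

/* Old approach: only the driver-private flag is set. */
static void
model_mark_pack_invalid(struct model_periph *periph)
{
	assert(periph != NULL && periph->softc != NULL);
	periph->softc->flags |= MODEL_DA_FLAG_PACK_INVALID;
}

/* New approach: invalidate the periph; the pack flag comes along. */
static void
model_periph_invalidate(struct model_periph *periph)
{
	assert(periph != NULL && periph->softc != NULL);
	if (periph->flags & MODEL_PERIPH_INVALID)
		return;			/* already invalidated */
	periph->flags |= MODEL_PERIPH_INVALID;
	periph->softc->flags |= MODEL_DA_FLAG_PACK_INVALID;
}

/* Model assumption: an open is refused once the periph is invalid. */
static bool
model_can_open(const struct model_periph *periph)
{
	assert(periph != NULL);
	return ((periph->flags & MODEL_PERIPH_INVALID) == 0);
}
```

In this model, setting only the pack flag leaves the device openable,
while invalidating the periph refuses later opens and still leaves the
pack flag set.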

The prior practice of just setting the PACK_INVALID flag, however, was
too ephemeral to be effective. Since daopen would clear PACK_INVALID
after a successful open, we'd have to rediscover the error (which takes
tens of seconds) for every different geom tasting the drive. These two
factors led to a watchdog timeout before we could get through all the
devices if we had multiple failed drives with this syndrome. By
invalidating the periph, we fail fast enough to boot far enough to
start petting the watchdog. If we disable the watchdog, the tasting
eventually completes, but takes over an hour, which is too long. With
this change, it takes an extra minute per failed drive, which is
tolerable.

When the PACK_INVALID flag is already set, just flush remaining I/Os
with ENXIO. This bit will be set either when we've called
cam_periph_invalidate() before (so we're just waiting for the I/Os to
complete) or, more typically, when we've seen an ASC 0x3a, which is the
catch-all for 'drive is otherwise OK, we're just missing the media to
get data from'. In the latter case, we do not want to invalidate the
periph since we allow recovery from this with a trip through daopen().
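
The decision described above can be sketched as a small pure function.
This is a simplified model under stated assumptions, not the code in
dadone: the enum, the function name and the flattened (pack_invalid,
error) inputs are all hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical outcomes for an I/O that completed with an error. */
enum model_action {
	MODEL_COMPLETE_WITH_ERROR,	/* finish the I/O with its own error */
	MODEL_FLUSH_ENXIO,		/* flush remaining I/O with ENXIO */
	MODEL_INVALIDATE_PERIPH		/* first ENXIO: invalidate the periph */
};

/*
 * pack_invalid models the PACK_INVALID flag as it was when the error
 * arrived; error is the errno-style error for the failed I/O.
 */
static enum model_action
model_dadone_error_action(bool pack_invalid, int error)
{
	assert(error != 0);
	if (pack_invalid) {
		/* Already invalidated, or media missing (ASC 0x3a). */
		return (MODEL_FLUSH_ENXIO);
	}
	if (error == ENXIO)
		return (MODEL_INVALIDATE_PERIPH);
	return (MODEL_COMPLETE_WITH_ERROR);
}
```

Checking the flag first is what keeps the missing-media case from
invalidating the periph, so a later trip through daopen() can still
recover.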

While cam_periph_error's asc/ascq tables have a SSQ_LOST flag for
failing the entire drive, I've opted not to use that. That flag would
also cause all attached drivers, like pass, to detach, which is
undesirable. By not adding that flag, but just invalidating the da
periph driver, we prevent I/Os, but still allow collection of logs from
the device.

We can also simplify the logic w/o bloating the change, so do that too.

Finally, this has been tested on all the removable/non-removable disks
I could find, cd players, combo cd/da memory sticks, etc. I've removed
the media while doing I/O on several of them. With these changes, we
handle things correctly in all the cases I tested (except partially
inserted media, which fails chaotically the same as before). The number
of devices out there is, however, huge.

mav@ raised concerns about what happens when we have asc/ascq 28/0. I
see that on boot for one of my cards (that's not autoquirked) and, as
predicted in the review, we retry that transaction and we get proper
behavior. To be fair, though, I only ever saw it at startup where it was
a transient. I couldn't get some of my energy-saving disks to ever throw
that ASC/ASCQ, even after they spun down, so I've not tested that case.

Sponsored by: Netflix
Discussed with: mav@
Differential Revision: https://reviews.freebsd.org/D48689

Details

Provenance
imp Authored on Sat, Feb 8, 9:31 PM
Differential Revision
D48689: cam/da: Call cam_periph_invalidate on ENXIO in dadone
Parents
rG82fc49a0bebf: cam/da: Only mark pack as valid if we know the size in daopen