Page MenuHomeFreeBSD

cam/da: Call cam_periph_invalidate on ENXIO in dadone
Needs ReviewPublic

Authored by imp on Sat, Jan 25, 8:10 PM.

Details

Reviewers
mav
chs
ken
Group Reviewers
cam
Summary

Use cam_periph_invalidate() instead of just setting the PACK_INVALID
flag in the da softc. It's a more appropriate and bigger hammer for this
case. PACK_INVALID is set as part of that, so remove the now-redundant
setting, but use it to only complain the first time we hit the
condition. This also has the side effect of short-circuiting errors for
other I/O still in the drive which is just about to fail (sometimes with
different error codes than what triggered this ENXIO).

The prior practice of just setting the PACK_INVALID flag, however, was
too ephemeral to be effective.. Since daopen would clear PACK_INVALID
after a successful open, we'd have to rediscover the error (which takes
tens of seconds) for every different geom tasting the drive. These two
factors lead to a watchdog before we could get through all the devices
if we had multiple failed drives with this syndrome. By invalidating the
periph, we fail fast enough to reboot enough to start petting the
watchdog. If we disable the watchdog, the tasting eventually completes,
but takes over an hour which is too long. As it is, it takes an extra
minute per failed drive, which is tolerable.

While cam_periph_error's asc/ascq tables have a SSQ_LOST flag for this
situation, that flag also fails the other periph drivers, like pass,
attached to the device. The docs for these codes are too sparse to help
decide what to do. Err on the side of caution and only invalidate the da
device. Simple commands to collect logs for the vendor still work w/o
hangling the system or other adverse effects. Therefore, I've not added
SSQ_LOST to the asc/ascq table entries for the newer codes.

We can also simplify the logic w/o bloating the change, so do that too.

Sponsored by: Netflix

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped
Build Status
Buildable 62032
Build 58916: arc lint + arc unit

Event Timeline

The purpose of invalidating the pack in the da(4) driver is to deal with removable media. If the removable disk is pulled while the device is open, the pack is invalidated. Getting through a successful open again clears the flag.

With this change, we wind up invalidating the device, which isn't what we want to do with removable media.

We probably need to distinguish between SIP_MEDIA_FIXED and SIP_MEDIA_REMOVABLE in terms of how we handle errors like this. i.e. you can still get the fast fail behavior you're looking for (which I think is helpful) for fixed disks.

In D48689#1110375, @ken wrote:

The purpose of invalidating the pack in the da(4) driver is to deal with removable media. If the removable disk is pulled while the device is open, the pack is invalidated. Getting through a successful open again clears the flag.

With this change, we wind up invalidating the device, which isn't what we want to do with removable media.

We probably need to distinguish between SIP_MEDIA_FIXED and SIP_MEDIA_REMOVABLE in terms of how we handle errors like this. i.e. you can still get the fast fail behavior you're looking for (which I think is helpful) for fixed disks.

We already do the right thing for the ASC/ASCQ that indicate that the media is gone / changed. For those (mostly 3a) ASC codes we just set the PACK_INVALID bit. The removable media won't change with this change. the error codes we have flagged for EXNIO are typically the drive reporting that some subsystem of the drive has failed, that it can't pass diagnostics, etc.

However, the ENXIO error from daerror indicates that the ASC/ASCQ code from the I/O is believed to always be fatal and no further I/O to the LBAs will be possible. This is true independent of whether the drive has removable media or not. For those that might affect removable media, they are typically marked retry though there may be one or two marked as EIO. EIO is returned from daerror/cam_periph_error() for ordinary I/O errors. Plus we have extra logic in daerror() to cope with things like 3a which explicitly report the media is gone. In those cases, we'll not invalidate the da periph, but just set the PACK_INVALID bit. So unless a removable device fails with these asc/ascq codes and that failure can be fixed by popping in a different disk, I think that this is OK to do for all disks. My experience in the past, admittedly with just JAZ and ZIP drives, is that when there's an error, it will be some kind of media error or some kind of timeout, neither of which will trigger this path.

Without this change, and because we clear PACK_INVALID in daopen unconditionally (or nearly so: we only fail when the periph has already been invalidated), failing drives will be retried over and over in the geom tasting process at boot, which takes tens of minutes to complete per failed drive. 2 or 3 failed drives like this is enough to hit watchdog preventing coming up at all.