Page MenuHomeFreeBSD

cam/da: Call cam_periph_invalidate on ENXIO in dadone
ClosedPublic

Authored by imp on Jan 25 2025, 8:10 PM.
Tags
None
Referenced Files
Unknown Object (File)
Fri, Feb 21, 9:59 PM
Unknown Object (File)
Wed, Feb 19, 11:00 PM
Unknown Object (File)
Sat, Feb 8, 11:05 PM
Unknown Object (File)
Thu, Feb 6, 11:25 PM
Unknown Object (File)
Thu, Feb 6, 4:56 AM
Unknown Object (File)
Tue, Feb 4, 3:42 AM
Unknown Object (File)
Sun, Feb 2, 7:18 PM
Unknown Object (File)
Fri, Jan 31, 4:43 PM
Subscribers
None

Details

Summary

Use cam_periph_invalidate() instead of just setting the PACK_INVALID
flag in the da softc. It's a more appropriate and bigger hammer for this
case. PACK_INVALID is set as part of that, so remove the now-redundant
setting, but use it to only complain the first time we hit the
condition. This also has the side effect of short-circuiting errors for
other I/O still in the drive which is just about to fail (sometimes with
different error codes than what triggered this ENXIO).

The prior practice of just setting the PACK_INVALID flag, however, was
too ephemeral to be effective.. Since daopen would clear PACK_INVALID
after a successful open, we'd have to rediscover the error (which takes
tens of seconds) for every different geom tasting the drive. These two
factors lead to a watchdog before we could get through all the devices
if we had multiple failed drives with this syndrome. By invalidating the
periph, we fail fast enough to reboot enough to start petting the
watchdog. If we disable the watchdog, the tasting eventually completes,
but takes over an hour which is too long. As it is, it takes an extra
minute per failed drive, which is tolerable.

While cam_periph_error's asc/ascq tables have a SSQ_LOST flag for this
situation, that flag also fails the other periph drivers, like pass,
attached to the device. The docs for these codes are too sparse to help
decide what to do. Err on the side of caution and only invalidate the da
device. Simple commands to collect logs for the vendor still work w/o
hangling the system or other adverse effects. Therefore, I've not added
SSQ_LOST to the asc/ascq table entries for the newer codes.

We can also simplify the logic w/o bloating the change, so do that too.

Sponsored by: Netflix

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

The purpose of invalidating the pack in the da(4) driver is to deal with removable media. If the removable disk is pulled while the device is open, the pack is invalidated. Getting through a successful open again clears the flag.

With this change, we wind up invalidating the device, which isn't what we want to do with removable media.

We probably need to distinguish between SIP_MEDIA_FIXED and SIP_MEDIA_REMOVABLE in terms of how we handle errors like this. i.e. you can still get the fast fail behavior you're looking for (which I think is helpful) for fixed disks.

In D48689#1110375, @ken wrote:

The purpose of invalidating the pack in the da(4) driver is to deal with removable media. If the removable disk is pulled while the device is open, the pack is invalidated. Getting through a successful open again clears the flag.

With this change, we wind up invalidating the device, which isn't what we want to do with removable media.

We probably need to distinguish between SIP_MEDIA_FIXED and SIP_MEDIA_REMOVABLE in terms of how we handle errors like this. i.e. you can still get the fast fail behavior you're looking for (which I think is helpful) for fixed disks.

We already do the right thing for the ASC/ASCQ that indicate that the media is gone / changed. For those (mostly 3a) ASC codes we just set the PACK_INVALID bit. The removable media won't change with this change. the error codes we have flagged for EXNIO are typically the drive reporting that some subsystem of the drive has failed, that it can't pass diagnostics, etc.

However, the ENXIO error from daerror indicates that the ASC/ASCQ code from the I/O is believed to always be fatal and no further I/O to the LBAs will be possible. This is true independent of whether the drive has removable media or not. For those that might affect removable media, they are typically marked retry though there may be one or two marked as EIO. EIO is returned from daerror/cam_periph_error() for ordinary I/O errors. Plus we have extra logic in daerror() to cope with things like 3a which explicitly report the media is gone. In those cases, we'll not invalidate the da periph, but just set the PACK_INVALID bit. So unless a removable device fails with these asc/ascq codes and that failure can be fixed by popping in a different disk, I think that this is OK to do for all disks. My experience in the past, admittedly with just JAZ and ZIP drives, is that when there's an error, it will be some kind of media error or some kind of timeout, neither of which will trigger this path.

Without this change, and because we clear PACK_INVALID in daopen unconditionally (or nearly so: we only fail when the periph has already been invalidated), failing drives will be retried over and over in the geom tasting process at boot, which takes tens of minutes to complete per failed drive. 2 or 3 failed drives like this is enough to hit watchdog preventing coming up at all.

OK. I've read though the code another half dozen times. After the first 5 I thought I was still right. But on the 6th time I realized that another change is needed.... this code is suhtle.

The problem is that the current code is serving two different master. There's two errors that are signaled with ENXIO. One is 'the drive can never do I/O again', and the other adds 'until new media is inserted'. to the end. Both cd and da already set a bit to say the media is invalid when we see this 0x3a asc. But then try to set it again when we get ENXIO. I'd like to make ENXIO only for errors that aren't just 'media missing' and have another error for media missing. So w/o changing that for asc 0x3a, you're right, this does break removable media, at least for da.

So there's a few alternatives.

(1) Just do this for non-removeable drives, like you suggest. This is fine, for me, but we'd still have the case where we're talking to a mostly broken drive that has removable media, but the problems aren't just missing media.
(2) I can add SSQ_LOST to a bunch more ASC/ASCQ entries (basically all the ENXIO errors, except 0x3a). This would indirectly cause invalidate the da device, but also the pass device, which I'm trying to avoid.
(3) I could keep the test for PACK_INVALID for the whole if. That way we don't do it for the 0x3a that we saw on media missing errors (since we set that in daerror), but do it for all the other asc/acsq codes. I'd still invalidate then, but since we don't do it with an already invalid pack, the removable media devices would still work. It's imperfect in the face of multiple types of errors, So if the drive really goes away with the media missing, we have two cases. if the SIM signals it with CAM_DEV_NOT_THERE, then we'll do the SSQ_LOST action so that's good. If it's less than a total failure to the SIM, but just a really bad ASC / ASCQ pair, then we'll not invalidate the device. That might be a rare enough event to be OK: we'd already not be doing I/O to the drive, so not having the periph go away is likely fine. We'll clear the PACK_INVALID on the next daopen, and we'd get the other ASC then.
(4) I could invent a new error code for this case. Something between EIO and ENXIO. Let's call it EAGAIN_IO, but I'd have to make sure it doesn't leak outside of CAM. That's a more precise signal, but a bigger hassle.

I'll see if I can make (3) work and test it with qemu + scsi cd at least. I'll see if there's some VM that can also implement a ZIP or JAZZ drive. I no longer have removable da devices, though, that aren't usb sticks that have the SIM and periph go away too. If I can't make (3) work, I'll fall back to (1).

Looks like qemu can do removable da disks. So I should be able to test both.
And I also noticed that linux has ENOMEDIUM for my idea of having something between EIO and ENXIO. Too bad it's so hard to add a new error type because the size of the errno string array is baked into static binaries...

So thanks for the feedback, and sorry I was a bit unbelieving of your feedback at first.

Update after careful consideration of Ken's very helpful feedback.

Still need to test this on qemu, but it looks like I can make both cd and da
devices that have removable media with it.

I've expanded the comments a bit as well, since this is easily one of the most
confusing and subtle details of the da driver, especailly since pack
invalidation and periph invalidating are easily confused concepts.

This change require DA_FLAG_PACK_INVALID to be reliable avoid false periph invalidations. I am not sure current asc == 0x3a covers all the cases, I have feeling that 0x28/0x00 Not ready to ready change, medium may have changed is even more widespread case of media change and so pack invalidation.

In D48689#1110732, @mav wrote:

This change require DA_FLAG_PACK_INVALID to be reliable avoid false periph invalidations. I am not sure current asc == 0x3a covers all the cases, I have feeling that 0x28/0x00 Not ready to ready change, medium may have changed is even more widespread case of media change and so pack invalidation.

While I've never seen this code, a google search tells me that it is quite common. Even if it wasn't, I'd want to deal with it.

Today, we'd invalidate the media in the place I changed, just like we would for 0x3a (except 0x3a we also mark it earlier in daerror). Is there some reason we couldn't do the same for 0x28 as well in the same cascading if statement? Is there some reason we don't do that there, but defer it to the return path where we definitely will since it will be returned from cam_periph_error (or is that due to daerror being called multiple places)? Or do we need to do something else? ASC 0x28 appears to happen even on drives that probe as not removable (judging from the google search only, I can't test this directly). If we need to do something else, is that something else a TUR or ???? The other ASC we test for in daerror don't seem relevant, do you agree?

In D48689#1110732, @mav wrote:

This change require DA_FLAG_PACK_INVALID to be reliable avoid false periph invalidations. I am not sure current asc == 0x3a covers all the cases, I have feeling that 0x28/0x00 Not ready to ready change, medium may have changed is even more widespread case of media change and so pack invalidation.

So I looked into this. It took me a while, so bear with me.

First, we're not setting PACK_INVALID with 28/00 today, so I'm not understanding the context of your comment well enough.... We're just sending the message that the medium changed to geom_disk. It currently leaves the device valid for a bit and requests a retry of the command since 28/00 is always a unit attention status and we set RETRY_UA in our call to cam_periph_error. If the command works on retry (which it should, since getting the sense code should clear the sense code), then nothing changes. If the command persists on failing, it's likely hopeless anyway. I saw in one bug report on one of the forums where a read condition kept failing with 28/00 (from a post dated 2008), but could find no other evidence of it being in response to anything other than a TUR *OR* the I/O going on to succeed. But absence of evidence is not evidence of absence. So I tried all the removable USB devices I had, and couldn't trigger any 0x28/00 responses at all (at least that were printed).. So I'm at a bit of a loss for how to recreate those errors. Since you committed the rework that introduced the check, perhaps you have suggestions for hw or devices that trigger it? The only mention of it the standard I can find for it is for tape controllers (apc-3).

So, in the case where 28/00 persists, we'd invalidate the device rather than just the media with this change. I'm not entirely sure that it would (since it communicates an edge condition) and some other error condition would be communicated and we'd use that instead. If we keep getting this edge condition over and over for each retry of an I/O we do, I'm not sure we can have a sane reply.

I think that's OK. If we can find some path that gets back here via ENXIO for this condition, then maybe we could mark the PACK_INVALID in the 0x28 if clause in daerror. That would keep approximately the same behavior for this condition, but I don't know of what else it would affect. daopen would need to be called, but the internet suggests that 28/00 is most offen associated with TUR which happens before open so that would be fine. If it's after open for some reason, I think we'd have to go through open again anyway since our upper level disk has withered if I'm reading the media change code correctly, so maybe that wouldn't be a bad idea. I'm unsure and hesitant because I can't test the condition of concern directly.

This revision was not accepted when it landed; it landed in state Needs Review.Sat, Feb 8, 9:43 PM
This revision was automatically updated to reflect the committed changes.