Page MenuHomeFreeBSD

cd: Maintain a periph reference over async media checks
ClosedPublic

Authored by markj on Jan 20 2024, 6:54 PM.
Tags
None
Referenced Files
Unknown Object (File)
Fri, Jun 21, 6:31 PM
Unknown Object (File)
Fri, Jun 21, 10:37 AM
Unknown Object (File)
Fri, Jun 21, 9:35 AM
Unknown Object (File)
Sun, May 26, 3:04 AM
Unknown Object (File)
May 16 2024, 7:51 PM
Unknown Object (File)
May 16 2024, 8:06 AM
Unknown Object (File)
May 16 2024, 8:06 AM
Unknown Object (File)
May 12 2024, 7:58 AM
Subscribers
None

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

markj requested review of this revision.Jan 20 2024, 6:54 PM
markj created this revision.

The int to bool thing likely should be a separate commit though.
We don't call the strategy routine w/o calling open. geom calls it to taste the drive, and it doesn't do that until it acqures access which calls d_open.
I don't think that we can be withered until tasting is complete.
And I don't think we fail the I/O in the devices... those have to complete before the withering proceeds....

But... cdoninvalidate will flush the queue of I/O requests with an error and call disk_gone... that would mean the I/O that's queued up waiting for the media check would be completed while the media check is running...

So doing an aquire I think is needed to keep the cdoninvalidate from running until this is done. While we had an acquire that we got from cdopen when we entered, the I/O that followed that open is queued, so that cdclose might run once geom gets that I/O failed. the cddiskgonecb doesn't release the acquire from open, but a special one we take just before creating the disk.

About the normal state: we enter that once we've 'probed' the CD in our startup state machine. So we're able to do enough I/O to the drive to get its device particulars.... So we're completing the single stepping probe, so that's not the issue I saw: the probe would single step and fail, but an async call would then try to do I/O on the now defunct device.

tl'dr: while I may quibble where you do the acquire, I think I've talked myself into agreeing it's necessary because the I/O from the open isn't enough to keep close from running.

sys/cam/scsi/scsi_cd.c
2698

So if we get an error from acquire for the chstategy case, then there's nothing to clock the CD I/o that triggered this cdcheckmedia.

However, that's if periph is NULL (not here) or if the flags say it is invalid. That means we're racing a destroy and should get out, so that's OK I think. The oninvalidate callback will take care of failing the I/O that caused this cdcheckmedia.

Though I don't think this is the right place for it... Or maybe the conditional is wrong.. I'm unsure...

Slight change of mind after explaining in the bug and thinking what I wrote wasn't quite right. I don't think sync vs async is a useful disctinction: we're sending I/O to the device to do a thing that the cd driver wants independent of what others want, and that needs an acquire even if other things we're doing might happen to hold a ref when we do it sync.

sys/cam/scsi/scsi_cd.c
2698

That is to say, maybe we should always acquire here even in the sync cases. The sync cases have a ref that means they won't go away, true. But we're doing something that needs a ref and making it conditional of the 'need' likely is asking for trouble. If we always do it, then we'll be safe even if the sync cases do something weird like release the ref they started the mediacheck with out of band so that the device can go away while they are sleeping (that's likely also a bug, but the extra layer here may help us fail safe on that bug). So I don't think we should check do_wait here, but always take it out. I don't think we should move this like I suggested a few minutes ago to cdstrategy. Then in the done routine, we always release it. That would be more robust and more in keeping with a 'I need a reference to do a thing' rather than 'Oh, it might be unsafe, I'll take a reference to do that thing'. Acquire is cheap enough and this is infrequent. the xpt bus lock isn't contested.

In D43525#992487, @imp wrote:

The int to bool thing likely should be a separate commit though.

Ok.

We don't call the strategy routine w/o calling open. geom calls it to taste the drive, and it doesn't do that until it acqures access which calls d_open.
I don't think that we can be withered until tasting is complete.
And I don't think we fail the I/O in the devices... those have to complete before the withering proceeds....

But... cdoninvalidate will flush the queue of I/O requests with an error and call disk_gone... that would mean the I/O that's queued up waiting for the media check would be completed while the media check is running...

So doing an aquire I think is needed to keep the cdoninvalidate from running until this is done. While we had an acquire that we got from cdopen when we entered, the I/O that followed that open is queued, so that cdclose might run once geom gets that I/O failed. the cddiskgonecb doesn't release the acquire from open, but a special one we take just before creating the disk.

Right.

About the normal state: we enter that once we've 'probed' the CD in our startup state machine. So we're able to do enough I/O to the drive to get its device particulars.... So we're completing the single stepping probe, so that's not the issue I saw: the probe would single step and fail, but an async call would then try to do I/O on the now defunct device.

Note that the first attachment in PR 276251 contains this line:

Jan  5 09:57:53 master kernel: cd0: 0MB (1 0 byte sectors)

Presumably we hadn't yet successfully queried the device, hence the async query from cdstrategy().

Note that the first attachment in PR 276251 contains this line:

Jan  5 09:57:53 master kernel: cd0: 0MB (1 0 byte sectors)

Presumably we hadn't yet successfully queried the device, hence the async query from cdstrategy().

Kinda. If we print the above line, we have successfully queried the device. We went through our startup sequence, found there was no media in the device and made a note of that. So we're in the normal state now. However, since we noted we didn't have media cdstrategy now will do the mediachange query before any reads or writes are allowed through.

The only reason I parse carefully what you said is because entering normal mode is an important detail for this driver.

In D43525#992496, @imp wrote:

Note that the first attachment in PR 276251 contains this line:

Jan  5 09:57:53 master kernel: cd0: 0MB (1 0 byte sectors)

Presumably we hadn't yet successfully queried the device, hence the async query from cdstrategy().

Kinda. If we print the above line, we have successfully queried the device. We went through our startup sequence, found there was no media in the device and made a note of that. So we're in the normal state now. However, since we noted we didn't have media cdstrategy now will do the mediachange query before any reads or writes are allowed through.

The only reason I parse carefully what you said is because entering normal mode is an important detail for this driver.

Ok, I agree.

sys/cam/scsi/scsi_cd.c
2698

Note that cdstrategy() explicitly checks for an invalid periph before calling cdcheckmedia(), so I'm pretty sure an error can't actually occur here in the async case.

So the idea behind always acquiring the ref is that it ensures that synchronous probing is protected even if the sleeping thread wakes up and releases its ref before that state machine has fully run? Ok, that seems reasonable.

Acquire a reference unconditionally.

This might fix. The int to bool thing likely should be a separate commit though.

This revision is now accepted and ready to land.Jan 29 2024, 4:34 AM