Currently we might call into filesystem or pager code when completing a
non-dump CCB, which typically results in assertion failures or
deadlocks when dumping. D4647 attempted to handle this with a dedicated
CAM doneq, but as was pointed out in the review, that solution may cause
us to fail to unfreeze the device after an error. Try to handle the
problem a different way: filter non-dump CCBs in xpt_done_process(), and
attempt to unfreeze the devq after an error if we're running after a
panic. Stash non-dump CCBs on a list so that they can be inspected
during a debugging session.
Details
pho@ confirmed that the patch resolved one reproducible instance of
the problem.
Diff Detail
- Lint
Lint Passed - Unit
No Test Coverage - Build Status
Buildable 16881 Build 16758: arc lint + arc unit
Event Timeline
sys/cam/cam_xpt.c | ||
---|---|---|
5489 | dumping || SCHEDULER_STOPPED() is what we use elsewhere in cam_xpt.c to do the polling for the dump. We do this so that we can break to the debugger and call doadump(). In that case, dumping is true, but SCHEDULER_STOPPED() isn't. As such, I think the right test here is just 'dumping'. I'm racking my brains, and having trouble recalling when SCHEDULER_STOPPED would be true w/o dumping being true, but the comment says shutdown_post_sync() callback which I think we do things like idle the drive so that (some?) SSDs flush their SMART data because this is a clean shutdown. If you have && here, those callbacks don't get called... I need to study more to see if this is an issue or not. |
sys/cam/cam_xpt.c | ||
---|---|---|
5489 | Also, do you want the callbacks called from call to 'doadump'? |
sys/cam/cam_xpt.c | ||
---|---|---|
5489 | I don't think we want to execute this code if someone breaks into the debugger and calls doadump directly. Suppose a non-dump CCB gets stashed in the stailq in that case. After we leave the debugger, that CCB will never be completed and the system will likely hang. Either way, calling doadump from the debugger before a panic is fundamentally less reliable than dumping after a panic, and I'm really only trying to improve the latter case. That was my rationale for using &&. I don't quite follow the comment about post_sync callbacks. They are invoked before doadump() is called, so I don't see how the && affects them in any way. |
sys/cam/cam_xpt.c | ||
---|---|---|
5489 | You'll need to update the comment then. Say specifically, we aren't stopping the calls during a manual dump, and this only stops the calls after we've paniced and started the actual dump. post_sync_callbacks will run with this change, which we need them to. They will run with polling, but that still goes through this callback... |
So I'm still uneasy about this.
Why are we doing this at the lowest levels? Some of the CCBs that might be completed are for components that are fine during a dump. This is an upper layer problem, not a CAM problem. Wouldn't it make more sense to do this in biodone or bufdone? That way we don't have to worry about tagging every last little thing with CAM_CCB_DUMP that needs it. Since we're doing the dump, that won't go through the upper layers at all. If we put the block at the entry to geom, or the exit from geom, that would accomplish the same thing.
Tagging things with CAM_CCB_DUMP means that this new mechanism is opt-in, so it's helping minimize the amount of code that might be executed in a panic+dump context. I don't really agree with the notion that tagging things with CAM_CCB_DUMP is arduous; all I had to do was look at adadump(), dadump() and ndadump(). Your suggestion means that kernel code has to opt out instead. So for example, cam_iosched_bio_done() has to work in panic context. It does now, but there's no way to enforce that constraint, so if that stops being true in the future we'll have a new failure mode. I think we end up playing whack-a-mole either way.
I'll grant that biodone() is a fairly central routine and that moving the check there is probably good enough to address 99% of the problems, so I'm ok with making that compromise if you still prefer.
I can't say that I like it, but it makes sense to me.
Proposed biodone() to me sounds like too high level to handle this, since GEOM transformations can be complicated enough to corrupt the system state. Plus some code like ZFS may not use it at all.
I think you misunderstand. biodone is the lower entry point into geom. If we close it off there, then we won't enter into geom at all (though honestly, the upper layers should know if they can take I/O completions during panic: but we have no good place to hang that above geom, and the producer / consumer may not be terrible). Also, ZFS necessarily uses biodone because it must submit BIOs to the lower layers to get any I/O to happen. There's no way it couldn't. It doesn't use buf, so shutting the door at the geom_vfs_strategy layer wouldn't shut the door into ZFS, but ZFS is a big boy and can cope itself were we to push it that high. I agree, though, that at this time, that's too high, just as CAM is too low.
I like this a lot better. I can't think of any special case I/O we'd need to make an exception on. All the normal commands are what you'd want to block. The abnormal ones are things like FD_FORMAT which will screw things up, but completing the command won't screw them up worse, so that's good. The BIO_ZONE stuff is done via ioctls which won't screw up if we don't complete it. I'd like it if we could allow some clients complete and others be blocked, but that's somewhat beyond the scope.
Keep in mind that although the BIO_ZONE commands are currently done only via ioctl, they are designed to be queued from a filesystem along with other read and write BIOs. So please don't assume that they will always only be called from ioctls; that will change as soon as we have an SMR-capable filesystem.
It seems like this approach may continue to submit non-dump IO to the disk during panic via xpt_run_allocq, which is maybe dangerous if we paniced because a filesystem detected some sort of corruption. I am not very familiar with CAM and may be misreading the flow of execution.
During a panic, we stop the scheduler. No scheduler, only dump will schedule I/Os that modify the disk. CAM needs to run many non-read/write I/Os during or after the dump to (a) flush the blocks written by the dump and (b) properly shutdown all devices. Since we don't call the biodone completion routines, we can't schedule any I/O via that route either. Finally, all interrupts are disabled during a panic, so fast interrupt handlers can't be the source of a chain of events that sets off new I/O via a non-biodone vector. Since there's nothing that can schedule I/O, this method is completely adequate to prevent dangerous things from happening, and is not dependent on hacking a bunch of CAM drivers to do unnatural things.
Now, you can still get completions and new I/O if you call doadump because that doesn't stop the scheduler. This is considered a feature.
Indeed, but that's true regardless... by the time an FS has detected corruption lots of I/O may have completed.
CAM has this pattern where at completion, it tries to submit any pending IO. So is the consensus that we're ok submitting any existing IO on the cam_iosched queue that happened to be there prior to panic — in addition to dump-related IO?
Sure, but the magnitude of a race definitely matters.
The concern that raised this was the shutdown_final eventhandler in mps(4). It is panic-naive and attempts to use ordinary, non-polled CCB to spin down attached disks. It implements a sort of crappy poll/timeout mechanism using pause and getmicrotime. In CURRENT this maybe works ok because pause has a hack to fall back on DELAY() if we're in the debugger or panic, but kinda doesn't since getmicrotime stops working during panic.
So the concern is that the dumping && SCHED_STOPPED check may be too narrow. dumping is cleared when doadump returns. And e.g., the shutdown_final event is invoked during panic reboot. Many(?) (some, anyway) disk controller drivers register eventhandlers that attempt to submit CCBs during shutdown.
One unrelated question about this specific change — this is written in such a way that doadump can be invoked from a live system via ddb, but it seems like anything that gets shunted onto the nondump_bios list has a pretty high chance of timing out or deadlocking the system. It seems like it needs a hook to run bio completions once ddb is exited in order to be more correct.
A second unrelated question — it looks like xpt_polled_action is deadcode in CURRENT. If we're happy with this approach, maybe it should just be killed? Thanks.
Yes. This isn't new I/O. It's already been submitted at the time of the crash. Since the upper layers basically have no control over the order of the I/O once it gets here, to complete some but not others of it invites filesystem corruption. That's a bigger problem than not completing the I/O when there is corruption by and order of magnitude or more (the typical case is by far that you don't have corruption). You have to finish this I/O, if you can, IMHO.
Also, almost all of the time, these queues are empty because it's hard to keep the hardware queues full.
Sure, but the magnitude of a race definitely matters.
The concern that raised this was the shutdown_final eventhandler in mps(4). It is panic-naive and attempts to use ordinary, non-polled CCB to spin down attached disks. It implements a sort of crappy poll/timeout mechanism using pause and getmicrotime. In CURRENT this maybe works ok because pause has a hack to fall back on DELAY() if we're in the debugger or panic, but kinda doesn't since getmicrotime stops working during panic.
This is an example of I/O that should finish when shutting down the system....
DELAY looks like it will keep working on most systems, though. if we know the TSC frequency, and it's invariant, we'll use rdtsc32 which doesn't stop working. And even for those systems, we use the timecounter function that gets the count from the timecounter hardware, so that part of the function will keep working for all the timecounters I'm aware of. It does call sched_pin() and sched_unpin(), but those just dereference curproc which is still valid during a panic. So the right question here is why wasn't it working for you.
So the concern is that the dumping && SCHED_STOPPED check may be too narrow. dumping is cleared when doadump returns. And e.g., the shutdown_final event is invoked during panic reboot. Many(?) (some, anyway) disk controller drivers register eventhandlers that attempt to submit CCBs during shutdown.
There should be no problem with them doing so. Again, there's some I/O that's not read/write that's necessary to properly shutdown some devices (even if the request doesn't make it to the drive itself, since the firmware on the disk controller intercepts it and does its own magic). We absolutely must do this I/O, even in the case of a panic, or data loss will result. It's my understand that it's really non-optional. It's the basis for killing the 'do this at the CAM layer' argument.
One unrelated question about this specific change — this is written in such a way that doadump can be invoked from a live system via ddb, but it seems like anything that gets shunted onto the nondump_bios list has a pretty high chance of timing out or deadlocking the system. It seems like it needs a hook to run bio completions once ddb is exited in order to be more correct.
A second unrelated question — it looks like xpt_polled_action is deadcode in CURRENT. If we're happy with this approach, maybe it should just be killed? Thanks.
I think that's a Scott question. I wrote the code that obsoleted it. I think Scott may have held me back from removing it from the tree due to out of tree consumers of this interface.
Filesystems have to handle the race in any case. As Warner pointed out, by this point the filesystem has already submitted the I/O requests in question to the lower layers.
One unrelated question about this specific change — this is written in such a way that doadump can be invoked from a live system via ddb, but it seems like anything that gets shunted onto the nondump_bios list has a pretty high chance of timing out or deadlocking the system. It seems like it needs a hook to run bio completions once ddb is exited in order to be more correct.
We only queue items in nondump_bios after a panic. The behaviour when one breaks into ddb and manually triggers a dump is unchanged. That operation is always dangerous and has a chance of hanging the system.
The problem is that they will hang forever if they wait for an interrupt-driven CCB completion during panic — preventing reboot. After dump completes, nothing is polling.
How do you figure? The code does not check for panicstr.
We should be polling in that case. If we're not polling, then we're doing the test wrong and that should be fixed...
Looking closely, it seems that the mps driver allocates a bunch of CCBs and then fires and forgets, assuming their completion routines will run... Which won't happen. The commands will be sent, but we'll not process the completions because interrupts are disabled. this is a bug in the mps driver. In this case, though, we'll just get a scary message saying there was a timeout... To fix this, we'd need to poll somehow. In this case, I think the following would suffice:
diff --git a/sys/dev/mps/mps_sas_lsi.c b/sys/dev/mps/mps_sas_lsi.c index 1be5048aad2d..a13a811b0245 100644 --- a/sys/dev/mps/mps_sas_lsi.c +++ b/sys/dev/mps/mps_sas_lsi.c @@ -1208,8 +1208,13 @@ mpssas_SSU_to_SATA_devices(struct mps_softc *sc, int howto) * routine. */ while (sc->SSU_refcount > 0) { - pause("mpswait", hz/10); - + if (SCHEDULER_STOPPED()) { + DELAY(10000); + mpssas_poll(sassc->sim); + } else { + pause("mpswait", hz/10); + } + if (--timeout == 0) { mps_dprint(sc, MPS_FAULT, "Time has expired waiting " "for SSU commands to complete.\n");
mpr, on the other hand, bogusly uses getmircotime to do the 60s timeout. It should be changed to whatever we come up with for mps. There's about 2 dozen drivers with shutdown_final that we need to audit. mpt, thankfully, isn't one of them.
Right.
Looking closely, it seems that the mps driver allocates a bunch of CCBs and then fires and forgets, assuming their completion routines will run... Which won't happen.
Yes, that's exactly what we hit on 11.0ish (and probably earlier).
The commands will be sent, but we'll not process the completions because interrupts are disabled. this is a bug in the mps driver. In this case, though, we'll just get a scary message saying there was a timeout...
Ah, this is due to an enhancement to the mps driver between 11.0ish and 12.0 (r330951). In 11.0ish, the refcount drain loop actually does a time comparison using getmicrotime(), which doesn't work during panic.
And in 10.0ish, pause() doesn't have the fallback to DELAY() in a panic environment.
To fix this, we'd need to poll somehow. In this case, I think the following would suffice:
The pause->DELAY thing isn't strictly needed on CURRENT, and I don't know about the mps-specifics of poll, but the broad approach looks good to me.
Amendment to my prior thing. We'll need a cam_sim_poll() function. We have to call the sim_poll function, and the call camisr_runqueue() and there's some 'hidden' locking that goes on... I'll open a review with cem, markj, scottl, ken, asomers and mav with it and the fixes...