
ada/da: Ignore CCBs at wrong priority in *start
Abandoned
Public

Authored by imp on Jul 18 2024, 10:02 PM.
Tags
None
Subscribers
None

Details

Reviewers
mav
jhb
Group Reviewers
cam
Summary

In tracking down lifecycle issues with da and ada, I noticed we'd get
CCBs of the wrong priority for the state of the probe state machine.
While debugging 6c8ab086fed3, I created this patch, but didn't upstream
it. I had thought that bug was the only cause of the bad CCBs, but I was
mistaken. We still see this message about twice a month in Netflix's
fleet, though the root cause of 6c8ab086fed3 is now gone (despite
the uncertainty expressed in the log: 1-2 a week before, now 0 in
two years).

One cause can be the dynamic I/O scheduler when we're rate limiting
I/O. We'll call the start routine when a timer expires, but that will
interfere with the state machine.

Another cause of this may be related to the I/O coming in too quickly
while we're recovering the device after a different device fails on
mpr/mps.

So to fail safe, since we have to carefully single-step the queue when
we're running the state machine, only accept CCBs that are at priority
CAM_PRIORITY_DEV when we're doing that. Otherwise, in the normal mode,
only accept CCBs at priority CAM_PRIORITY_NORMAL. I/O that would
normally be scheduled while the state machine runs is now deferred (it
picks back up again when we enter the normal mode).

Also add a whiny message on the off chance others were seeing this
problem, to gauge the priority of a fix for the underlying issue.
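
The check described above can be sketched in userland C, as a rough
illustration only. Everything prefixed with sketch_ is hypothetical: the
enum values are placeholders rather than the real CAM priority values,
the two states stand in for the drivers' probe and normal states, and
what the real start routine does with a rejected CCB is not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-ins; the real CAM priorities and da/ada states differ. */
enum sketch_priority { SKETCH_PRIORITY_DEV = 1, SKETCH_PRIORITY_NORMAL = 2 };
enum sketch_state { SKETCH_STATE_PROBE, SKETCH_STATE_NORMAL };

struct sketch_ccb {
	enum sketch_priority priority;
};

struct sketch_softc {
	enum sketch_state state;
	unsigned rejected;	/* count of CCBs ignored so far */
};

/* Priority the start routine is willing to service in the current state. */
static enum sketch_priority
sketch_expected_priority(const struct sketch_softc *sc)
{
	return (sc->state == SKETCH_STATE_NORMAL ?
	    SKETCH_PRIORITY_NORMAL : SKETCH_PRIORITY_DEV);
}

/*
 * Return true if the CCB matches the priority expected for the current
 * state. Otherwise count it, complain, and return false so the caller
 * can skip it; the deferred work is retried once the state changes.
 */
static bool
sketch_start_accepts(struct sketch_softc *sc, const struct sketch_ccb *ccb)
{
	enum sketch_priority want;

	assert(sc != NULL && ccb != NULL);
	want = sketch_expected_priority(sc);
	if (ccb->priority == want)
		return (true);
	sc->rejected++;
	fprintf(stderr, "sketch: ignoring CCB at priority %d, expected %d\n",
	    (int)ccb->priority, (int)want);
	return (false);
}
```

With a softc in SKETCH_STATE_PROBE, a CCB at SKETCH_PRIORITY_DEV is
accepted and one at SKETCH_PRIORITY_NORMAL is ignored with a message;
after switching to SKETCH_STATE_NORMAL the roles reverse.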

nda has no real discovery state machine that re-runs after I/O
processing starts, so no workaround is needed there.

Sponsored by: Netflix

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped
Build Status
Buildable 58726
Build 55614: arc lint + arc unit

Event Timeline

imp requested review of this revision. Jul 18 2024, 10:02 PM

Misc updates with testing

Note: I think I may hold off committing this one given D46038 would fix all known instances of it. I'll keep this in my tree here at Netflix to prove it works. If so, I may revise this to be a panic and commit that.... Further testing will tell, since I do not have a good reproducer for this... I only see it sometimes in the fleet when, at least in the last few I can look at in detail, we have some error.

I may also want to see why we seem to always call it on the expiration of the quantum... That may be a relatively harmless bug that I'd not considered a bug when I did the investigation into all this stuff a couple of years ago.... These messages almost certainly indicate that this should be viewed as a bug. I'm unsure what I was thinking when I first discovered it and put them in rather than do a fix like D46038. I don't seem to have notes from the time either... :(

I suspect users won't report random printfs, so if you want to commit this upstream it probably needs to be a panic/KASSERT instead so users will notice.

I'll do this as asserts in 6 months or so if the other fixes I just pushed to -current (and Netflix's tree) eliminate all the priority messages in our logs...
Until then, abandon this to de-clutter things at least a little.