Avoid panicing for commands that poll-out.
AbandonedPublic
Actions

Authored by imp on Apr 18 2020, 2:44 AM.

Details

Reviewers

jimharris
cperciva
mav
chuck
chs
scottl

Summary

Previously, we'd panic if the 'can't fail to complete' commands that
we're polling to complete if they didn't finish after 1 second. This
is not a desirable outcome if you have a system with a lot of nvme
drives. Some may be flakey, and if they are flakey in just the right
way, the entire system can become unavailable (rather than having just
one drive offline). To prevent that, rearrange things so that we can
fail the request, just like we do other commands that might timeout.

These commands already have code to cope with these commands failing,
so we fail the command in a way that this code can be used.

Test Plan

Not sure how to test this...

We have a couple of NVMe drives that hit this, and there's a couple bugs in the PRs. It might be better to just do normal timeouts rather than the fake timeouts that we do. Reviewers please comment. We're doing 'pause' so the scheduler is running and interrupts are being serviced.

Diff Detail

Lint

Lint Passed

Unit

No Test Coverage

Build Status

Buildable 30576
Build 28320: arc lint + arc unit

Event Timeline

imp created this revision.Apr 18 2020, 2:44 AM

Herald added a subscriber: dab. · View Herald TranscriptApr 18 2020, 2:44 AM

Harbormaster completed remote builds in B30576: Diff 70723.Apr 18 2020, 2:44 AM

imp added reviewers: jimharris, cperciva, mav, chuck, chs, scottl.Apr 18 2020, 2:45 AM

imp edited the test plan for this revision. (Show Details)

scottl accepted this revision.Apr 19 2020, 5:42 AM

This revision is now accepted and ready to land.Apr 19 2020, 5:42 AM

I worry this is at least incomplete implementation or even wrong approach. The problem I see in what happen if/when the command still complete after the polling gave up, or when the timeout handler expire. I think at very least it will be a stack corruption.

In D24480#538892, @mav wrote:

I worry this is at least incomplete implementation or even wrong approach. The problem I see in what happen if/when the command still complete after the polling gave up, or when the timeout handler expire. I think at very least it will be a stack corruption.

What happens when we try to cancel normal timeouts? We race there as well... panics are also the wrong approach...

I'm thinking, though, that these should all just be normal timeouts that are short so we have the same issues everywhere and at least try to cancel the transaction before returning.

In D24480#538896, @imp wrote:

I'm thinking, though, that these should all just be normal timeouts that are short so we have the same issues everywhere and at least try to cancel the transaction before returning.

Yes, I also though about unification with normal timeouts. There is no way now to have different timeout values now, and there may be a specifics of timeout recovery on early initialization stages (to avoid recursion), but IMO that would be the right way.

generally, having the function that submits an async request return a pointer a request structure that is freed when the request completes is a bad idea, since the caller will have handle the request structure having already been freed, and thus having a pointer to it isn't useful. as you guys discussed, using the normal timeout path for polled commands (and fixing the normal timeout path to not be racy) would be good. a fun corner case will be when the use of the abort command in the timeout path itself times out.

mav@ fixed this in a different way.