
nvme: fix a race between failing the controller and failing requests
ClosedPublic

Authored by imp on May 20 2021, 6:57 PM.

Details

Summary

Part of the nvme error-recovery process is to reset the
card. Sometimes this results in failing the entire controller. When nda
is in use, we free the CAM SIM, which sleeps until all outstanding I/O has
completed. However, with only one taskqueue thread, the request-fail task
can never run once the reset task sleeps there. Create two threads so that
I/O can be failed until it has all been processed and the reset task can
proceed.

Sponsored by: Netflix

Test Plan

Use my artificial fail on reset code:
start a huge fio job
set the fail on reset sysctl
nvmecontrol reset

make sure there's no hang
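The steps above might look like the following on a test box. The device names are examples, and the fail-on-reset knob is out-of-tree test code, so it is left as a placeholder rather than a real sysctl name.

```shell
# Start a large fio job against the nda-attached namespace (example device).
fio --name=stress --filename=/dev/nda0 --rw=randrw --iodepth=128 \
    --numjobs=8 --time_based --runtime=60 &

# ... set the artificial fail-on-reset sysctl here (private test patch) ...

# Trigger the reset path; the controller should fail rather than recover.
nvmecontrol reset nvme0

# fio should complete with I/O errors instead of hanging.
wait
```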

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

imp requested review of this revision.May 20 2021, 6:57 PM
imp added reviewers: mav, jhb, markj.

It seems a bit sad to have to use a blocking task for this vs some kind of counter or state machine such that the task doesn't have to block.

sys/dev/nvme/nvme_ctrlr.c:1445
This revision is now accepted and ready to land.May 20 2021, 7:18 PM
In D30366#682171, @jhb wrote:

It seems a bit sad to have to use a blocking task for this vs some kind of counter or state machine such that the task doesn't have to block.

In an ideal world, we'd just deref the sim and it would go away when the refcount reaches 0 after the I/O has failed, and then cleanup would happen. But today SIM removal is synchronous, and cam_sim_free() likely has consumers that assume the SIM is really gone once it returns, though I've not done an audit. So dereferencing it and having it eventually go away will have to wait...

The other workaround would be to fail all the I/O inline as part of freeing the sim. I worried about races there too, though smaller ones, and didn't want the problem half fixed. Though inelegant and a really big hammer, this approach ensures that kind of race can't happen.

I don't have particular objections, but it makes me wonder why, in the case of a failed controller, _nvme_qpair_submit_request() decouples the completion via the taskqueue and nvme_ctrlr_post_failed_request(), while in the case of a busdma error we call nvme_qpair_manual_complete_tracker() directly.

This revision now requires review to proceed.May 20 2021, 7:37 PM
In D30366#682178, @mav wrote:

I don't have particular objections, but it makes me wonder why, in the case of a failed controller, _nvme_qpair_submit_request() decouples the completion via the taskqueue and nvme_ctrlr_post_failed_request(), while in the case of a busdma error we call nvme_qpair_manual_complete_tracker() directly.

I'm not sure why we need a task to do the failures at all...

This requires a bit of work to cover the races when a consumer is sending
I/O requests to a controller that is transitioning to the failed state.  To
help cover this condition, add a task to defer completion of I/Os submitted
to a failed controller, so that the consumer will still always receive its
completions in a different context than the submission.

So there once was a need for it, but it's not clear to me that the need remains
now that we have more locking in the driver and can cope better with other
issues like hotplug. It's a good question for another day. And I'm not sure
why the same reasoning wouldn't also apply to failed busdma loading...

I'll commit this for now, but investigate other ways to fail the requests. It's not at all clear to me whether the races mentioned in the initial commit of this code are still relevant.

This revision was not accepted when it landed; it landed in state Needs Review.May 29 2021, 5:06 AM
This revision was automatically updated to reflect the committed changes.