Page MenuHomeFreeBSD

Revamp nvme recovery
Needs ReviewPublic

Authored by imp on Apr 3 2020, 11:31 PM.

Details

Summary

Calling the completion routine from the timout was a good idea for the first
generation of missing interrupt problems that we had. However, since we've solve
the real root cause for them its need is much lessened. In addition, it's always
been racy, but as a 'last resort' it wasn't too bad.

However, under heavy load on some cloud platforms we see missing interrupts for
reasons unknown. In that environment, something the timeout was doing was
causing the interrupt to fire, causing a race between the two threads. To fix
this race would require an additional lock / unlock pair in the hot path.

To avoid that, we revamp how we do recovery. If the controller has failed, we
still do what we did before: we reset the controller. This also picks up the
hotplug case.

However, for the first timeout, we now send an innocuous command to the card (A
GET_FEATURES for a required feature). We start a new timeout on the original
request (but do nothing further). If that works, then we do nothing further as
the card is unwedge. If sending that command causes a timeout, we reset the
card. If the original command doesn't get a completion before its new timeout
exires, we either abort the command (if enable_abort = 1), or we reset the
card. If sending the abort times out, then we reset the card.

This should be more robust and free races. It remains to be seen if it actully
works.

Diff Detail

Lint
Lint OK
Unit
No Unit Test Coverage
Build Status
Buildable 30270
Build 28047: arc lint + arc unit

Event Timeline

imp added inline comments.
sys/dev/nvme/nvme_private.h
340

I should likely make this a counter and publish it as a sysctl.

sys/dev/nvme/nvme_qpair.c
883

Maybe I need a comment here justifying reset on error. The GET FEATURES command is a can't fail sort of thing for a mandatory feature like arbitration.

cperciva added inline comments.
sys/dev/nvme/nvme_qpair.c
918

If nvme_qpair_complete_tracker lost a race, we could enter nvme_timeout with tr->req = NULL, resulting in a panic when we dereference req.

For that matter, with sufficient racing it's possible that tr->req is non-NULL and refers to a *completely different request* to the one which triggered the timeout.

cem added a subscriber: cem.

I'm sure we have some interested parties in this revision, but I'm not sure if they have Phab accounts. Added some possibilities, feel free to remove yourselves / forward.