
Avoid concurrent nvme_qpair_process_completions runs
AbandonedPublic

Authored by cperciva on Apr 1 2020, 11:42 PM.

Details

Reviewers: None
Summary

Right now it's possible for nvme_timeout to run at the same time as nvme_qpair_msix_handler, resulting in two simultaneous calls to nvme_qpair_process_completions that mangle the qpair state. This seems to happen particularly often in EC2, possibly due to weird interrupt issues on the Nitro platform -- but it's a bug no matter what is triggering it.

This adds a new "nvme qpair process completions lock" to the nvme_qpair structure and holds it in nvme_qpair_process_completions while processing completed I/Os.
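
As a rough illustration of the shape of that change (the field and helper names below are placeholders chosen for this sketch, not necessarily the ones in the actual diff):

struct nvme_qpair {
        /* ... existing fields ... */
        struct mtx              qprc_lock;      /* serializes completion processing */
} __aligned(CACHE_LINE_SIZE);

bool
nvme_qpair_process_completions(struct nvme_qpair *qpair)
{
        bool done;

        /*
         * Only one context at a time -- the MSI-X handler or nvme_timeout --
         * may walk the completion queue.  The lock would be set up with
         * mtx_init() when the qpair is constructed.
         */
        mtx_lock(&qpair->qprc_lock);
        done = _nvme_qpair_process_completions(qpair);  /* the existing loop */
        mtx_unlock(&qpair->qprc_lock);
        return (done);
}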

It also adds, as a temporary debugging mechanism, a printf recording every time nvme_timeout is called. This may help me track down the underlying Nitro issue, but should not be committed.

Test Plan

I'm hoping that people who have run into nvme issues in EC2 can apply this patch and let me know the results.

Diff Detail

Repository: rS FreeBSD src repository - subversion
Lint: Skipped
Unit: Tests Skipped
Build Status: Buildable 30236

Event Timeline

If we need this lock, we've already lost....
What's the root cause of the timeout issue?
I suspect that we're racing because we have multiple outstanding I/Os, one times out and while we're running the completion code either that unwedges the card...

diff --git a/sys/dev/nvme/nvme_private.h b/sys/dev/nvme/nvme_private.h
index 45409f8cc0f0..2872bbfc30ad 100644
--- a/sys/dev/nvme/nvme_private.h
+++ b/sys/dev/nvme/nvme_private.h
@@ -217,6 +217,7 @@ struct nvme_qpair {
        bool                    is_enabled;

        struct mtx              lock __aligned(CACHE_LINE_SIZE);
+       volatile u_int          is_busy __aligned(CACHE_LINE_SIZE);

 } __aligned(CACHE_LINE_SIZE);

diff --git a/sys/dev/nvme/nvme_qpair.c b/sys/dev/nvme/nvme_qpair.c
index 38f6e6ef4ada..ddd1d533819a 100644
--- a/sys/dev/nvme/nvme_qpair.c
+++ b/sys/dev/nvme/nvme_qpair.c
@@ -550,6 +550,9 @@ nvme_qpair_process_completions(struct nvme_qpair *qpair)
        if (!qpair->is_enabled)
                return (false);

+       if (atomic_cmpset_int(&qpair->is_busy, 0, 1) == 0)
+               return (false);
+
        /*
         * A panic can stop the CPU this routine is running on at any point.  If
         * we're called during a panic, complete the sq_head wrap protocol for
@@ -634,6 +637,7 @@ nvme_qpair_process_completions(struct nvme_qpair *qpair)
                bus_space_write_4(qpair->ctrlr->bus_tag, qpair->ctrlr->bus_handle,
                    qpair->cq_hdbl_off, qpair->cq_head);
        }
+       atomic_cmpset_int(&qpair->is_busy, 1, 0);
        return (done != 0);
 }

I think this would do the same thing in a cheaper way.

Alternatively, we can disable interrupts while we're doing this on the theory that our touching the qpairs is triggering an interrupt...

diff --git a/sys/dev/nvme/nvme_qpair.c b/sys/dev/nvme/nvme_qpair.c
index 38f6e6ef4ada..d53cb661efdd 100644
--- a/sys/dev/nvme/nvme_qpair.c
+++ b/sys/dev/nvme/nvme_qpair.c
@@ -907,21 +907,26 @@ nvme_timeout(void *arg)
        struct nvme_controller  *ctrlr = qpair->ctrlr;
        uint32_t                csts;
        uint8_t                 cfs;
+       bool                    processed;

        /*
-        * Read csts to get value of cfs - controller fatal status.
-        * If no fatal status, try to call the completion routine, and
-        * if completes transactions, report a missed interrupt and
-        * return (this may need to be rate limited). Otherwise, if
-        * aborts are enabled and the controller is not reporting
-        * fatal status, abort the command. Otherwise, just reset the
-        * controller and hope for the best.
+        * Read csts to get value of cfs - controller fatal status.  If no fatal
+        * status, try to call the completion routine with interrupts disabled,
+        * and if completes transactions, report a missed interrupt and return
+        * (this may need to be rate limited). Otherwise, if aborts are enabled
+        * and the controller is not reporting fatal status, abort the
+        * command. Otherwise, just reset the controller and hope for the best.
         */
        csts = nvme_mmio_read_4(ctrlr, csts);
        cfs = (csts >> NVME_CSTS_REG_CFS_SHIFT) & NVME_CSTS_REG_CFS_MASK;
-       if (cfs == 0 && nvme_qpair_process_completions(qpair)) {
-               nvme_printf(ctrlr, "Missing interrupt\n");
-               return;
+       if (cfs == 0) {
+               nvme_mmio_write_4(ctrlr, intms, 1);
+               processed = nvme_qpair_process_completions(qpair);
+               nvme_mmio_write_4(ctrlr, intmc, 1);
+               if (processed) {
+                       nvme_printf(ctrlr, "Missing interrupt\n");
+                       return;
+               }
        }
        if (ctrlr->enable_aborts && cfs == 0) {
                nvme_printf(ctrlr, "Aborting command due to a timeout.\n");

But I'm unsure... We're doing recovery here, and we need to do something special, and it's unclear what the proper special thing is...

sys/dev/nvme/nvme_qpair.c:544

If we go with a mutex, it would be better to try_lock here and bail if we can't get the lock. That is a higher-level wrapper on the atomic_cmpset() that I did in the comments.
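
A hedged sketch of that suggestion, reusing the placeholder qprc_lock name from above, at the top of nvme_qpair_process_completions():

        /* Bail out if another context is already draining this completion queue. */
        if (mtx_trylock(&qpair->qprc_lock) == 0)
                return (false);
        done = _nvme_qpair_process_completions(qpair);  /* the existing loop */
        mtx_unlock(&qpair->qprc_lock);
        return (done);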

In D24256#533690, @imp wrote:

If we need this lock, we've already lost....
What's the root cause of the timeout issue?
I suspect that we're racing because we have multiple outstanding I/Os, one times out and while we're running the completion code either that unwedges the card...

That was my first thought. It occurred to me last night that there's another possibility -- is there any way that the callback req->cb_fn(req->cb_arg, cpl); is taking a long time? I don't know how far up the stack that callback goes.

sys/dev/nvme/nvme_qpair.c:544

This, and the atomic_cmpset you suggested in the comments, would introduce a different race:

  1. Thread #1 exits the while (1) loop.
  2. An I/O completes.
  3. Thread #2 checks the lock (or the "busy" atomic) and returns.
  4. Thread #1 unlocks the lock.

which would result in the driver getting stuck.

sys/dev/nvme/nvme_qpair.c:544

How? Thread 2 doesn't need to run in this case, so there's no getting stuck.

sys/dev/nvme/nvme_qpair.c:544

The I/O hadn't completed yet when Thread #1 checked. So neither thread is going to handle the completion.

sys/dev/nvme/nvme_qpair.c:544

The next timeout will catch this, or the next completion on this queue.

sys/dev/nvme/nvme_qpair.c:544

Only if there *is* a next timeout or completion. We could hang indefinitely with one outstanding completion if there aren't any more I/Os added to this queue. (Which could happen even if there's still significant I/O to the device, since we have a LIFO for free trackers -- if we have only a small number of concurrent I/Os they might keep on reusing the same trackers all on the same queue.)

sys/dev/nvme/nvme_qpair.c:544

The driver will have scheduled others for in-flight I/O, or it will cancel the I/O that's in flight, which will produce an interrupt... So either way, we'll have something to clock the next round of completions, and any stragglers will get picked up here. It's how the driver normally works (e.g., if we complete the I/O and exit the loop and the next I/O completes before we can exit the interrupt context, that's handled correctly today, and this race seems no different than that).

sys/dev/nvme/nvme_qpair.c:544

Each qpair has its own ithread and its own set of timeouts. There's no way for an interrupt on CPU #0 which was triggered by a completion on qpair #1 to do anything with I/Os which are part of qpair #2 (which are handled by CPU #8)...

sys/dev/nvme/nvme_qpair.c:544

if we complete the I/O and exit the loop and the next I/O completes before we can exit the interrupt context, that's handled correctly today

That's handled correctly because swi_sched sets ih_need = 1, which results in nvme_qpair_msix_handler being called again if an interrupt arrives while it is already running.

This doesn't cover the case where an interrupt arrives while nvme_timeout is running.

sys/dev/nvme/nvme_qpair.c:544

I still want to know why the timeout is happening in the first place. I added the code as a fallback, a failsafe, not as a reliable part of the driver. Even if we put this locking in, it's not a reliable, fail-safe part of the driver. I really really really do not want to slow down the normal path of the driver if we can't answer this fundamental question: why are they racing at all? What's causing the interrupt to miraculously fire enough of the time while the interrupt is running to cause all this contention? If we know what we're doing to cause that to happen, it would be a hell of a lot more reliable to just do that in the timeout routine to trigger an interrupt and let the ISR handle completions... The timeout means 'something totally f***ed up and now I have to sort this s*** out' and shouldn't slow down the fast path to accommodate it sorting the s*** out...
So maybe instead we should in the timeout try to do a get feature or some other innocuous command that will generate an interrupt and have the timeout from that just cause a full reset of the card and never call the completion routine from the timeout as a hail-mary since it's unsafe. That would get things unstuck without slowing down the current completion path.
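
For concreteness, a rough, untested sketch of that alternative; nvme_timeout_poke() and its callback are hypothetical names, while nvme_ctrlr_cmd_get_feature() is the driver's existing admin-command helper (assuming its current signature):

/*
 * Completion handler for the "poke" command: nothing to do here; the point
 * is only to make the controller generate an interrupt so the normal
 * ISR/ithread path clocks any stuck completions.
 */
static void
nvme_timeout_poke_done(void *arg, const struct nvme_completion *cpl)
{
}

/*
 * Hypothetical helper that nvme_timeout() could call instead of running
 * nvme_qpair_process_completions() itself.
 */
static void
nvme_timeout_poke(struct nvme_controller *ctrlr)
{

        nvme_ctrlr_cmd_get_feature(ctrlr, NVME_FEAT_NUMBER_OF_QUEUES, 0,
            NULL, 0, nvme_timeout_poke_done, NULL);
}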

Although it's a bit longer, I've spent some time trying to revamp the recovery so we don't need a lock in the normal completion path.
https://reviews.freebsd.org/D24284 is the result. I have no idea if it will work or not, but is presented here as perhaps a more robust way to cope.

Updating in case anyone looks back at this later: This fixes some instability on EC2 instances, and helped me to figure out what's going on. It looks like something -- probably the ENA device -- is monopolizing a CPU and starving the nvme completion ithread, and we were getting timeouts despite the interrupt having in fact fired successfully.

The starvation issue suggests that we probably need some sort of interlock between nvme_timeout and the completion processing; not necessarily this one though, since there are probably more efficient options than picking up a mutex.
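
One possibility, sketched here purely as an illustration (cpl_busy and cpl_pending are hypothetical fields, and this is untested): pair the ownership flag with a "pending" flag, so a caller that loses the race records that work may be waiting and the current owner re-runs the loop before giving up ownership. That avoids the missed-completion race described in the comments above.

static bool
nvme_qpair_process_completions_interlocked(struct nvme_qpair *qpair)
{
        bool done = false;

        /*
         * Record that there may be work to do before trying to take
         * ownership, so the current owner cannot miss our request.
         */
        atomic_store_rel_int(&qpair->cpl_pending, 1);
        if (atomic_cmpset_acq_int(&qpair->cpl_busy, 0, 1) == 0)
                return (false);         /* the owner will see cpl_pending */
        do {
                atomic_store_int(&qpair->cpl_pending, 0);
                done |= nvme_qpair_process_completions(qpair);  /* existing loop */
                atomic_store_rel_int(&qpair->cpl_busy, 0);
                /*
                 * If another pass was requested while we were finishing,
                 * try to take ownership again; if we lose that race, the
                 * new owner runs the loop instead.
                 */
        } while (atomic_load_acq_int(&qpair->cpl_pending) != 0 &&
            atomic_cmpset_acq_int(&qpair->cpl_busy, 0, 1) != 0);
        return (done);
}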