
nvme: provide mutual exclusion for interrupt handler
Abandoned · Public

Authored by imp on Oct 10 2022, 3:26 PM.
Tags
None

Details

Reviewers
mav
chs
chuck
jhb
Summary

Originally, the interrupt handler for the qpair completions was only
called from an interrupt context and was written to avoid locking as
long as possible. However, for crash dumps and at other times we need to
poll for completions. While the existing qpair lock is adequate to
protect the soft state of the qpair, it is inadequate to protect the
hardware against multiple threads reading and processing completions. The
completion loop is not sufficiently robust to cope with that
situation. To make it robust would require a lot of atomics. Allowing
only one thread to execute this completion loop would prevent the races
and restore the same preconditions the old code had while allowing the
new code that does bulk processing of timeouts to work and recover.
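
A minimal sketch of the idea, assuming a per-qpair in_isr flag (the wrapper name here is made up; the real loop is nvme_qpair_process_completions()):

```
/*
 * Sketch only: let a single thread at a time drain the hardware
 * completion queue.  The wrapper name and exact flag handling are
 * illustrative, not the driver's actual code.
 */
static void
nvme_qpair_poll_guarded(struct nvme_qpair *qpair)
{
	/* Claim the completion loop; back off if someone else holds it. */
	if (!atomic_cmpset_acq_int(&qpair->in_isr, 0, 1))
		return;

	nvme_qpair_process_completions(qpair);	/* existing completion loop */

	atomic_store_rel_int(&qpair->in_isr, 0);
}
```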

Sponsored by: Netflix

Diff Detail

Repository: rG FreeBSD src repository
Lint: Lint Skipped
Unit: Tests Skipped
Build Status: Buildable 53076
Build 49967: arc lint + arc unit

Event Timeline

imp requested review of this revision. Oct 10 2022, 3:26 PM

Why not just use a mutex, using mtx_trylock() for the acquire? Mutexes already check the panic special case.
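
For illustration, roughly what that suggestion looks like, assuming a dedicated per-qpair mutex (the field name is hypothetical); mtx_trylock() returns nonzero on success and already succeeds once the scheduler is stopped, so the panic/dump case needs no extra handling:

```
/* Sketch of the mtx_trylock() suggestion; the mutex field is hypothetical. */
static void
nvme_qpair_poll_locked(struct nvme_qpair *qpair)
{
	if (!mtx_trylock(&qpair->poll_mtx))
		return;		/* another thread is already in the handler */
	nvme_qpair_process_completions(qpair);
	mtx_unlock(&qpair->poll_mtx);
}
```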

I think this may create a race between a new command completion being added to the queue and an already-running handler: an interrupt that arrives at the last moment gets dropped, but the newly completed command has not yet been processed.

In D36924#838657, @mav wrote:

I think this may create a race between a new command completion being added to the queue and an already-running handler: an interrupt that arrives at the last moment gets dropped, but the newly completed command has not yet been processed.

So are you thinking something like the following:

timeout thread                                                    interrupt thread
o fires and starts processing events
o we find one whose phase mismatches
   and break out of the loop
                                                                             command completes and posts its completion record
                                                                             we interrupt into this routine and see we're blocked and return
o we finish exiting from the loop and
   update the head pointer, and return

Since we have pending completions in the hardware completion queue that we didn't process, we still have outstanding transactions, so the timeout routine will re-arm the timer. Worst case, we'll introduce up to a timeout period of extra latency. I'm not sure whether the host's write to the head pointer will trigger an interrupt; if it does, we have a similar race, and if not, it's neutral. The spec seems to only say "A completion queue entry is posted to the Completion Queue...The controller may generate an interrupt to the host..." which suggests that a host write to the head pointer won't generate an interrupt. And if other transactions complete before the timeout, the latency will be much smaller. Either way, the completion will be processed and we won't lose it.

Since this is an increased latency hit, it's useful to look at how often it can happen: almost always, an interrupt will be firing and not racing a timeout, so in the vast majority of cases there'd be no penalty. I'll add a counter in a subsequent commit showing the number of 'missed' interrupts, which will give an upper bound on the frequency of the occurrence and a possible bound on the latency hit. At least it will fail safe.

The obvious fix for this race is to move the atomic store before the write to the head pointer. There are problems with that. If we set the all-clear flag, fetch qpair->cq_head, and are then scheduled out while an interrupt comes in, that interrupt will be allowed in and might do all the processing before we have a chance to wake up and write the value. That thread will write the new value of head, and then we'll resume and write the old value. That isn't so much going backwards as jumping a long way forwards, which I think is undefined if there aren't enough entries in the completion queue; either way we'd miss a boatload of completions, leading to timeouts on the commands and a possible device reset. So the 'obvious' fix is no fix at all, given its downsides.
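
In code terms, the ordering being argued for is roughly the following; the doorbell write is shown via a hypothetical helper standing in for the driver's actual MMIO access:

```
/* Sketch of the orderings discussed above; helper names are made up. */

/* Current (safe) order: publish the consumed head entries first... */
nvme_qpair_write_cq_head_doorbell(qpair, qpair->cq_head);	/* MMIO write */
/* ...then let other threads back into the completion loop. */
atomic_store_rel_int(&qpair->in_isr, 0);

/*
 * The "obvious fix" reverses the two lines above.  Then another thread
 * can enter, drain further completions, and advance the doorbell before
 * this thread writes its now-stale cq_head, jumping the doorbell far
 * forwards (or, equivalently, losing a pile of completions).
 */
```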

Move in_isr to nvme_qpair, not nvme_controller. I'd intended it to be in the qpair
and don't know what I was thinking. Otherwise, if multiple ISRs fire at once,
you'll hit a race similar to the one mav@ described (maybe this is what he was saying?).

Again, you've hand-rolled what is essentially mtx_trylock(). Why not just use a mutex?

So just make a spin mutex and exit if try lock fails? Instead of in_isr? Something that basic? I keep thinking that I am missing something...

Ah, mtx_trylock_spin() returns true if the scheduler is stopped... I'd thought that was a $WORK extension.
I wonder whether the slight difference between SCHEDULER_STOPPED and in_panic (which I never remember until I dive back into it) matters.

In D36924#943708, @imp wrote:

Ah, mtx_trylock_spin() returns true if the scheduler is stopped... I'd thought that was a $WORK extension.
I wonder whether the slight difference between SCHEDULER_STOPPED and in_panic (which I never remember until I dive back into it) matters.

Looks like SCHEDULER_STOPPED is set in vpanic(), which prints some stuff and then calls kern_reboot(), which checks howto and a couple of other things before calling doadump(), which sets dumping (the other half of the || in my in_panic).
Absent a manual call to doadump, it looks like SCHEDULER_STOPPED will always be true when dumping is true. Otherwise the scheduler is running and we could hit a deadlock (though if the scheduler is running, we likely do want the mutex protection in place). That's likely OK, though, since the only time we're trying to exclude things is for resets that result from timed-out commands, and if we're resetting the nvme controller while trying to dump, my naive hunch is that it's already game over... though I haven't thought through how to keep that from hanging forever (in the cases I care about, a watchdog will save us, and that may be an OK answer, but more thought is needed).
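
For reference, a plausible spelling of the in_panic check being described (the exact definition isn't shown in this review):

```
/*
 * Hypothetical spelling of in_panic(): true once vpanic() has stopped
 * the scheduler, or once doadump() has set the global dumping flag
 * (covering a manually requested dump).
 */
static inline bool
in_panic(void)
{
	return (SCHEDULER_STOPPED() || dumping);
}
```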

Most likely for the manual dump case, we'd just check !dumping in the error recovery code and not bother with a reset, but fail all further I/O since the controller failed at that point and the safest thing to do is just stop. I'll evaluate that after I think a bit more...

If I understand drew's comments, then https://reviews.freebsd.org/D41452 does this, along with D36929.