nvme: Fix race condition in nvme_qpair_process_completions
Abandoned · Public

Authored by jrtc27 on Jul 2 2021, 4:41 AM.
Details

Reviewers
imp
kib
Summary

Under heavy load it is sometimes possible to hit the following panic (in
this case, on SiFive's HiFive Unmatched):

nvme0: cpl does not map to outstanding cmd
cdw0:00000000 sqhd:00fd sqid:0001 cid:0076 p:0 sc:00 sct:0 m:0 dnr:0
panic: received completion for unknown cmd

This can happen if the completion's cid is read from memory before the
completion's status, as that cid read could race with the device
updating status to the current phase. Thus we must call bus_dmamap_sync
again after checking the status and before reading any other fields in
order to ensure we have acquire-like semantics.

Note that the panic message is particularly confusing. The call to
nvme_dump_completion currently passes a pointer to the real in-memory
completion (rather than our local endian-converted copy) and so, in the
case of this race, actually prints out the consistent cid, not the stale
cid that was used to get the tracker.
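
A minimal sketch of the ordering described above (this is not the actual diff; the nvme_qpair member names used here -- dma_tag, queuemem_map, cpl, cq_head, phase, act_tr -- are assumed from the driver and abbreviated):

/*
 * Sketch only: the real function is nvme_qpair_process_completions()
 * in sys/dev/nvme/nvme_qpair.c.
 */
static void
process_completions_sketch(struct nvme_qpair *qpair)
{
    struct nvme_completion cpl;
    uint16_t status;

    bus_dmamap_sync(qpair->dma_tag, qpair->queuemem_map,
        BUS_DMASYNC_POSTREAD | BUS_DMASYNC_POSTWRITE);

    while (1) {
        /* Read the status word first and check the phase bit. */
        status = le16toh(qpair->cpl[qpair->cq_head].status);
        if (NVME_STATUS_GET_P(status) != qpair->phase)
            break;

        /*
         * Sync again before reading cid (and sqhd) so those reads
         * cannot come from a snapshot older than the status we just
         * accepted, giving the phase check acquire-like semantics.
         */
        bus_dmamap_sync(qpair->dma_tag, qpair->queuemem_map,
            BUS_DMASYNC_POSTREAD);

        cpl = qpair->cpl[qpair->cq_head];
        nvme_completion_swapbytes(&cpl);

        /* ... look up qpair->act_tr[cpl.cid], complete the tracker,
         * advance cq_head and flip the phase on wrap-around ... */
    }
}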

Diff Detail

Repository
rS FreeBSD src repository - subversion

Event Timeline

jrtc27 requested review of this revision. Jul 2 2021, 4:41 AM
sys/dev/nvme/nvme_qpair.c
602

Why wouldn't it suffice to do the sync before the first read?

sys/dev/nvme/nvme_qpair.c
586

This code seems to assume a copy from low address to high address. The phase in the

602

Oh, the race here is the device overtaking the host in updating the memory.

So the host reads a few bytes of the old record. The device then starts updating the record and writes the status before the host reads it. The host sees the new status with the right phase and bogusly uses the old values. This code ensures that won't happen.

Why does this happen on riscv but not x86?

sys/dev/nvme/nvme_qpair.c
602

There are two cases:

  1. If you bounce, then the sync is not guaranteed to see a consistent snapshot of data (though in practice, due to the alignment of the data, an optimised memcpy will do word-by-word copies and thus you'll at least get cid+status consistent on 32-bit architectures, and sqhd+sqid+cid+status consistent on 64-bit architectures, as they form naturally aligned machine words). This is not the normal case (and there are other issues currently with bouncing, ignoring speed).
  2. If you don't bounce, which is what happens in practice everywhere for NVMe (going fast is kinda the point), the sync is just a fence, but that does nothing to enforce the order in which you read the fields of the entry itself. A sync before the first read, which we do way above the loop, ensures you see the entries in the queue that triggered this interrupt, but because we loop (or, given the nature of interrupts, this could be spurious, or delayed such that we already processed the entries that caused it, so it'd still be possible without the loop, just rarer) we could see new entries in the completion queue that were written after that sync.

Pictorially (ignoring fields other than CID+STATUS):

Host                  Drive

                   Write A.CID
                        |
                        V
                  Write A.STATUS
                        |
                        V
               +--- Send IRQ
               |        |
Receive IRQ <--+        |
     |                  |
     V                  |
   Sync                 |
     |                  |
     V                  |
 Read A.CID             |
     |                  |
     V                  |
Read A.STATUS           |
     |                  |
     V                  |
  Process               |
     |                  |
     V                  |
Read B.CID              |
     |                  V
     |             Write B.CID
     |                  |
     |                  V
     |            Write B.STATUS
     V
Read B.STATUS

is currently possible due to the lack of ordering between status and cid reads.
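
In C11 terms (purely a mental model -- the driver uses busdma syncs rather than C11 atomics, and the names below are invented for illustration), the host side of entry B above amounts to this:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of one completion entry; bit 0 of status is the phase tag. */
struct toy_cpl {
    _Atomic uint16_t cid;
    _Atomic uint16_t status;
};

/*
 * With a relaxed status load, nothing stops the cid load from being
 * satisfied before the device's cid write even though the status load
 * saw the new phase (the "Read B.CID ... Read B.STATUS" interleaving
 * above). Making the status load an acquire orders the cid load after it.
 */
static bool
toy_poll(struct toy_cpl *e, unsigned phase, uint16_t *cid_out)
{
    uint16_t status = atomic_load_explicit(&e->status, memory_order_acquire);

    if ((status & 1) != phase)
        return false;
    *cid_out = atomic_load_explicit(&e->cid, memory_order_relaxed);
    return true;
}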

I think this is a correct fix. But it's late here and I'd like to sleep on it before saying yes.

sys/dev/nvme/nvme_qpair.c
602

On x86/arm/arm64 both GCC and Clang are happy using unaligned loads to copy packed structs (which all the nvme ones are), but on riscv they will use byte loads, since unaligned loads, whilst supported, may be emulated by firmware, causing performance to tank.

Since cid and status are in the same 32-bit word, and in reality the struct _is_ aligned, the copy happens to always be atomic on x86/arm/arm64, so you never see cid and status be inconsistent (e.g. disassembling a 12.2-RELEASE-p1 amd64 kernel I can see it just doing two movq loads and two movq stores).
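
For reference, a sketch of the completion-entry layout being discussed (field names follow the panic output in the summary; the exact declaration in sys/dev/nvme/nvme.h may differ):

struct nvme_completion {
    uint32_t cdw0;      /* dword 0: command-specific */
    uint32_t rsvd1;     /* dword 1: reserved */
    uint16_t sqhd;      /* dword 2: submission queue head pointer */
    uint16_t sqid;      /*          submission queue identifier */
    uint16_t cid;       /* dword 3: command identifier */
    uint16_t status;    /*          bit 0 is the phase tag */
} __packed;

cid and status share dword 3, and dwords 2-3 (sqhd, sqid, cid, status) form a single naturally aligned 64-bit word, which is why a whole-word copy keeps them consistent on 64-bit architectures.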

sys/dev/nvme/nvme_qpair.c
602

Way back in the armv4 days, we added a bunch of aligned(4) or aligned(8) to in-memory host structures to get the compiler to load things with the right instructions. What happens if you try that on the completion structure?

I had half a mind to check the status only, and if it's good, then read the whole CPL record too, but I worry about cache lines in ways my mind is too tired to grok...

Checking the standard, the completion queue itself is page (4k) aligned, and each queue entry is 16 byte aligned, so __aligned(16) may be appropriate (unless the arg is a power of 2).

sys/dev/nvme/nvme_qpair.c
602

You'd be able to get away with it on 64-bit architectures, but it's not just cid we need, we also need sqhd, which is in a different 32-bit word, though the same 64-bit word. Marking it as aligned is a good idea anyway to avoid the horrendous codegen currently seen on riscv, but it's not a complete fix.

If this weren't using busdma I'd just use an atomic_load_acq_16 on status and that'd be that...
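
A sketch of that hypothetical alternative (only valid if the queue memory were plain coherent memory rather than managed by busdma; cpl_is_ours is an invented helper name):

/*
 * Hypothetical: an acquire load of the status word would by itself order
 * the later cid/sqhd reads after the phase check.
 */
static bool
cpl_is_ours(volatile struct nvme_completion *entry, int phase)
{
    uint16_t status = atomic_load_acq_16(&entry->status);

    return (NVME_STATUS_GET_P(le16toh(status)) == phase);
}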

The comment

* Wait for any DMA operations to complete before the bcopy.

in riscv/busdma_bounce.c above both instances of fence() is nonsensical, of course.

This revision is now accepted and ready to land. Jul 2 2021, 9:09 AM

A slightly different take in D31002 as well. I'm unsure which approach is better.

In D30995#697423, @kib wrote:

The comment

* Wait for any DMA operations to complete before the bcopy.

in riscv/busdma_bounce.c above both instances of fence() is nonsensical, of course.

In fairness, the same lame comment is in arm64. :). Maybe we should remove them both?

sys/dev/nvme/nvme_qpair.c
586

This comment was accidentally left. I typed it up and realized in the middle it was bogus and thought I'd hit cancel.

Abandoning in favour of the slightly more efficient D31002.