nvme: Fix race condition in nvme_qpair_process_completions
Abandoned · Public

Authored by jrtc27 on Jul 2 2021, 4:41 AM.
Details

Reviewers
imp
kib
Summary

Under heavy load it is sometimes possible to hit the following panic (in
this case, on SiFive's HiFive Unmatched):

nvme0: cpl does not map to outstanding cmd
cdw0:00000000 sqhd:00fd sqid:0001 cid:0076 p:0 sc:00 sct:0 m:0 dnr:0
panic: received completion for unknown cmd

This can happen if the completion's cid is read from memory before the
completion's status, as that cid read could race with the device
updating status to the current phase. Thus we must call bus_dmamap_sync
again after checking the status and before reading any other fields in
order to ensure we have acquire-like semantics.

Note that the panic message is particularly confusing. The call to
nvme_dump_completion currently passes a pointer to the real in-memory
completion (rather than our local endian-converted copy) and so, in the
case of this race, actually prints out the consistent cid, not the stale
cid that was used to get the tracker.
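
A minimal sketch of the ordering described above (this is not the actual diff; the nvme_qpair member names used here -- dma_tag, queuemem_map, cpl, cq_head, phase, act_tr -- are assumed from the driver and abbreviated):

/*
 * Sketch only: the real function is nvme_qpair_process_completions()
 * in sys/dev/nvme/nvme_qpair.c.
 */
static void
process_completions_sketch(struct nvme_qpair *qpair)
{
    struct nvme_completion cpl;
    uint16_t status;

    bus_dmamap_sync(qpair->dma_tag, qpair->queuemem_map,
        BUS_DMASYNC_POSTREAD | BUS_DMASYNC_POSTWRITE);

    while (1) {
        /* Read the status word first and check the phase bit. */
        status = le16toh(qpair->cpl[qpair->cq_head].status);
        if (NVME_STATUS_GET_P(status) != qpair->phase)
            break;

        /*
         * Sync again before reading cid (and sqhd) so those reads
         * cannot come from a snapshot older than the status we just
         * accepted, giving the phase check acquire-like semantics.
         */
        bus_dmamap_sync(qpair->dma_tag, qpair->queuemem_map,
            BUS_DMASYNC_POSTREAD);

        cpl = qpair->cpl[qpair->cq_head];
        nvme_completion_swapbytes(&cpl);

        /* ... look up qpair->act_tr[cpl.cid], complete the tracker,
         * advance cq_head and flip the phase on wrap-around ... */
    }
}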

Diff Detail

Repository
rS FreeBSD src repository - subversion

Event Timeline

jrtc27 requested review of this revision. Jul 2 2021, 4:41 AM
sys/dev/nvme/nvme_qpair.c
602

Why wouldn't it suffice to do the sync before the first read?

sys/dev/nvme/nvme_qpair.c
586

This code seems to assume a copy from low address to high address. The phase in the

602

Oh, the race here is the device overtaking the host in updating the memory.

So the host reads a few bytes of the old record. The device then starts updating the record and writes the status before the host reads it. The host sees the new status with the right phase and bogusly uses the old values. This code ensures that won't happen.

Why does this happen on riscv but not x86?

sys/dev/nvme/nvme_qpair.c
602

There are two cases:

  1. If you bounce, then the sync is not guaranteed to see a consistent snapshot of data (though in practice, due to the alignment of the data, an optimised memcpy will do word-by-word copies and thus you'll at least get cid+status consistent on 32-bit architectures, and sqhd+sqid+cid+status consistent on 64-bit architectures, as they form naturally aligned machine words). This is not the normal case (and there are other issues currently with bouncing, ignoring speed).
  2. If you don't bounce, which is what happens in practice everywhere for NVMe (going fast is kinda the point), the sync is just a fence, but that does nothing to enforce the order in which you read the fields of the entry itself. A sync before the first read, which we do way above the loop, ensures you see the entries in the queue that triggered this interrupt, but because we loop (or, given the nature of interrupts, this could be spurious, or delayed such that we already processed the entries that caused it, so it'd still be possible without the loop, just rarer) we could see new entries in the completion queue that were written after that sync.

Pictorially (ignoring fields other than CID+STATUS):

Host                  Drive

                   Write A.CID
                        |
                        V
                  Write A.STATUS
                        |
                        V
               +--- Send IRQ
               |        |
Receive IRQ <--+        |
     |                  |
     V                  |
   Sync                 |
     |                  |
     V                  |
 Read A.CID             |
     |                  |
     V                  |
Read A.STATUS           |
     |                  |
     V                  |
  Process               |
     |                  |
     V                  |
Read B.CID              |
     |                  V
     |             Write B.CID
     |                  |
     |                  V
     |            Write B.STATUS
     V
Read B.STATUS

is currently possible due to the lack of ordering between status and cid reads.
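
In C11 terms (purely a mental model -- the driver uses busdma syncs rather than C11 atomics, and the names below are invented for illustration), the host side of entry B above amounts to this:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of one completion entry; bit 0 of status is the phase tag. */
struct toy_cpl {
    _Atomic uint16_t cid;
    _Atomic uint16_t status;
};

/*
 * With a relaxed status load, nothing stops the cid load from being
 * satisfied before the device's cid write even though the status load
 * saw the new phase (the "Read B.CID ... Read B.STATUS" interleaving
 * above). Making the status load an acquire orders the cid load after it.
 */
static bool
toy_poll(struct toy_cpl *e, unsigned phase, uint16_t *cid_out)
{
    uint16_t status = atomic_load_explicit(&e->status, memory_order_acquire);

    if ((status & 1) != phase)
        return false;
    *cid_out = atomic_load_explicit(&e->cid, memory_order_relaxed);
    return true;
}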

I think this is a correct fix. But it's late here and I'd like to sleep on it before saying yes.

sys/dev/nvme/nvme_qpair.c
602

On x86/arm/arm64 both GCC and Clang are happy using unaligned loads to copy packed structs (which all the nvme ones are), but on riscv they will use byte loads, since unaligned loads, whilst supported, may be emulated by firmware, causing performance to tank.

Since cid and status are in the same 32-bit word, and in reality the struct _is_ aligned, the copy happens to always be atomic on x86/arm/arm64, so you never see cid and status be inconsistent (e.g. disassembling a 12.2-RELEASE-p1 amd64 kernel I can see it just doing two movq loads and two movq stores).
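
For reference, a sketch of the completion-entry layout being discussed (field names follow the panic output in the summary; the exact declaration in sys/dev/nvme/nvme.h may differ):

struct nvme_completion {
    uint32_t cdw0;      /* dword 0: command-specific */
    uint32_t rsvd1;     /* dword 1: reserved */
    uint16_t sqhd;      /* dword 2: submission queue head pointer */
    uint16_t sqid;      /*          submission queue identifier */
    uint16_t cid;       /* dword 3: command identifier */
    uint16_t status;    /*          bit 0 is the phase tag */
} __packed;

cid and status share dword 3, and dwords 2-3 (sqhd, sqid, cid, status) form a single naturally aligned 64-bit word, which is why a whole-word copy keeps them consistent on 64-bit architectures.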

sys/dev/nvme/nvme_qpair.c
602

Way back in the armv4 days, we added a bunch of aligned(4) or aligned(8) to in-memory host structures to get the compiler to load things with the right instructions. What happens if you try that on the completion structure?

I had half a mind to check the status only, and if it's good, then read the whole CPL record too, but I worry about cache lines in ways my mind is too tired to grok...

Checking the standard, the completion queue itself is page (4k) aligned, and each queue entry is 16 byte aligned, so __aligned(16) may be appropriate (unless the arg is a power of 2).

sys/dev/nvme/nvme_qpair.c
602

You'd be able to get away with it on 64-bit architectures, but it's not just cid we need, we also need sqhd, which is in a different 32-bit word, though the same 64-bit word. Marking it as aligned is a good idea anyway to avoid the horrendous codegen currently seen on riscv, but it's not a complete fix.

If this weren't using busdma I'd just use an atomic_load_acq_16 on status and that'd be that...
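
A sketch of that hypothetical alternative (only valid if the queue memory were plain coherent memory rather than managed by busdma; cpl_is_ours is an invented helper name):

/*
 * Hypothetical: an acquire load of the status word would by itself order
 * the later cid/sqhd reads after the phase check.
 */
static bool
cpl_is_ours(volatile struct nvme_completion *entry, int phase)
{
    uint16_t status = atomic_load_acq_16(&entry->status);

    return (NVME_STATUS_GET_P(le16toh(status)) == phase);
}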

The comment

* Wait for any DMA operations to complete before the bcopy.

in riscv/busdma_bounce.c above both instances of fence() is nonsensical, of course.

This revision is now accepted and ready to land. Jul 2 2021, 9:09 AM

A slightly different take in D31002 as well. I'm unsure which approach is better.

In D30995#697423, @kib wrote:

The comment

* Wait for any DMA operations to complete before the bcopy.

in riscv/busdma_bounce.c above both instances of fence() is nonsensical, of course.

In fairness, the same lame comment is in arm64. :). Maybe we should remove them both?

sys/dev/nvme/nvme_qpair.c
586

This comment was accidentally left. I typed it up and realized in the middle it was bogus and thought I'd hit cancel.

Abandoning in favour of the slightly more efficient D31002.