Page MenuHomeFreeBSD

nvme: coherently read status
ClosedPublic

Authored by imp on Jul 2 2021, 2:16 PM.

Details

Summary

When reading a completion record, avoid a race with the device. If the
host starts to read the completion record and then the device updates it
while we're reading it, we can have the early part of the record be old
and the later part of the record be new. This leads us to mistakenly
think that the record is in phase and we use the old values and look
at an already completed entry, which has no current tracker.

To work around this problem, we atomically read the status with acquire
semantics. If it's in phase, we then re-read the entire completion
record. In addition we resync the dmatag to reflect changes since the
prior loop for the bouncing dma case.

Found by: jrtc27 (this fix is based in part on her D30995 fix)
Sponsored by: Netflix

Test Plan

A slightly different approach to fix this race than in D30995. We should consider both

Diff Detail

Repository
rG FreeBSD src repository
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

imp requested review of this revision.Jul 2 2021, 2:16 PM

I don't think you can skip it on the first iteration:

Host                  Drive

                   Write A.CID
                        |
                        V
                  Write A.STATUS
                        |
                        V
               +--- Send IRQ
               |        |
Receive IRQ <--+        |
     |                  V
     |             Write B.CID
     |                  |
     |                  V
     |            Write B.STATUS
     |                  |
     V                  |
   Sync                 |
     |                  |
     V                  |
 Read A.CID             |
     |                  |
     V                  |
Read A.STATUS           |
     |                  |
     V                  |
  Process               |
     |                  |
     V                  |
Read B.CID              |
     |                  |
     V                  |
Read B.STATUS           |
     |                  |
     V                  |
  Process               |
     |                  V
     |         +--- Send IRQ
     V         |        |
Receive IRQ <--+        |
     |                  |
     V                  |
Read C.CID              |
     |                  V
     |             Write C.CID
     |                  |
     |                  V
     |            Write C.STATUS
     V
Read C.STATUS

Well, at least in the bouncing case, since the load acquire doesn't actually help you. I think in the coherent non-bouncing DMA case the load acquire does address the issue, but also the sync isn't needed at all there.

FWIW, the completion handler in Linux does something similar, but it orders the memory synchronization differently. Effectively, it does

while ( phase_matches() ) {
    DMA_read_memory_barrier();
    process_completion():
}

where the phase_matches() function reads status and masks the phase bit.

FWIW, the completion handler in Linux does something similar, but it orders the memory synchronization differently. Effectively, it does

while ( phase_matches() ) {
    DMA_read_memory_barrier();
    process_completion():
}

where the phase_matches() function reads status and masks the phase bit.

Yes. That's effectively the order that Jessica's patch has. And I agree with her there is no 'first' here, so I'll update this patch with that.

imp edited the test plan for this revision. (Show Details)

fold in Jessica's observation that sync is need, as well as chuck's feedback.

silly compiler error due to too many retypings of the same code :(.

sys/dev/nvme/nvme_qpair.c
597

with the sync below (and above), do I need the atomic here?

sys/dev/nvme/nvme_qpair.c
591

cid not sqid, we don't look at the latter.

597

No, it can then just be a direct read of the field. I was hoping my version would be optimised by the compiler to be entirely equivalent, but it seems not... so I think reading just the field rather than copying the struct like I did is a good idea. I don't know why it didn't optimise it though, it really should have been able to (and the pass that does so is one I'm far too familiar with...).

607–609

I'd only check the phase. status is 2 bytes so the initial read can technically tear if you're not careful; at least it will currently on riscv due to the packed struct, i.e. the byte of status that doesn't include the phase bit could have been read before and inconsistent with the byte that does include the phase bit.

sys/dev/nvme/nvme_qpair.c
597

For a compliant device, the NVMe specification says:

If a completion queue entry is constructed via multiple writes, the Phase Tag bit shall be updated in the last write of that completion queue entry.

I think this implies that the sync's are sufficient and the atomic isn't needed.

imp marked 4 inline comments as done.Jul 2 2021, 4:51 PM
imp added inline comments.
sys/dev/nvme/nvme_qpair.c
591

tweaked.

597

thanks guys. removed.

607–609

Gotcha. I'd prefer to over-test, but if it doesn't matter, then it's better to not, even though the result is a bit awkward.

imp marked 3 inline comments as done.

review comments: remove atomic, tweak wording, test phase.

sys/dev/nvme/nvme_qpair.c
607–609

Somehow I forgot to mention in my previous comment (but thought it, and in fact was what made me then think about the assertion itself) that the message for this assertion doesn't make sense to me?

imp marked an inline comment as done.Jul 2 2021, 5:02 PM
imp added inline comments.
sys/dev/nvme/nvme_qpair.c
607–609

Looks like I'm so busted for copying and pasting, eh? Fixed.

I nearly wrote this exact diff rather than the one I ended up with; I didn't, for simplicity, but I do think it's better than my less efficient version.

This revision is now accepted and ready to land.Jul 2 2021, 5:14 PM

Subject and description will need updating to reflect the final patch before committing though

imp marked an inline comment as done.Jul 2 2021, 5:15 PM

Subject and description will need updating to reflect the final patch before committing though

Yes. I'll update, but changing that in mid review messes up git-arc unless done very carefully :(

In D31002#697581, @imp wrote:

Subject and description will need updating to reflect the final patch before committing though

Yes. I'll update, but changing that in mid review messes up git-arc unless done very carefully :(

Ah, right...