Page MenuHomeFreeBSD

nvme(4): Add bus_dmamap_sync() at the end of the request path
ClosedPublic

Authored by jhibbits on Aug 2 2018, 4:56 PM.

Details

Summary

Some architectures, in this case powerpc64, need explicit synchronization
barriers vs device accesses.

Prior to this change, when running 'make buildworld -j72' on a 18-core
(72-thread) POWER9, I would see "Missing interrupt" errors and controller resets
often. Also, due to the missing interrupts and resets, it would take ~13
minutes to build world. After this change the "Missing interrupt" messages do
not appear, and buildworld times are 10-11 minutes.

The placement of the syncs may not be 100% correct.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

I actually made it through a buildworld for once without panicing for once with this patch. My kernel's not clean at the moment, but I will rerun with only this patch in a bit.

In D16570#351747, @git_bdragon.rtk0.net wrote:

I actually made it through a buildworld for once without panicing for once with this patch. My kernel's not clean at the moment, but I will rerun with only this patch in a bit.

To elaborate here, I have been having a chronic issue on my Talos II with the nvme crashing in a way where it would stop sending interrupts and mmio reads would return all bits set to 1, and then me hitting a panic inside the controller reset due to my kernel options being set for maximum paranoia. It seems that running with this patch may have elimintated these crashes for me.

jimharris added inline comments.
sys/dev/nvme/nvme_qpair.c
405 ↗(On Diff #46193)

Hi Justin,

I'm no powerpc64 expert, so pardon any stupid questions I'm about to ask.

I think here this bus_dmamap_sync should go inside of the if (req->type != NVME_REQUEST_NULL) check on the next line. There are cases where an NVMe request can have no payload. Obviously your patch is still work but I think it's more correct to only do the sync if there's data associated with the command.

-Jim

492 ↗(On Diff #46193)

What's the impact of the dmamap_sync on a largish queue? For example, each cpl is only 16 bytes (NVMe completion entry size) but the queue can have 100s-1000s of possible entries. Ideally you want to only sync the part of the completion queue that you're about to read.

So I think this is correct, but just wondering if this will introduce some kind of performance issue.

829 ↗(On Diff #46193)

should this sync go after the memcpy?

888 ↗(On Diff #46193)

I think might be better to move this into nvme_qpair_submit_tracker - keep the sync for the payload and the submission queue entry in the same place.

But I think keeping it here is functionally correct.

This revision now requires changes to proceed.Aug 2 2018, 5:23 PM

Hi Jim,

I'll test the requested changes and update the patch shortly.

Thanks for the review!

  • Justin
sys/dev/nvme/nvme_qpair.c
405 ↗(On Diff #46193)

Sounds good. I'll move it in there.

492 ↗(On Diff #46193)

Funny you should mention this, because mmel@ just yesterday(?) mentioned a similar need for the sound system. It really would be better to have a bus_dmamap_sync_range() or similar API, but none exist currently.

This would pose an issue if bounce buffers are in play. If they're not, then on powerpc the overall cost will be a sync instruction. I'm not sure offhand when a bounce buffer is used on powerpc, so this could either be really costly, or almost free. Judging by the run_filter() function in sys/powerpc/powerpc/busdma_machdep.c, bounce buffers shouldn't come into play on powerpc in this case.

Saying that, someone should probably test this on x86, too, to see if there are any performance issues arising from it.

829 ↗(On Diff #46193)

Yes it should. Oops.

888 ↗(On Diff #46193)

I considered that, yes. However, since you did point out above that we could have a request with no payload, I tried to keep the payload sync isolated to when needed.

sys/dev/nvme/nvme_qpair.c
888 ↗(On Diff #46193)

Oh - you're right. Yeah - it should stay here.

Address jimharris's comments.

  • Move the payload sync to inside the NVME_REQUEST_NULL guard.
  • Move the queuemem sync to after the memcpy, where it belongs. This replaces the wmb() with the sync.

I recompiled with just this diff, and I was back to crashing during buildworld. Going to recompile with my local changes again so if it crashes again I have my tools to examine the nvme queues more extensively from the kernel debugger.

   0  8051 18.14 737 13.06   0.00   0  0.00  37  0 26  0 37
   0 18956 53.56 759 39.70   0.00   0  0.00  29  0 15  0 56
   0  4322 63.36 637 39.44   0.00   0  0.00   8  0  4  0 88
   0    61 57.16 168  9.37   0.00   0  0.00   6  0  1  0 93
   0  4645 75.95 326 24.21   0.00   0  0.00   6  0  1  0 93
   0 28317 24.88 1054 25.61   0.00   0  0.00   9  0 12  0 79
   0 19142 17.52 395  6.76   0.00   0  0.00  36  0 23  0 41
nvme0: cpl does not map to outstanding cmd
cdw0:00000000 sqhd:00cf sqid:0005 cid:001c p:1 sc:00 sct:0 m:0 dnr:0
panic: received completion for unknown cmd

cpuid = 40
time = 1533234367
KDB: stack backtrace:
0xe00000016cc74380: at 0xc0000000029b9754 = .kdb_backtrace+0x5c
0xe00000016cc744b0: at 0xc000000002957bf8 = .vpanic+0x1b4
0xe00000016cc74570: at 0xc000000002957dd0 = .kassert_panic+0xf8
0xe00000016cc74620: at 0xc0000000026aacd4 = .nvme_qpair_process_completions+0x15c
0xe00000016cc746c0: at 0xc0000000026aaee0 = .nvme_qpair_process_completions+0x368
0xe00000016cc74740: at 0xc000000002913ee0 = .intr_event_handle+0x534
0xe00000016cc74830: at 0xc00000000290e51c = .fork_exit+0x104
0xe00000016cc748f0: at 0xc000000002d53230 = .fork_trampoline+0x10
0xe00000016cc74920: at -0x4
KDB: enter: panic
[ thread pid 12 tid 100199 ]
Stopped at      0xc0000000029b9320 = .kdb_enter+0x60:   ld      r2, r1, 0x28
db>
jimharris added inline comments.
sys/dev/nvme/nvme_qpair.c
838 ↗(On Diff #46212)

Is it safe to remove the wmb? At least on x86, there is no explicit wmb in _bus_dmamap_sync. On x86, _bus_dmamap_sync is basically a nop unless there are bounce buffers in play.

I think the wmb() is still needed - probably after the bus_dmamap_sync.

This revision now requires changes to proceed.Aug 2 2018, 7:48 PM
sys/dev/nvme/nvme_qpair.c
838 ↗(On Diff #46212)

Okay. On powerpc, there already is an explicit sync in the bus_dmamap_sync().

In this case, it probably makes sense to guard it on either !powerpc, or on x86.

Bring back the wmb() for !powerpc. powerpc's bus_dmamap_sync already includes
an explicit hwsync, but other archs do not.

Looks good to me. Would be great if @git_bdragon.rtk0.net could retest with this latest patch.

ok, so unfortunately, I had another crash when testing with only these changes. My original crash where the controller straight up stops working and won't even reset.

I'm going to try again with a couple of extra bits that were in my local diff before the test and see if I can still get it to crash.

ok, crashed with the local bits I was wondering about as well.

Going to do some tests with hw.nvme.force_intx=1 now, followed by retesting with the exact patch that I couldn't get to crash before.

Crashed in intx mode as well.

I'm going to double check that I'm running the correct PNOR on my Talos II, not sure if I'm behind a version or not from the offical firmware.

I was not on the latest firmware. However, updating the firmware didn't seem to help.

However, the PCIE controller documentation for POWER9 is finally out, and the things I am seeing are stating to make sense! I'm pretty sure this crash is because a Freeze class error is occuring and FreeBSD doesn't have the recovery code implemented yet for that!

One definite improvement for me with this patch is that fsck seems to be able to recover the root filesystem after a crash without manual intervention, which I was not seeing before.

bdragon, one suggestion would be to wipe out the disk and do a fresh install with this patch for the installation environment. I curdled my filesystem and kernel experimenting before this, and it resulted in gibberish during boot before rootfs/fsck were ever in the picture.

@kbowling Nah, my fs is fine. I'm saying that this patch is an improvement in that when it crashes, it leaves the fs in a saner state than a crash without the patch.

I figured out the correct SCOM address for interrogating what error happened that is causing my controller freeze, but of course now that I have that figured out, it's not happening.

Made it through two make -j64 buildworld builds in a row without crashing.

Still don't know what's up with my system.

Two more make -j64 buildworld runs complete. I don't know why it's suddenly decided to stop crashing for me, but I will focus on the hardware side if it comes back.

I just reverted this patch to test as well. I ran 2 'make buildworld -j72 -s' and 1 'make buildworld buildkernel -j72 -s'. The performance difference I saw was not as great as I originally wrote, something else had improved performance elsewhere in the last couple weeks. I did, however, see a 5-10% improvement, so it's not entirely insignificant. I also did see "Missing interrupt" messages with the patch, they just did not occur as frequently as previously without.

That, however, is secondary to the real improvement: When I reverted, over the three build runs I made, I saw two nvme0 resets, preceded by data management errors. So I think even though this doesn't solve the problem completely, it does help a great deal.

I think we have to sync the PRP lists too. Give me a minute and I'll send you some pointers.

In nvme_payload_map, can you also sync the tr->prp_dma_map? Using the same qpair->dma_tag.

The PRP list is NVMe's version of scatter gather. The submission queue entry has 2 PRPs embedded, but if the I/O requires more than 2, then the second PRP points to a PRP list where additional PRPs can be specified.

I'm sure you already know this, but you only need to sync before we ring the doorbell for I/O submission. The device never writes to the PRP list - it only reads it.

There is no prp_dma_map. The prp_bus_addr exists in the qpair->queuemem_map, so does get synced. Or am I reading it wrong?

Sorry - I was looking at a much older version of the nvme(4) driver. You are correct - PRPs are getting synced as part of the queuemem_map.

That makes me even more concerned about the lack of a bus_dmamap_sync_range(). :-(

@jimharris is this good to commit as-is now, or should I look at more changes?

This looks good so far. Definitely a big improvement over what was there previously. I'll think about this some more, if there are any other syncs that we're missing.

This revision is now accepted and ready to land.Aug 3 2018, 4:54 PM
This revision was automatically updated to reflect the committed changes.