
nvme(4): Add bus_dmamap_sync() at the end of the request path
Closed, Public

Authored by jhibbits on Aug 2 2018, 4:56 PM.

Details

Summary

Some architectures, powerpc64 in this case, need explicit synchronization
barriers against device accesses.

Prior to this change, when running 'make buildworld -j72' on an 18-core
(72-thread) POWER9, I would see "Missing interrupt" errors and controller resets
often. Also, due to the missing interrupts and resets, it would take ~13
minutes to build world. After this change the "Missing interrupt" messages do
not appear, and buildworld times are 10-11 minutes.

The placement of the syncs may not be 100% correct.
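
For reference, a minimal sketch of the idea (not the literal diff; field names
such as dma_tag, queuemem_map and payload_dma_map follow the driver at the time
and should be treated as illustrative): device writes are synced before the CPU
reads them on the completion path, and CPU stores are synced before the doorbell
write on the submission path.

    /* completion path: make device DMA writes visible to the CPU */
    bus_dmamap_sync(qpair->dma_tag, qpair->queuemem_map,
        BUS_DMASYNC_POSTREAD | BUS_DMASYNC_POSTWRITE);
    bus_dmamap_sync(qpair->dma_tag, tr->payload_dma_map,
        BUS_DMASYNC_POSTREAD | BUS_DMASYNC_POSTWRITE);

    /* submission path: make CPU stores visible to the device
     * before ringing the doorbell */
    bus_dmamap_sync(qpair->dma_tag, qpair->queuemem_map,
        BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE);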

Diff Detail

Repository
rS FreeBSD src repository - subversion

Event Timeline

I actually made it through a buildworld without panicking for once with this patch. My kernel's not clean at the moment, but I will rerun with only this patch in a bit.

In D16570#351747, @git_bdragon.rtk0.net wrote:

I actually made it through a buildworld without panicking for once with this patch. My kernel's not clean at the moment, but I will rerun with only this patch in a bit.

To elaborate: I have been having a chronic issue on my Talos II where the nvme device crashes in a way that it stops sending interrupts and mmio reads return all bits set to 1, after which I hit a panic inside the controller reset because my kernel options are set for maximum paranoia. It seems that running with this patch may have eliminated these crashes for me.

jimharris added inline comments.
sys/dev/nvme/nvme_qpair.c
405

Hi Justin,

I'm no powerpc64 expert, so pardon any stupid questions I'm about to ask.

I think this bus_dmamap_sync should go inside the if (req->type != NVME_REQUEST_NULL) check on the next line. There are cases where an NVMe request has no payload. Obviously your patch still works, but I think it's more correct to only do the sync if there's data associated with the command.

-Jim
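
A sketch of the suggested placement, assuming this sits in nvme_qpair_complete_tracker (the guard and field names are as quoted above; the exact surrounding context is an assumption):

    if (req->type != NVME_REQUEST_NULL)
            bus_dmamap_sync(qpair->dma_tag, tr->payload_dma_map,
                BUS_DMASYNC_POSTREAD | BUS_DMASYNC_POSTWRITE);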

494

What's the impact of the dmamap_sync on a largish queue? For example, each cpl is only 16 bytes (NVMe completion entry size) but the queue can have 100s-1000s of possible entries. Ideally you want to only sync the part of the completion queue that you're about to read.

So I think this is correct, but just wondering if this will introduce some kind of performance issue.

831

should this sync go after the memcpy?

890

I think it might be better to move this into nvme_qpair_submit_tracker - keeping the sync for the payload and the submission queue entry in the same place.

But I think keeping it here is functionally correct.

This revision now requires changes to proceed. Aug 2 2018, 5:23 PM

Hi Jim,

I'll test the requested changes and update the patch shortly.

Thanks for the review!

  • Justin
sys/dev/nvme/nvme_qpair.c
405

Sounds good. I'll move it in there.

494

Funny you should mention this, because mmel@ just yesterday(?) mentioned a similar need for the sound system. It really would be better to have a bus_dmamap_sync_range() or similar API, but no such API currently exists.

This would pose an issue if bounce buffers are in play. If they're not, then on powerpc the overall cost will be a sync instruction. I'm not sure offhand when a bounce buffer is used on powerpc, so this could either be really costly, or almost free. Judging by the run_filter() function in sys/powerpc/powerpc/busdma_machdep.c, bounce buffers shouldn't come into play on powerpc in this case.

That said, someone should probably test this on x86, too, to see if any performance issues arise from it.
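
Roughly what such a ranged sync could look like - this signature did not exist in the tree at the time and is purely hypothetical, as are cpl_offset and the use of cq_head here:

    void bus_dmamap_sync_range(bus_dma_tag_t dmat, bus_dmamap_t map,
        bus_size_t offset, bus_size_t len, bus_dmasync_op_t op);

    /* sync only the 16-byte completion entry about to be read */
    bus_dmamap_sync_range(qpair->dma_tag, qpair->queuemem_map,
        cpl_offset + qpair->cq_head * sizeof(struct nvme_completion),
        sizeof(struct nvme_completion), BUS_DMASYNC_POSTREAD);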

831

Yes it should. Oops.

890

I considered that, yes. However, since you pointed out above that we could have a request with no payload, I tried to keep the payload sync isolated to where it's needed.

sys/dev/nvme/nvme_qpair.c
890

Oh - you're right. Yeah - it should stay here.

Address jimharris's comments.

  • Move the payload sync to inside the NVME_REQUEST_NULL guard.
  • Move the queuemem sync to after the memcpy, where it belongs. This replaces the wmb() with the sync.
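
A sketch of the submission-side ordering described in the second bullet above, with the sync taking over the role of the wmb() (names approximate nvme_qpair_submit_tracker at the time; treat them as illustrative): copy the command into the submission queue, sync so the device sees the stores, then ring the doorbell.

    memcpy(&qpair->cmd[qpair->sq_tail], &req->cmd, sizeof(req->cmd));
    bus_dmamap_sync(qpair->dma_tag, qpair->queuemem_map,
        BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE);
    nvme_mmio_write_4(qpair->ctrlr, doorbell[qpair->id].sq_tdbl,
        qpair->sq_tail);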

I recompiled with just this diff, and I was back to crashing during buildworld. Going to recompile with my local changes again so that, if it crashes again, I have my tools for examining the nvme queues more extensively from the kernel debugger.

   0  8051 18.14 737 13.06   0.00   0  0.00  37  0 26  0 37
   0 18956 53.56 759 39.70   0.00   0  0.00  29  0 15  0 56
   0  4322 63.36 637 39.44   0.00   0  0.00   8  0  4  0 88
   0    61 57.16 168  9.37   0.00   0  0.00   6  0  1  0 93
   0  4645 75.95 326 24.21   0.00   0  0.00   6  0  1  0 93
   0 28317 24.88 1054 25.61   0.00   0  0.00   9  0 12  0 79
   0 19142 17.52 395  6.76   0.00   0  0.00  36  0 23  0 41
nvme0: cpl does not map to outstanding cmd
cdw0:00000000 sqhd:00cf sqid:0005 cid:001c p:1 sc:00 sct:0 m:0 dnr:0
panic: received completion for unknown cmd

cpuid = 40
time = 1533234367
KDB: stack backtrace:
0xe00000016cc74380: at 0xc0000000029b9754 = .kdb_backtrace+0x5c
0xe00000016cc744b0: at 0xc000000002957bf8 = .vpanic+0x1b4
0xe00000016cc74570: at 0xc000000002957dd0 = .kassert_panic+0xf8
0xe00000016cc74620: at 0xc0000000026aacd4 = .nvme_qpair_process_completions+0x15c
0xe00000016cc746c0: at 0xc0000000026aaee0 = .nvme_qpair_process_completions+0x368
0xe00000016cc74740: at 0xc000000002913ee0 = .intr_event_handle+0x534
0xe00000016cc74830: at 0xc00000000290e51c = .fork_exit+0x104
0xe00000016cc748f0: at 0xc000000002d53230 = .fork_trampoline+0x10
0xe00000016cc74920: at -0x4
KDB: enter: panic
[ thread pid 12 tid 100199 ]
Stopped at      0xc0000000029b9320 = .kdb_enter+0x60:   ld      r2, r1, 0x28
db>
jimharris added inline comments.
sys/dev/nvme/nvme_qpair.c
838

Is it safe to remove the wmb? At least on x86, there is no explicit wmb in _bus_dmamap_sync. On x86, _bus_dmamap_sync is basically a nop unless there are bounce buffers in play.

I think the wmb() is still needed - probably after the bus_dmamap_sync.

This revision now requires changes to proceed. Aug 2 2018, 7:48 PM
sys/dev/nvme/nvme_qpair.c
838

Okay. On powerpc, there already is an explicit sync in the bus_dmamap_sync().

In this case, it probably makes sense to guard it on either !powerpc, or on x86.

Bring back the wmb() for !powerpc. powerpc's bus_dmamap_sync already includes
an explicit hwsync, but other archs do not.
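
A sketch of the compromise; the exact spelling of the guard (here the __powerpc__ predefine) is an assumption:

    bus_dmamap_sync(qpair->dma_tag, qpair->queuemem_map,
        BUS_DMASYNC_PREREAD | BUS_DMASYNC_PREWRITE);
#ifndef __powerpc__
    wmb();      /* order the SQ entry stores before the doorbell write */
#endif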

Looks good to me. Would be great if @git_bdragon.rtk0.net could retest with this latest patch.

OK, so unfortunately I had another crash when testing with only these changes - my original crash, where the controller straight up stops working and won't even reset.

I'm going to try again with a couple of extra bits that were in my local diff before the test and see if I can still get it to crash.

OK, it crashed with the local bits I was wondering about as well.

Going to do some tests with hw.nvme.force_intx=1 now, followed by retesting with the exact patch that I couldn't get to crash before.

Crashed in intx mode as well.

I'm going to double-check that I'm running the correct PNOR on my Talos II; I'm not sure whether I'm a version behind the official firmware or not.

I was not on the latest firmware. However, updating the firmware didn't seem to help.

However, the PCIe controller documentation for POWER9 is finally out, and the things I am seeing are starting to make sense! I'm pretty sure this crash happens because a Freeze class error is occurring and FreeBSD doesn't have the recovery code for that implemented yet!

One definite improvement for me with this patch is that fsck seems to be able to recover the root filesystem after a crash without manual intervention, which I was not seeing before.

bdragon, one suggestion would be to wipe out the disk and do a fresh install with this patch for the installation environment. I curdled my filesystem and kernel experimenting before this, and it resulted in gibberish during boot before rootfs/fsck were ever in the picture.

@kbowling Nah, my fs is fine. I'm saying that this patch is an improvement in that when it crashes, it leaves the fs in a saner state than a crash without the patch.

I figured out the correct SCOM address for interrogating which error is causing my controller freeze, but of course, now that I have that figured out, it's not happening.

Made it through two make -j64 buildworld builds in a row without crashing.

Still don't know what's up with my system.

Two more make -j64 buildworld runs complete. I don't know why it's suddenly decided to stop crashing for me, but I will focus on the hardware side if it comes back.

I just reverted this patch to test as well. I ran two 'make buildworld -j72 -s' runs and one 'make buildworld buildkernel -j72 -s'. The performance difference I saw was not as great as I originally wrote; something else had improved performance in the last couple of weeks. I did, however, see a 5-10% improvement, so it's not entirely insignificant. I also still saw "Missing interrupt" messages with the patch; they just did not occur as frequently as without it.

That, however, is secondary to the real improvement: When I reverted, over the three build runs I made, I saw two nvme0 resets, preceded by data management errors. So I think even though this doesn't solve the problem completely, it does help a great deal.

I think we have to sync the PRP lists too. Give me a minute and I'll send you some pointers.

In nvme_payload_map, can you also sync the tr->prp_dma_map? Using the same qpair->dma_tag.

The PRP list is NVMe's version of scatter gather. The submission queue entry has 2 PRPs embedded, but if the I/O requires more than 2, then the second PRP points to a PRP list where additional PRPs can be specified.

I'm sure you already know this, but you only need to sync before we ring the doorbell for I/O submission. The device never writes to the PRP list - it only reads it.
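
For readers unfamiliar with PRPs, a rough spec-level picture (illustrative only, not the driver's structures): the submission queue entry carries two PRP fields, and when a transfer spans more than two pages the second field holds the physical address of a list of further page addresses.

    /* in the submission queue entry: */
    uint64_t prp1;          /* physical address of the first data page */
    uint64_t prp2;          /* second data page, or physaddr of a PRP list */

    /* elsewhere in host memory, read-only to the device: */
    uint64_t prp_list[];    /* additional page-aligned physical addresses */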

There is no prp_dma_map. The prp_bus_addr lives in qpair->queuemem_map, so it does get synced. Or am I reading it wrong?

Sorry - I was looking at a much older version of the nvme(4) driver. You are correct - PRPs are getting synced as part of the queuemem_map.

That makes me even more concerned about the lack of a bus_dmamap_sync_range(). :-(

@jimharris is this good to commit as-is now, or should I look at more changes?

This looks good so far. Definitely a big improvement over what was there previously. I'll think about this some more, if there are any other syncs that we're missing.

This revision is now accepted and ready to land. Aug 3 2018, 4:54 PM
This revision was automatically updated to reflect the committed changes.