Page MenuHomeFreeBSD

Fix random DMA failure on powernv
ClosedPublic

Authored by bdragon on Sep 16 2018, 9:20 PM.

Details

Summary

Finally, the root cause for my random NVME failures.

Turns out that we need to avoid using the top 64k of the 32 bit address space for DMA (even 64 bit DMA), as any DMA to those physical addresses will be interpreted as MSI interrupts instead of DMA and cause the PHB to fence.

If we just mark this range as reserved, we can avoid crashes without needing to make changes to how we are setting the machine up.

Diff Detail

Repository
rS FreeBSD src repository
Lint
Lint Skipped
Unit
Unit Tests Skipped

Event Timeline

bdragon created this revision.Sep 16 2018, 9:20 PM
bdragon edited the summary of this revision. (Show Details)Sep 16 2018, 9:23 PM

Physical map before:

real memory  = 68508909568 (65335 MB)
available KVA = 34359590911 (32767 MB)
Physical memory chunk(s):
0x0000000000003000 - 0x0000000000002fff, 0 bytes (0 pages)
0x000000000000d000 - 0x000000000000ffff, 12288 bytes (3 pages)
0x0000000001e38000 - 0x00000000023dffff, 5931008 bytes (1448 pages)
0x0000000038342000 - 0x000000007ffaffff, 1204215808 bytes (293998 pages)
0x000000007fffa000 - 0x000000007fffffff, 24576 bytes (6 pages)
0x00000000c0000000 - 0x00000007ffffffff, 31138512896 bytes (7602176 pages)
0x0000200002a44000 - 0x00002007a0d36fff, 32718663680 bytes (7987955 pages)
0x00002007fe000000 - 0x00002007ff7fffff, 25165824 bytes (6144 pages)
0x0000000003b9c000 - 0x000000002fffffff, 742801408 bytes (181348 pages)
avail memory = 65763840000 (62717 MB)

Physical map after:

real memory  = 68508848128 (65335 MB)
available KVA = 34359590911 (32767 MB)
Physical memory chunk(s):
0x0000000000003000 - 0x0000000000002fff, 0 bytes (0 pages)
0x000000000000d000 - 0x000000000000ffff, 12288 bytes (3 pages)
0x0000000001e38000 - 0x00000000023dffff, 5931008 bytes (1448 pages)
0x0000000038342000 - 0x000000007ffaffff, 1204215808 bytes (293998 pages)
0x000000007fffa000 - 0x000000007fffffff, 24576 bytes (6 pages)
0x00000000c0000000 - 0x00000000fffeffff, 1073676288 bytes (262128 pages)
0x0000000100000000 - 0x00000007ffffffff, 30064771072 bytes (7340032 pages)
0x0000200002a44000 - 0x00002007a0d36fff, 32718663680 bytes (7987955 pages)
0x00002007fe000000 - 0x00002007ff7fffff, 25165824 bytes (6144 pages)
0x0000000003b9c000 - 0x000000002fffffff, 742801408 bytes (181348 pages)
avail memory = 65763774464 (62717 MB)

Can we make the check the presence of a PHB4 node in the device tree rather than an #ifdef? It might also be best to do this in platform_powernv.c, but that's a bit larger of a project.

How about /interrupt-controller@0 is compatible with "ibm,opal-xive-vc" as a check then? That seems to me like a good way to target P9 xics emulation specifically, without matching on P8.

bdragon added a comment.EditedSep 17 2018, 7:18 PM

hmm, actually, phb3 does the same thing for MSI when initting ioda2 as phb4 does for ioda3. Is this bug also happening on P8?

bdragon added a comment.EditedSep 17 2018, 8:31 PM

Oh, right, I forgot to also put in my original symptoms and train of thought, for people who weren't following along with my chasing of this issue:

Hitting this glitch will manifest as a pci-e device becoming completely unresponsive and unable to be reset. This was happening to me for NVME devices, and happening often enough that surviving a buildworld was impossible except on certain boots where things would somehow work fine.

This is becuase the entire PHB gets fenced when a pci-e device initiates a DMA transfer to phyiscal (actually IIRC it's virtual, but since it's direct-mapped this distinction doesn't really matter) addresses between 0x00000000ffff0000 and 0x00000000ffffffff. DMA to this range is treated as 32 bit MSI interrupts instead of DMA. The lower levels of the controller will attempt to parse the data as an MSI interrupt, with specific restrictions to the transfer length and alignment (16 byte aligned, size < 16 byte. see phb4 spec, page 36). Any violation of these restrictions causes the PHB to notice that something is wrong, and it immediately fences the entire PHB.

I was running with a modified NVME driver so I could catch this case early (I had an observer thread checking controller status every second, so I could panic without having to wait for timeouts to start landing) where I would immediately panic so I could look around in the debugger.

I realized several days ago that I could run the "pci_clearerr" command to cause OPAL to notice the fenced PHB and spit out diagnostic information to the system console, like this:

db> pci_clearerr
[ 1652.330848800,3] PHB#0000[0:0]: PHB Freeze/Fence detected !
[ 1652.330899277,3] PHB#0000[0:0]:             PCI FIR=2000000000000000
[ 1652.330942081,3] PHB#0000[0:0]:         PCI FIR WOF=2000000000000000
[ 1652.330987758,3] PHB#0000[0:0]:            NEST FIR=0000800000000000
[ 1652.331028004,3] PHB#0000[0:0]:        NEST FIR WOF=0000800000000000
[ 1652.331073599,3] PHB#0000[0:0]:            ERR RPT0=0000000000000001
[ 1652.331122088,3] PHB#0000[0:0]:            ERR RPT1=0000000000000000
[ 1652.331159624,3] PHB#0000[0:0]:             AIB ERR=0000200000000000
[ 1652.331984774,3] PHB#0000[0:0]: brdgCtl        = 00000002
[ 1652.332045136,3] PHB#0000[0:0]:             deviceStatus = 00000020
[ 1652.332087953,3] PHB#0000[0:0]:               slotStatus = 00402000
[ 1652.332135626,3] PHB#0000[0:0]:               linkStatus = e0410008
[ 1652.332175707,3] PHB#0000[0:0]:             devCmdStatus = 00100107
[ 1652.332225804,3] PHB#0000[0:0]:             devSecStatus = 00000000
[ 1652.332268253,3] PHB#0000[0:0]:          rootErrorStatus = 00000000
[ 1652.332313165,3] PHB#0000[0:0]:          corrErrorStatus = 00000000
[ 1652.332352640,3] PHB#0000[0:0]:        uncorrErrorStatus = 00000000
[ 1652.332398621,3] PHB#0000[0:0]:                   devctl = 00000020
[ 1652.332439076,3] PHB#0000[0:0]:                  devStat = 00000000
[ 1652.332483778,3] PHB#0000[0:0]:                  tlpHdr1 = 00000000
[ 1652.332525863,3] PHB#0000[0:0]:                  tlpHdr2 = 00000000
[ 1652.332570692,3] PHB#0000[0:0]:                  tlpHdr3 = 00000000
[ 1652.332615526,3] PHB#0000[0:0]:                  tlpHdr4 = 00000000
[ 1652.332660216,3] PHB#0000[0:0]:                 sourceId = 00000000
[ 1652.332699485,3] PHB#0000[0:0]:                     nFir = 0000800000000000
[ 1652.332747116,3] PHB#0000[0:0]:                 nFirMask = 0030001c00000000
[ 1652.332794235,3] PHB#0000[0:0]:                  nFirWOF = 0000800000000000
[ 1652.332844059,3] PHB#0000[0:0]:                 phbPlssr = 0000001800000000
[ 1652.332891422,3] PHB#0000[0:0]:                   phbCsr = 0000001800000000
[ 1652.332938554,3] PHB#0000[0:0]:                   lemFir = 0000000120000100
[ 1652.332985840,3] PHB#0000[0:0]:             lemErrorMask = 0000000000000000
[ 1652.333032856,3] PHB#0000[0:0]:                   lemWOF = 0000000020000000
[ 1652.333080408,3] PHB#0000[0:0]:           phbErrorStatus = 00000c8000000000
[ 1652.333127441,3] PHB#0000[0:0]:      phbFirstErrorStatus = 0000080000000000
[ 1652.333174707,3] PHB#0000[0:0]:             phbErrorLog0 = 2148000098000240
[ 1652.333227134,3] PHB#0000[0:0]:             phbErrorLog1 = a008400000000000
[ 1652.333272119,3] PHB#0000[0:0]:        phbTxeErrorStatus = 0000000000000000
[ 1652.333319258,3] PHB#0000[0:0]:   phbTxeFirstErrorStatus = 0000000000000000
[ 1652.333371432,3] PHB#0000[0:0]:          phbTxeErrorLog0 = 0000000000000000
[ 1652.333434224,3] PHB#0000[0:0]:          phbTxeErrorLog1 = 0000000000000000
[ 1652.333497183,3] PHB#0000[0:0]:     phbRxeArbErrorStatus = 2000000000000000
[ 1652.333544478,3] PHB#0000[0:0]: phbRxeArbFrstErrorStatus = 2000000000000000
[ 1652.333592094,3] PHB#0000[0:0]:       phbRxeArbErrorLog0 = 0100010000000000
[ 1652.333639486,3] PHB#0000[0:0]:       phbRxeArbErrorLog1 = 2f000000ffffe000
[ 1652.333686878,3] PHB#0000[0:0]:     phbRxeMrgErrorStatus = 0000000000000001
[ 1652.333736658,3] PHB#0000[0:0]: phbRxeMrgFrstErrorStatus = 0000000000000001
[ 1652.333783795,3] PHB#0000[0:0]:       phbRxeMrgErrorLog0 = 0000000000000000
[ 1652.333833527,3] PHB#0000[0:0]:       phbRxeMrgErrorLog1 = 0000000000000000
[ 1652.333880588,3] PHB#0000[0:0]:     phbRxeTceErrorStatus = 0000000000000000
[ 1652.333948865,3] PHB#0000[0:0]: phbRxeTceFrstErrorStatus = 0000000000000000
[ 1652.333998593,3] PHB#0000[0:0]:       phbRxeTceErrorLog0 = 0000000000000000
[ 1652.334048280,3] PHB#0000[0:0]:       phbRxeTceErrorLog1 = 0000000000000000
[ 1652.334095272,3] PHB#0000[0:0]:        phbPblErrorStatus = 0000000000000000
[ 1652.334142495,3] PHB#0000[0:0]:   phbPblFirstErrorStatus = 0000000000000000
[ 1652.334197481,3] PHB#0000[0:0]:          phbPblErrorLog0 = 0000000000000000
[ 1652.334244504,3] PHB#0000[0:0]:          phbPblErrorLog1 = 0000000000000000
[ 1652.334291657,3] PHB#0000[0:0]:      phbPcieDlpErrorLog1 = 0000000000000000
[ 1652.334341693,3] PHB#0000[0:0]:      phbPcieDlpErrorLog2 = 0000000000000000
[ 1652.334399597,3] PHB#0000[0:0]:    phbPcieDlpErrorStatus = 0000000000000000
[ 1652.334449315,3] PHB#0000[0:0]:       phbRegbErrorStatus = 0050000000000000
[ 1652.334499269,3] PHB#0000[0:0]:  phbRegbFirstErrorStatus = 0010000000000000
[ 1652.334554650,3] PHB#0000[0:0]:         phbRegbErrorLog0 = 8800004800000000
[ 1652.334612516,3] PHB#0000[0:0]:         phbRegbErrorLog1 = 0000000007011000
[ 1652.334661565,3] PHB#0000[0:0]: eeh_freeze_clear on fenced PHB
[ 1652.334709811,3] PHB#0000[0:0]: eeh_freeze_clear on fenced PHB
[ 1652.334754932,3] PHB#0000[0:0]: eeh_freeze_clear on fenced PHB
[ 1652.334799783,3] PHB#0000[0:0]: eeh_freeze_clear on fenced PHB
[ 1652.334844649,3] PHB#0000[0:0]: eeh_freeze_clear on fenced PHB
[ 1652.334884405,3] PHB#0000[0:0]: eeh_freeze_clear on fenced PHB
[ 1652.334926405,3] PHB#0000[0:0]: eeh_freeze_clear on fenced PHB
[ 1652.334965417,3] PHB#0000[0:0]: eeh_freeze_clear on fenced PHB
[ 1652.335009412,3] PHB#0000[0:0]: eeh_freeze_clear on fenced PHB
[ 1652.335048276,3] PHB#0000[0:0]: eeh_freeze_clear on fenced PHB

The important bit here is the phbRxeArbFrstErrorStatus = 2000000000000000 line. This bit stands for "MSI Size Error" and was documented as "An MSI interrupt was received on the BLIF inbound interface with a size greater than one data beat (16-byte)." (phb4 spec, page 145)

Timothy Pearson at Raptor Engineering pointed out that the PHB4 has a feature in its configuration register (phb4 spec, section 4.4.2.8 PHB4 Configuration Register, p. 73) whereby it can be configured to interpret inbound memory write operations to a specific address range as being MSI interrupts rather than actual DMA.

I checked the register on my machine, and sure enough, the 32-bit MSI Support Enable bit was set. (as well as the 64-bit MSI Support Enable bit, but that is a default value.)

It appears to be set intentionally by OPAL (see hw/phb4.c:phb4_init_ioda3() in skiboot source), likely as part of supporting XICS emulation. It also cuts this 64KB out of the 32 bit memory range in the 32 bit range advice it is providing in the device tree.

Due to the way FreeBSD sets up memory currently, with the 64-bit DMA window starting at 0x0 instead of a point above 0x00000000ffffffff, it's possible (nearly 100% likely in my case at least) that a memory allocation for recieving an nvme completion will eventually end up crossing into the magic address range. The submission is sent, the nvme processes it, the nvme attempts to post the completion, and boom, the entire PHB fences, because as far as the PHB was concerned, the nvme just did a VERY invalid MSI interrupt.

I initially tried to solve this by leaning on the ability for the dmamap to exclude ranges, but it appears there is a separate bug currently where bus_dmamap_sync will crash, possibly due to nobody trying to use it to do a 64 bit sync while skipping a very specific 32 bit range until me. Given that it's generally used to bounce requests that are out of 32 bit range down to 32 bit range where they can be serviced, instead of bouncing requests in 32 bit range with bad addresses *up* to 64 bit range somewhere, this is understandable.

So my solution instead is to just mark the physical address range as forbidden, so there's no chance that the bad addresses can be handed out for DMA transfers.

bdragon updated this revision to Diff 48239.Sep 19 2018, 8:02 PM

Add a check for powernv by running the opal_check().

nwhitehorn accepted this revision.Sep 19 2018, 8:24 PM
This revision is now accepted and ready to land.Sep 19 2018, 8:24 PM
This revision was automatically updated to reflect the committed changes.