Finally, the root cause of my random NVMe failures.
It turns out that we need to avoid using the top 64 KiB of the 32-bit address space (0xFFFF0000 through 0xFFFFFFFF) for DMA: any DMA to those physical addresses is interpreted as an MSI interrupt instead of a memory access, and that causes the PHB to fence.
If we just mark this range as reserved, we can avoid the crashes without changing how we set the machine up.
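
To make the arithmetic concrete, here is a minimal sketch of the range in question and an overlap check against it. It only illustrates the address math and is not the actual fix; the names `MSI_WINDOW_START`, `MSI_WINDOW_END`, and `overlaps_msi_window` are made up for this example, and it assumes the window is exactly the top 64 KiB below 4 GiB.

```python
# Hypothetical names; assumes the MSI window is the top 64 KiB of the
# 32-bit address space, as described above.
MSI_WINDOW_SIZE = 64 * 1024
MSI_WINDOW_START = (1 << 32) - MSI_WINDOW_SIZE  # 0xFFFF0000
MSI_WINDOW_END = (1 << 32) - 1                  # 0xFFFFFFFF (inclusive)


def overlaps_msi_window(addr: int, length: int) -> bool:
    """Return True if the buffer [addr, addr + length) touches the window."""
    if length <= 0:
        return False
    last = addr + length - 1  # last byte of the buffer, inclusive
    return addr <= MSI_WINDOW_END and last >= MSI_WINDOW_START


if __name__ == "__main__":
    print(hex(MSI_WINDOW_START), hex(MSI_WINDOW_END))
    # A 4 KiB buffer ending right below the window is fine...
    print(overlaps_msi_window(0xFFFEF000, 0x1000))
    # ...but one byte more spills into the reserved range.
    print(overlaps_msi_window(0xFFFEF000, 0x1001))
```

Reserving the range up front means an allocator never hands out such an address in the first place, so a check like this would only be useful as a sanity assertion.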