
powerpc64/powernv: Enable Partitionable Endpoint (PE) support
Needs Review · Public

Authored by tpearson_raptorengineering.com on Fri, Jan 16, 5:53 PM.

Details

Reviewers: jhibbits
Group Reviewers: PowerPC
Summary

powerpc64/powernv: Enable Partitionable Endpoint (PE) support

This is a fairly major rewrite of the IODA2/IODA3 codebase to enable
PE allocation on a per-bus basis, and to lay the groundwork required
to wire the IODA3 controller into the FreeBSD IOMMU framework.

While the 1:1 DMA mapping is still present, it is now enabled via
a proper TVT (TCE table) with associated setup functions. Likewise,
we no longer place all devices in PE#1, which (among other benefits)
prevents spurious MSI interrupts from interfering with unrelated
devices on the same PHB.

Because the segmented memory model is now in use for 64-bit BARs,
the PE allocation and DMA setup need to run in a secondary pass
after PCI bus and device resource allocation. This mirrors the
Linux reference code for IODA2/IODA3, and allows further isolation
between PCI devices.

Tested to operate without regressions on an RCS Blackbird system.

Test Plan

Tested on RCS POWER9 hardware (IODA3) and allocations look reasonable / no device regressions noted.

Ideally this should be tested on POWER8; there were some offers on the mailing list for access to old POWER8 hardware that could be used here. However, as this patch is foundational for IOMMU support, and most relevant hardware is POWER9, I'd prefer that this merge not be held up waiting for POWER8 testing.

Diff Detail

Repository: rG FreeBSD src repository
Lint: Skipped
Unit Tests: Skipped

Event Timeline

Adding the wider powerpc umbrella, so others can take a look as well.

My initial review is just style. Also, make sure your style is consistent within a file: you have some return (foo) and some return foo in the same file; stick with one.

Finally, if you upload the diff from the Phabricator UI, generate the diff via git diff -U999999 (or git show -U999999), so that we can see the full context instead of just the immediate context.

Aside from the style nits, which are all minor, it looks good, and it will be a big improvement over what we have currently. I want a second set of eyes to either review or test this, even if testing is just a POWER8 QEMU powernv VM.

sys/powerpc/powernv/opal_pci.c
621–622

The else should share the line with the closing '}' above. There are more of these below.

998

Minor style nit: Wrap these bitwise boolean tests in another set of parentheses, so it'd read more as:
if ((sc->off_sc.sc_range_mask & ((uint64_t)1 << i)) != 0)

Double-check the other conditionals as well.

sys/powerpc/powernv/opal_pci.h
97

Keep style consistent here, '{' at the end of the opening line.

sys/powerpc/powernv/opal_pci.c
207

Just remove this whole block, no need to #if 0 it.

I'm in the process of documenting / getting powernv8 and powernv9 QEMU guests up and running.
(And I now have a POWER8 booting FreeBSD powernv, so I can test it on real hardware as well.)

Stay tuned, thanks!

OK, now that I have POWER8 hardware up and running, what should I be on the lookout for?
Just the same devinfo/dmesg resource assignment, devices found, etc.?

Basically no regressions -- boot once with the stock kernel, take note of the drivers loaded, any failures, etc., then boot with the patch and see if anything new has broken or if the same drivers load / the same drivers fail. This, on its own, won't fix e.g. the LSI SAS driver, as that will need more work to figure out why the kernel is trying to map a memory BAR as an I/O BAR (???), but it lays the groundwork to set up the IOMMU and hopefully fix some of our memory pressure issues for device DMA.

sys/powerpc/powernv/opal_pci.c
266

I'd initialize pe_data_entry with NULL to make sure it's NULL when the list is empty
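
A minimal sketch of the suggested initialization; the struct and list names here are invented for illustration (only pe_data_entry comes from the review):

    struct pe_data *pe_data_entry = NULL;   /* stays NULL if the list is empty */
    struct pe_data *iter;

    TAILQ_FOREACH(iter, &sc->sc_pe_list, pde_link) {
        if (iter->mapping.phb_pe == pe) {
            pe_data_entry = iter;
            break;
        }
    }
    if (pe_data_entry == NULL)
        return (ENOENT);    /* empty list and not-found now look the same */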

272

you could use KASSERT(pe_data_entry->mapping.phb_pe != pe, ...)
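
For reference, the assertion would look something like this (message text illustrative):

    KASSERT(pe_data_entry->mapping.phb_pe != pe,
        ("%s: PE %ju already has a mapping entry", __func__, (uintmax_t)pe));

Note that KASSERT compiles away entirely in kernels built without INVARIANTS.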

292

It would be good to verify that the memory allocation was successful here as well, and to handle the case where it fails.
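
A sketch of the suggested check, assuming the allocation uses malloc(9) with M_NOWAIT (with M_WAITOK, malloc(9) cannot return NULL, so no check would be needed):

    pe_data_entry = malloc(sizeof(*pe_data_entry), M_DEVBUF,
        M_NOWAIT | M_ZERO);
    if (pe_data_entry == NULL)
        return (ENOMEM);    /* callers must be able to handle this */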

305

You should initialize pe_data_entry with NULL to make sure it will be NULL (and not garbage) when pe is not found

320

I think this line can be removed since pe_data_entry is freed next.

358

the "if" line can be removed

412

I suggest using KASSERT(pe_data_entry->mapping.device_count > 0, ....) as a bug catcher.

470

Initialize with NULL

524

Everywhere the opal_call return value is compared to 0, it should instead be compared to OPAL_SUCCESS, as defined by the OPAL API documentation.
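
As a sketch of the suggested pattern; the opal_call arguments here are illustrative, following the OPAL API for OPAL_PCI_MAP_PE_MMIO_WINDOW, and the token/constant names are assumptions about what opal.h provides:

    int err;

    err = opal_call(OPAL_PCI_MAP_PE_MMIO_WINDOW, sc->phb_id, pe,
        OPAL_M64_WINDOW_TYPE, window_num, segment_num);
    if (err != OPAL_SUCCESS) {  /* symbolic name rather than a bare 0 */
        device_printf(dev, "OPAL MMIO window map failed: %d\n", err);
        return (ENXIO);
    }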

650

I'd move this to the beginning of the function (or before the panic call).

885

remove dead code

1015

check opal_call return

1290

check if opal_call returned OPAL_SUCCESS (or 0)

1413

check opal_call return

1515

if and kassert redundancy
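
For readers following along, the redundancy being pointed out is the generic pattern of asserting a condition and then re-checking it at runtime (illustrative code, not the actual driver; the same applies to the later "KASSERT and if redundancy" comments):

    KASSERT(entry != NULL, ("entry is NULL"));
    if (entry == NULL)  /* dead under INVARIANTS; hides the bug without it */
        return;

Keep either the assertion or the runtime check, not both.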

1523

Would it be possible to print an error message or panic, so we can be aware that we are hitting this case?

1592

KASSERT and if redundancy

1882

if and kassert redundancy

1889

check opal_call return

(I still haven't forgotten about this diff; I'm going to test it in VMs and on power8 hardware this week.)

When testing, does anyone else have access to a SATA controller that does DMA? I'm sporadically seeing the Blackbird's AHCI controller lock up but I don't know if this is a PE freeze, bad DMA, or something completely unrelated (flaky cabling?):

ahcich1: is ffffffff cs ffffffff ss ffffffff rs ffffffff tfd ffffffff serr ffffffff cmd ffffffff
(ada0:ahcich1:0:0:0): READ_FPDMA_QUEUED. ACB: 60 40 e8 35 8b 40 24 00 00 00 00 00
(ada0:ahcich1:0:0:0): CAM status: Command timeout
ahcich1: AHCI reset...
(ada0:ahcich1:0:0:0): Retrying command, 3 more tries remain
ahcich1: stopping AHCI engine failed
ahcich1: SATA connect timeout time=100000us status=ffffffff
ahcich1: AHCI reset: device not found
pass0 at ahcich1 bus 0 scbus1 target 0 lun 0
pass0: <Hitachi HDS722020ALA330 JKAOA3MA> s/n JK11A8B9H82U0F detached
ada0 at ahcich1 bus 0 scbus1 target 0 lun 0
ada0: <Hitachi HDS722020ALA330 JKAOA3MA> s/n JK11A8B9H82U0F detached
g_vfs_done(): ada0p2 converting all errors to ENXIO
g_vfs_done():ada0p2[WRITE(offset=212411777024, length=32768)]error = 6 suppressing further ENXIO
panic: UFS: root fs would be forcibly unmounted

my POWER8 box does, but that may not be good enough?

In D54745#1259084, @adrian wrote:
> my POWER8 box does, but that may not be good enough?

Actually, that would be a great test -- I don't know if this issue is IODA3-specific or a general DMA issue with the powernv driver.

OK, it's definitely unhappy, stay tuned!

So it didn't finish booting:

cpu0: <Open Firmware CPU> on cpulist0
pcib0: Invalid subordinate bus count 255, defaulting to exact bus match
pcib0: Mapped PE# fd to bus
pcib0: PE# fd TCE invalidation failed: -7
pcib0: Mapped PE# fe to bus
pcib2: Overriding PE# fd to PE# 0 due to 64-bit segmented memory constraint
pcib2: Mapped PE# 0 to bus
pcib2: PE# 0 TCE invalidation failed: -7
pcib2: Mapped PE# fe to bus
pcib4: Invalid subordinate bus count 255, defaulting to exact bus match
pcib4: Mapped PE# fd to bus
pcib4: PE# fd TCE invalidation failed: -7
pcib4: Mapped PE# fe to bus
pcib6: Overriding PE# fd to PE# 0 due to 64-bit segmented memory constraint
pcib6: Invalid subordinate bus count 17, defaulting to exact bus match
pcib6: Mapped PE# 0 to bus
pcib6: PE# 0 TCE invalidation failed: -7
pcib6: Mapped PE# fe to bus
pcib6: No I/O port support, ignoring device I/O resource
pcib6: No I/O port support, ignoring device I/O resource
...
...
pcib6: No I/O port support, ignoring device I/O resource
pcib6: No I/O port support, ignoring device I/O resource
pcib16: Invalid subordinate bus count 255, defaulting to exact bus match
pcib16: Mapped PE# fd to bus
pcib16: PE# fd TCE invalidation failed: -7
pcib16: Mapped PE# fe to bus

Then all the Broadcom NICs failed to attach:

pci3: <network> at device 0.0 (no driver attached)
bge0: <PCIe2 4-port 1GbE Adapter, ASIC rev. 0x5719001> mem 0x250000000000-0x25000000ffff,0x250000010000-0x25000001ffff,0x250000020000-0x25000002ffff irq 71673 at device 0.0 numa-domain 1 on pci9
bge0: CHIP ID 0x05719001; ASIC REV 0x5719; CHIP REV 0x57190; PCI-E
bge0: firmware handshake timed out, found 0x4b657654
bge0: PHY read timed out (phy 8, reg 1, val 0xffffffff)
bge0: Try again
bge0: PHY write timed out (phy 8, reg 0, val 0x8000)
bge0: PHY read timed out (phy 8, reg 1, val 0xffffffff)
bge0: Try again
bge0: PHY write timed out (phy 8, reg 0, val 0x8000)
bge0: PHY read timed out (phy 8, reg 1, val 0xffffffff)
bge0: Try again
bge0: PHY write timed out (phy 8, reg 0, val 0x8000)
bge0: PHY read timed out (phy 8, reg 1, val 0xffffffff)
bge0: Try again
bge0: PHY write timed out (phy 8, reg 0, val 0x8000)
bge0: PHY read timed out (phy 8, reg 1, val 0xffffffff)
bge0: attaching PHYs failed
device_attach: bge0 attach returned 6

Then a panic:

   exception       = 0x600 (alignment)
   virtual address = 0xc0080001a9ed1107
   srr0            = 0xc0000000022dd010 (0xeed010)
   srr1            = 0x9000000000001033
   current msr     = 0x9000000000001033
   lr              = 0xc000000001a84648 (0x694648)
   frame           = 0xc00800000000bf80
   curthread       = 0xc000000002d667e0
          pid = 0, comm = kernel

panic: alignment trap
cpuid = 0
time = 1
KDB: stack backtrace:
0xc00800000000bcb0: at vpanic+0x1ac
0xc00800000000bd60: at panic+0x40
0xc00800000000bd90: at trap+0x300
0xc00800000000bec0: at powerpc_interrupt+0x1cc
0xc00800000000bf50: kernel ALI trap @ 0xc0080001a9ed1107 (xSR 0x2000000) bs_remap_earlyboot+0x430: srr1=0x9000000000001033
            r1=0xc00800000000c200 cr=0x84800c08 xer=0 ctr=0xc0000000022dd010 r2=0xc000000002e4d000 frame=0xc00800000000bf80
0xc00800000000c200: at pci_alloc_resource+0x118
0xc00800000000c320: at xhci_pci_attach+0x260
0xc00800000000c3e0: at device_attach+0x568
0xc00800000000c4c0: at bus_generic_new_pass+0x198
0xc00800000000c510: at bus_generic_new_pass+0x10c
0xc00800000000c560: at bus_generic_new_pass+0x10c
0xc00800000000c5b0: at bus_generic_new_pass+0x10c
0xc00800000000c600: at bus_generic_new_pass+0x10c

OK, thanks for testing. I've already spent most of the day going over the codebase trying to figure out what might be going wrong, and all I've come up with so far is that the 32-bit MMIO window setup and the DMA configuration both make no sense. It shouldn't even be working at all on POWER9 with the stock code (without this patch).

First, this patch does seem to mess up the 32-bit MMIO windows. We're supposed to allocate only what we need for end-device MMIO, but instead we allocate the entire bridge MMIO window on every bus.

Second, and this is what I really don't understand yet, the DMA tag is absolutely supposed to exclude the 32-bit MMIO window (from sc->m32_pci_base - 1 to BUS_SPACE_MAXADDR_32BIT). Yet, every time I exclude that region, the whole system locks up as soon as the AHCI DMA is set up. It's almost like the system is doing DMA inside the MMIO window, which works most of the time, but at the same time this doesn't really make sense.
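
For context, in bus_dma_tag_create(9) the lowaddr/highaddr pair describes the window of bus address space the device cannot reach directly (all addresses greater than lowaddr and less than or equal to highaddr), and busdma bounces any allocation that lands inside it. A hedged sketch of the exclusion described above (sc->m32_pci_base and sc->dma_tag are field names taken from this comment, not necessarily the driver's):

    int error;

    error = bus_dma_tag_create(bus_get_dma_tag(dev),
        1, 0,                       /* alignment, boundary */
        sc->m32_pci_base - 1,       /* lowaddr: last directly reachable address */
        BUS_SPACE_MAXADDR_32BIT,    /* highaddr: top of the excluded window */
        NULL, NULL,                 /* no filter */
        BUS_SPACE_MAXSIZE,          /* maxsize */
        BUS_SPACE_UNRESTRICTED,     /* nsegments */
        BUS_SPACE_MAXSIZE,          /* maxsegsz */
        0, NULL, NULL, &sc->dma_tag);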

Could you do me a favor and try a boot on POWER8 with the stock kernel, but with the DMA tag setup uncommented? I want to know if that breaks everything on POWER8 too, or just POWER9.

Digging further, at least some of the problem here seems to be from our rather unique memory allocation on POWER9:

Physical memory chunk(s):
0x0000000000003000 - 0x0000000000002fff, 0 bytes (0 pages)
0x000000000000e000 - 0x000000000000ffff, 8192 bytes (2 pages)
0x0000000000094000 - 0x0000000000ffffff, 16171008 bytes (3948 pages)
0x0000000100000000 - 0x00000007a2042fff, 28487987200 bytes (6955075 pages)
0x00000007d0006000 - 0x00000007fc72dfff, 745701376 bytes (182056 pages)
0x00000007fdc00000 - 0x00000007ff79ffff, 28966912 bytes (7072 pages)
0x00000007ff7d1000 - 0x00000007ff7effff, 126976 bytes (31 pages)

Can you boot the kernel on POWER8 with the -v flag? I suspect I'm going to see a lot more low memory available.

In any case, the errors on POWER8 are all stemming from the fact that we can't seem to set up the PE structure correctly. I have no idea what the bus structure is; is there any way to boot under Linux and send a dmesg and lspci -tvnn this way? In particular, I have no idea why the POWER8 box is not only failing TCE invalidation (??) but is providing a subordinate bus count of 255 or 17 (!?)....

Also /proc/iomem, which I think has what you're after in more specific detail:

https://people.freebsd.org/~adrian/powerpc64/s822lc/20260203-s822lc-iomem.txt

Thanks for that!

In the interim, I've tracked down the disappearing low memory in D55095. It's still not enough to properly boot, in that we still run out of low memory if the DMA tag is properly set to exclude the MMIO window, but at least that issue explains a lot of the other weirdness I was seeing over time.

After overnight stress testing, a combination of D55095 and using the correct DMA tag seems to have completely resolved the AHCI instability I was seeing with this patchset applied.

I have a fairly good idea of what is causing the POWER8 issues. Let me crank on this for a bit and generate a new version of this patchset to test on POWER8; once POWER8 is fixed I think we should be in good shape here.

tpearson_raptorengineering.com added inline comments.
sys/powerpc/powernv/opal_pci.c
1015

This is supposed to silently fail as we unconditionally clear the freeze. If the PE is not frozen, it will return an error.
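
A sketch of the pattern being described, with token names following the OPAL API (their presence in opal.h is an assumption):

    /*
     * Unconditionally clear any freeze; OPAL returns an error when the
     * PE was not frozen, which is expected, so the result is discarded.
     */
    (void)opal_call(OPAL_PCI_EEH_FREEZE_CLEAR, sc->phb_id, pe,
        OPAL_EEH_ACTION_CLEAR_FREEZE_ALL);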

1290

This is supposed to silently fail as we unconditionally clear the freeze. If the PE is not frozen, it will return an error.

1413

This is supposed to silently fail as we unconditionally clear the freeze. If the PE is not frozen, it will return an error.