The architecture nexus should handle allocation and release of memory and
interrupts. These need to be visible to the rest of the kernel and not
hidden in the OpenFirmware code.
This is a very early tentative proposal. The issue is without this resource allocations which get to the nexus may be given ranges overlapping with what the OpenFirmware code has. Two drivers trying to use the same physical address range will be Bad(tm).
Looks like every variant of nexus.c except for PPC is already prepared to handle memory/interrupt allocations and therefore the correct solution is to simply forward all allocations to the nexus. There is likely a need to introduce similar code in sys/powerpc/powerpc/nexus.c.
Trick would then be every variant of sys/*/*/nexus.c would have mostly identical code for handling these allocations. Time to introduce sys/dev/bus/root.c? Difficulty here is there is no architecture-independent function for retrieving the highest physical address (for mem_rman.rm_end).
This is needed for the attempt at getting Xen operational on ARM. When running on Xen there is a need to allocate physical addresses and tell the hypervisor to map virtual devices there. If both nexus.c and ofwbus.c only have part of the story of which physical addresses are used, trouble will ensue.
To state the problem as I understand it: ofwbus fails to reserve or report its resource usage back to nexus, violating an assumption of newbus. This is of no consequence when all devices allocate resources through ofwbus, but becomes an issue when a child of nexus needs to allocate an unused resource region (this will be the case for Xen's PV bus on arm64).
As you point out, simply removing these routines is an insufficient solution on powerpc; instead, they should probably be moved back into powerpc's nexus.c. Other architectures are set up to handle allocations from nexus already, but we would need verify that things still behave properly on each platform if this change is applied.
Tagging @nwhitehorn, who might have input on this.
@mhorne, you got it exactly right. Issue is ofwbus is keeping everything to itself, which is fine if *everything* goes through ofwbus, but a major problem for anything independent of ofwbus (such as Xen PV).
I hadn't tracked down when it happened, so that is interesting to see it originated on PPC. Thing is what was being done was almost right. Quite a bit of sys/*/*/nexus.c is common code which should be merged together, but it needs to move to a nexus_common.c or root.c which is $somewhere (sys/dev? sys/kern?).
Updating with a potential move to sys/powerpc/powerpc/nexus.c. I've tried to move this a bit closer to the other nexuses. Most others use static variables in the file, rather than the softc. I've also opted to limit the regions for the resource managers since the resource manager code actually breaks when you use the 0 to ~0 range.
Okay, I was able to boot this on a few devices for arm, arm64, and riscv.
For arm, I needed to add an intr_rman and its handling to its nexus bus (see D32357).
Otherwise, devices attached fine and I was able to see that the expected changes in the output of devinfo -u; that is, the same resources allocated from ofwbus are now allocated from nexus. Check out P517 and P518.
I was unable to test mips, mainly since the only platform I have access to (MALTA64 in QEMU) does not use FDT. Given the state of the port and its planned removal from CURRENT, this shouldn't be a blocker.
Someone with PowerPC hardware will need to do the same testing, ideally on both 32 and 64-bit.
Please do not change the region boundaries, we want to maintain the same behaviour before and after moving the code around.
Moving this out of ofwbus means that arm, arm64, mips, and riscv are now without a bus_adjust_resource method. I think this is okay, it is not used very widely and arm64 with ACPI has always been without it. Still, it will be worth noting in the final commit message, as this could be a regression.
I've run into some an additional oddity with the Open Firmware core. Only allocated resources are visible in devinfo.
The device-tree provided by Xen (see: D29875) provides an interrupt and a memory range. The memory range is completely empty, but is provided as a suggested range for mapping the Xen grant table. This range is completely invisible unless Xen's device-tree code allocates it.
Each time I run across the default range I hear klaxons screaming "uninitialized! uninitialized! uninitialized!" (new form of Dalek?). I really dislike leaving this alone, even if that ends up being a separate diff.
Huh. I hadn't noticed any issues, but that is certainly worthy of mention.
Well, they are not uninitialized, but initialized to the default/maximum values. Now that the overflow in rman_reserve_resource_bound() has been fixed we shouldn't fear using this range, it is meant to be supported.
I've tested these changes on a PPC64 VM. With them, the virtual SCSI device fails to attach, making it impossible to mount root.
Below are the relevant dmesg messages:
ofwbus0: <Open Firmware Device Tree> on nexus0 xicp0: <External Interrupt Presentation Controller> on ofwbus0 xicp0: Handling CPUs 0-31 vdevice0: <POWER Hypervisor Virtual Device Root> on ofwbus0 vscsi0: <POWER Hypervisor Virtual SCSI Bus> irq 16781571 on vdevice0 vscsi0: Could not allocate IRQ device_attach: vscsi0 attach returned 6
The errors are similar on real hardware, where it fails to attach aacraid and xhci:
aacraid0: <Adaptec RAID Controller> mem 0x80000000-0x800fffff,0x80180000-0x801803ff irq 1038328 at device 0.0 numa-domain 0 on pci5 aacraid0: Async. mode not supported by current driver, sync. mode enforced. Please update driver to get full performance. aacraid0: Enable Raw I/O aacraid0: Enable 64-bit array aacraid0: using line interrupts aacraid0: can't allocate interrupt device_attach: aacraid0 attach returned 22 xhci0: <XHCI (generic) USB 3.0 controller> mem 0x80000000-0x8000ffff,0x80010000-0x80011fff irq 1036280 at device 0.0 numa-domain 0 on pci7 xhci0: 64 bytes context size, 64-bit DMA xhci0: Could not allocate IRQ device_attach: xhci0 attach returned 6
The issue can be reproduced with a non-accelerated VM too, on non-powerpc hardware, with a command like:
qemu-system-ppc64 -cpu power8 -machine pseries,cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken -m 1G -drive file=disk.qcow2,media=disk \ -netdev user,id=lan0,ipv6=off -device e1000,netdev=lan0,mac=52:54:98:13:64:01 -nographic -vga none -serial mon:stdio -nodefaults
The disk image can be obtained from FreeBSD CI: https://artifact.ci.freebsd.org/snapshot/14.0-CURRENT/latest/powerpc/powerpc64/disk.qcow2.zst.
Then only the kernel would have to be replaced, by the one to be tested.
Just note that CI image expects virtio disks instead of vscsi ones, so, even the default image stops at mountroot, but in this case the disks are detected and root can be found at ufs:da0s3.
BTW, virtio driver works fine even with this patch (it can be tested by adding ",if=virtio" to QEMU's -drive argument).
That I cannot dispute, so seems I have to let things be exactly the way they were before despite really not liking that.
Sparse interrupt numbering? That certainly explains why the PowerPC interrupt code is written the way it was.
Yes. The IBM series of hardware (pseries, powernv) use a 24-bit integer for the IRQ, which is a special routing encoding for the interrupt controller, which encodes priority, source, and other information. Not all powerpc hardware is this way, but all the newer (power8 and later? Maybe power7?) IBM hardware is.
Before these changes, ofw_bus was a simplebus that also did a bit of resource management. With the rman stuff gone, is there any difference at all between ofw_bus and simplebus?
Don't take this as me requiring that ofw_bus be removed as part of making these changes. But if the answer is that there's no difference anymore, we should think about consolidating and cleaning things up as a followup change.
I think there are 3 features INTRNG would need for an attempt at taking over as the FreeBSD interrupt framework:
- Support for interrupt numbers being assigned outside of INTRNG (x86)
- Support for allocation of interrupt ranges (x86, used by Xen code, used for MSI interrupts)
- Support for sparse interrupt numbering (PowerPC)
I suspect multiple pieces of x86 code assume interrupt numbers can be mapped to an interrupt structure quickly, thus #3 would likely mean hash table or tree. Simply pondering what would be needed for INTRNG to take over...