ACPI: add support for (inherited) _DMA limits
Needs Review (Public)

Authored by greg_unrelenting.technology on Jun 10 2020, 9:23 PM.

Details

Summary

On the Raspberry Pi 4, XHCI can only DMA to the first 3 GiB of memory, and devices behind the GPU can only reach the first 1 GiB (though DWC OTG USB2.0 worked fine anyway).

_DMA in ACPI describes *valid* ranges (potentially multiple), and inheritance seems to be additive rather than restrictive (quoting ACPI 6.3 spec):

Any ranges described in the resources of a _DMA object can be used by child devices for DMA or bus master transactions

If the _DMA object is not present for a bus device, the OS assumes that any address placed on a bus by a child device will be decoded either by a device on the bus or by the bus itself, (in other words, all address ranges can be used for DMA).

Our bus_dma_tags (sadly) describe only one *in*accessible range, with a filter function option for more advanced checking, but I couldn't get that to work (with lowaddr 0 and a filter set, the system would just hang before calling filters; with lowaddr set to max, the filter would never apply).

But we haven't actually encountered any devices where that one range isn't enough. We don't have to handle all the possible complexity right now.
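
To make the "one inaccessible range" model concrete, here is a minimal user-space sketch of the check as I understand it. The struct and function names are hypothetical and this is not the kernel's code; the only assumption carried over from bus_dma is that addresses in the window (lowaddr, highaddr] are the ones a device cannot reach.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the address part of a bus_dma tag. */
struct dma_window {
	uint64_t lowaddr;	/* last address reachable without bouncing */
	uint64_t highaddr;	/* last address of the excluded window */
};

/*
 * Returns true when paddr falls inside the single excluded window
 * (lowaddr, highaddr], i.e. when a buffer at paddr would need bouncing.
 */
bool
addr_needs_bounce(const struct dma_window *w, uint64_t paddr)
{
	return (paddr > w->lowaddr && paddr <= w->highaddr);
}
```

With lowaddr set to the last byte below 3 GiB and highaddr set to the maximum address, everything from 3 GiB upward is treated as unreachable, which is all the RPi4 XHCI case needs.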

This patch adds enough support to make the RPi4 fully work (with the memory limiter off in UEFI settings). That is, inheritance by finding the first parent handle with a _DMA method, and calculation of the smallest lowaddr from the ranges.
Handles are used rather than device_get_parent because (at least on arm64) devices in ACPI0004 containers end up as direct children of acpi0.
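
A rough user-space sketch of those two steps follows. The types are hypothetical stand-ins (the real patch works on ACPI handles and _DMA resources), and taking the smallest last-valid byte over all ranges is a conservative choice made for this sketch, not necessarily what the patch computes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one valid _DMA range. */
struct dma_range {
	uint64_t base;
	uint64_t len;
};

/* Hypothetical stand-in for an ACPI namespace node. */
struct node {
	const struct node *parent;
	const struct dma_range *ranges;	/* NULL when the node has no _DMA */
	size_t nranges;
};

/* Walk up from n's parent and return the first node that has _DMA ranges. */
const struct node *
find_dma_ancestor(const struct node *n)
{
	for (n = n->parent; n != NULL; n = n->parent)
		if (n->ranges != NULL)
			return (n);
	return (NULL);
}

/* Smallest last-valid address over all ranges; UINT64_MAX means no limit. */
uint64_t
lowaddr_from_ranges(const struct dma_range *r, size_t n)
{
	uint64_t low = UINT64_MAX;

	for (size_t i = 0; i < n; i++) {
		if (r[i].len == 0)
			continue;	/* an empty range constrains nothing */
		uint64_t last = r[i].base + r[i].len - 1;
		if (last < low)
			low = last;
	}
	return (low);
}
```

For a single range covering the first 3 GiB this gives 0xbfffffff. If no ancestor has a _DMA object, the limit stays at UINT64_MAX, matching the spec's "all address ranges can be used" default.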


NetBSD seems to have DMA tags that describe multiple ranges, and both device-side and CPU-side bases.

Test Plan

Works on my RPi4: UEFI build v1.14, memory limit off in settings, FreeBSD 12.1 (because CURRENT loader.efi doesn't seem to work.. >_<) mini memstick image on a USB stick, CURRENT kernel with this + D25201 to make XHCI attach at all + D25203 for a device to test the behind-GPU restriction too + all my other patches lol but that's not relevant.

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

I've tested head -r363932 under uefi/ACPI v1.20 with the 3072 MiByte limit disabled, and it failed the huge-file duplicate-and-diff/cmp test:

# cp -aRx /usr/obj/clang-armv7-on-aarch64.tar /usr/obj/clang-armv7-on-aarch64.alt_tar
# diff /usr/obj/clang-armv7-on-aarch64.tar /usr/obj/clang-armv7-on-aarch64.alt_tar
Binary files /usr/obj/clang-armv7-on-aarch64.tar and /usr/obj/clang-armv7-on-aarch64.alt_tar differ
# cmp -l /usr/obj/clang-armv7-on-aarch64.tar /usr/obj/clang-armv7-on-aarch64.alt_tar | head -30
2633269249   3   0
2633269251   3   0
2633269252  55   0
2633269253   6   0
2633269254  21   0
2633269255 227   0
2633269256   1   0
2633269257 135   0
2633269258 336   0
2633269259  22 140
2633269260   0 100
2633269261 346   0
2633269262 353   0
2633269265 227   0
2633269266   1 160
2633269267 170 140
2633269268 336 100
2633269269  22   0
2633269271 362   0
2633269272 353   0
2633269275 227   0
2633269276   1   0
2633269277 225   0
2633269278 336   0
2633269279  22   0
2633269281 376   0
2633269282 353   1
2633269285   0   1
2633269289   0 223
2633269290   0 321

For reference:

# ls -ldT /usr/obj/clang-armv7-on-aarch64*
-rw-r--r--  1 root  wheel  11570948096 Jul 18 18:32:37 2020 /usr/obj/clang-armv7-on-aarch64.alt_tar
-rw-r--r--  1 root  wheel  11570948096 Jul 18 18:32:37 2020 /usr/obj/clang-armv7-on-aarch64.tar

(So the over 10 GiByte original file is significantly larger than the 8 GiByte RAM, although I've had large but much smaller files fail as well.)

I figured I'd gather some more evidence by putting back the 3072 MiByte limit and diff'ing/cmp'ing the above files and then making another duplicate and diff'ing it.

# diff /usr/obj/clang-armv7-on-aarch64.tar /usr/obj/clang-armv7-on-aarch64.alt_tar 
Binary files /usr/obj/clang-armv7-on-aarch64.tar and /usr/obj/clang-armv7-on-aarch64.alt_tar differ
# cmp -l /usr/obj/clang-armv7-on-aarch64.tar /usr/obj/clang-armv7-on-aarch64.alt_tar | head -30
2633269249   3   0
2633269251   3   0
2633269252  55   0
2633269253   6   0
2633269254  21   0
2633269255 227   0
2633269256   1   0
2633269257 135   0
2633269258 336   0
2633269259  22 140
2633269260   0 100
2633269261 346   0
2633269262 353   0
2633269265 227   0
2633269266   1 160
2633269267 170 140
2633269268 336 100
2633269269  22   0
2633269271 362   0
2633269272 353   0
2633269275 227   0
2633269276   1   0
2633269277 225   0
2633269278 336   0
2633269279  22   0
2633269281 376   0
2633269282 353   1
2633269285   0   1
2633269289   0 223
2633269290   0 321

So the copy made without the 3072 MiByte limit appears to be corrupt as written: it looks like the error is not just at diff/cmp time.

Making and testing a .alt2_tar copy with the 3072 MiByte limitation imposed:

# cp -aRx /usr/obj/clang-armv7-on-aarch64.tar /usr/obj/clang-armv7-on-aarch64.alt2_tar
# diff /usr/obj/clang-armv7-on-aarch64.tar /usr/obj/clang-armv7-on-aarch64.alt2_tar 
#

So, with the 3072 MiByte limit: no evidence of a problem with duplicating the huge file and diff'ing the result.

I should note that this has been reported on the lists since 2020-Jun-21, with the patch applied. All examples fit the structure shown above, but where in the file the differences show up changes from test to test, as does the amount that differs. Normally multiple separate subranges of pages are bad: the copy does not stay bad once it first goes bad. I've not seen the problem on NetBSD or Linux.

The problem was originally noticed with small files: buildworld generated corrupted content in them when more than 3072 MiByte of RAM was in use, and later stages failed because of the corruption. The huge-file tests are just a simpler and more reliable way to get good/bad evidence. Large or huge files are not essential to the problem.

sys/dev/acpica/acpi.c
528

Note: from a RPi4B:

/usr/include/machine/bus.h:#define BUS_SPACE_MAXSIZE 0xFFFFFFFFFFFFFFFFUL

bus_dma(9) reports for maxsize and maxsegsz:

maxsize: Maximum size, in bytes, of the sum of all segment lengths
maxsegsz: Maximum size, in bytes, of a segment in any DMA mapped region.

The above code reads to me like the bus_dma_tag_create call allows for segments too large to fit the context (maxsegsz should be no larger than limits.lowaddr) and a sum of all segments max size with the same problem (maxsize should be no larger than limits.lowaddr).

sys/dev/acpica/acpi.c
528

should be no larger than limits.lowaddr

I thought the code that uses the tags would be smart enough to take the lowest of the limits into account :)

sys/dev/acpica/acpi.c
528

Did I miss bus_dma(9)'s documentation of such implicitly applied constraints relative to the 4th argument? I see maxsegsz vs. boundary constraints in the DMA implementation code, but none relative to the 4th argument to bus_dma_tag_create, nor any for maxsize relative to it. Maxmem does not seem to be based on lowaddr either, so constraints relative to Maxmem do not track the issue in question.

But the code is not familiar to me and I may have missed something. If the intent is for maxsegsz and maxsize to be implicitly constrained by other figures from the call, I think that bus_dma(9) should document that. (Not that this acpi.c update is the right place for doing so, or that you would be the right person for it.)

sys/dev/acpica/acpi.c
528

I adjusted part of your patch to use:

+       if (bus_dma_tag_create(NULL, 1, 0,
+               limits.lowaddr, BUS_SPACE_MAXADDR, NULL, NULL,
+               limits.lowaddr, BUS_SPACE_UNRESTRICTED, limits.lowaddr,
+               coherent ? BUS_DMA_COHERENT : 0, NULL, NULL,
+               result) != 0)
+               return (ENOMEM);

Result: The huge-file duplicate-and-diff tests still fail when > 3072 MiByte of RAM is enabled. So what I questioned is irrelevant to the test failures. Sorry for the noise.

Maybe the following notes will be useful to someone with an appropriate background . . .

The RPi4B has 3 different types of DMA engines: DMA (1 GiByte range), DMA LITE (only 65536 Byte maxsegsz), and DMA4 (the larger address range, but still 1 GiByte maxsegsz). They have different capabilities and some differing register encodings, such as for CB and next CB addresses (>>5 shift for DMA4). So far I've failed to find how the RPi4B ACPI context picks which DMA engines are used when, much less whether it deals with these issues at all. The notes below are based in part on DTB context material that I figured out to some degree (instead of ACPI material).

rpi_DATA_2711_1p0.pdf reports that soc/0-10 have 2 DMA engine types (0-6 vs. 7-10 as it turns out) as well as the scb/DMA4 engines (11-14):

QUOTE (with omitted material marked by ". . .")

. . .
The BCM2711 DMA Controller provides a total of 16 DMA channels. Four of these are DMA Lite channels (with reduced performance and features), and four of them are DMA4 channels (with increased performance and a wider address range).
. . .
4.5. DMA LITE Engines

Several of the DMA engines are of the LITE design. This is a reduced specification engine designed to save space. The engine behaves in the same way as a normal DMA engine except for the following differences:
. . .
	• The DMA length register is now 16 bits, limiting the maximum transferable length to 65536 bytes.
. . .
4.6. DMA4 Engines

Several of the DMA engines are of the DMA4 design. These have higher performance due to their uncoupled read/write design and can access up to 40 address bits. Unlike the other DMA engines they are also capable of performing write bursts. Note that they directly access the full 35-bit address bus of the BCM2711 and so bypass the paging registers of the DMA and DMA Lite engines.

DMA channel 11 is additionally able to access the PCIe interface.

The register map indicates (with some extra notes added):

0-6: DMA
7-10: DMA LITE (65536 bytes limit, for example)
11-14: DMA4 (11 is special relative to "PCIe interface")
("DMA Channel 15 is exclusively used by the VPU.")

rpi_DATA_2711_1p0.pdf also reports (I ignore 2D DMA transfer mode here):

For DMA engines 0-6: XLENGTH occupies bits 29:0; bits 31:30 are write as 0, read as do not care. That would put the matching maxsegsz at 2**30 == 1,073,741,824, which matches a 1 GiByte space.

For DMA LITE engines 7-10: XLENGTH occupies bits 15:0; bits 31:16 are write as 0, read as do not care. That would put the matching maxsegsz at 2**16 == 65,536.

For DMA4 engines 11-14: XLENGTH occupies bits 29:0; bits 31:30 are write as 0, read as do not care. That would put the matching maxsegsz at 2**30 == 1,073,741,824, which is smaller than the 3 GiByte space associated with xHCI. DMA4 also encodes CB and next CB addresses via a >>5 shift (to get the upper bits of the 40 address bits and drop some bits that had to be zero anyway).
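
The arithmetic in the three notes above can be written out as a small sketch. The function names are made up for illustration. It follows the figures used above in treating an n-bit XLENGTH field as a 2**n byte limit (strictly, an n-bit field holds at most 2**n - 1), and it assumes control-block addresses have their low five bits clear, as the ">>5" encoding implies.

```c
#include <stdint.h>

/*
 * Length limit implied by an XLENGTH field that is "bits" wide,
 * using the 2**bits figure quoted above. bits must be below 64.
 */
uint64_t
xlength_limit(unsigned bits)
{
	return (UINT64_C(1) << bits);
}

/*
 * DMA4-style control-block address encoding: drop the five low bits,
 * which are assumed to be zero, so a wider address fits in 32 bits.
 */
uint32_t
dma4_encode_cb(uint64_t cb_addr)
{
	return ((uint32_t)(cb_addr >> 5));
}
```

Here xlength_limit(30) is 1,073,741,824 and xlength_limit(16) is 65,536, the two maxsegsz candidates discussed above.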

The DTB reported by fdt print / in the u-boot I used shows only DMA4 as being separate:

. . .
       soc {
               dma@7e007000 {
                       compatible = "brcm,bcm2835-dma";
                       reg = <0x7e007000 0x00000b00>;
                       interrupts = * 0x0000000007ef645c [0x00000084];
                       interrupt-names = "dma0", "dma1", "dma2", "dma3", "dma4", "dma5", "dma6", "dma7", "dma8", "dma9", "dma10";
                       #dma-cells = <0x00000001>;
                       brcm,dma-channel-mask = <0x000001f5>;
                       phandle = <0x0000000b>;
               };

       scb {
. . .
               dma@7e007b00 {
                       compatible = "brcm,bcm2711-dma";
                       reg = <0x00000000 0x7e007b00 0x00000000 0x00000400>;
                       interrupts = <0x00000000 0x00000059 0x00000004 0x00000000 0x0000005a 0x00000004 0x00000000 0x0000005b 0x00000004 0x00000000 0x0000005c 0x00000004>;
                       interrupt-names = "dma11", "dma12", "dma13", "dma14";
                       #dma-cells = <0x00000001>;
                       brcm,dma-channel-mask = <0x00007000>;
                       phandle = <0x0000003d>;
               };
. . .

I do not see anything in the DTB that makes the DMA (0-6) vs. DMA LITE (7-10) distinction explicit.
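
For reading the brcm,dma-channel-mask values in the DTB excerpt, a small decoding sketch may help. It assumes that bit n of the mask corresponds to DMA channel n; that is my reading of the property, not something the excerpt itself states, and the function name is made up.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Store the channel numbers whose bits are set in mask into out[]
 * and return how many were stored (at most max).
 */
size_t
mask_to_channels(uint32_t mask, unsigned *out, size_t max)
{
	size_t n = 0;

	for (unsigned ch = 0; ch < 32 && n < max; ch++)
		if ((mask >> ch) & 1)
			out[n++] = ch;
	return (n);
}
```

Under that assumption, 0x000001f5 selects channels 0, 2, 4, 5, 6, 7 and 8, and 0x00007000 selects channels 12, 13 and 14.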

sys/dev/acpica/acpi.c
527

Based on sysctl hw.busdma output and testing a change for the u-boot/DTB/fdt code context that worked, I've tried using limits.lowaddr-1 here instead.

So far the installed build now passes my huge-file duplicate-and-diff tests when booted via uefi/ACPI v1.20.

What sysctl showed me was the likes of (before changes that led to the lack of zone2 for u-boot/dtb/fdt):

. . .
hw.busdma.zone2.lowaddr: 0x3c000fff
. . .
hw.busdma.zone1.lowaddr: 0x3fffffff
. . .
hw.busdma.zone0.lowaddr: 0xffffffff
. . .

So I've guessed that lowaddr should identify the last page of the may-use region, not the first must-not-use page. If I've guessed wrong, the "-1" at most causes one page to be bounced that need not be. But if I've guessed correctly, then without the "-1" a page that should be bounced might not be. Thus the "-1" addition.

For reference, after the first duplicate-and-diff test:

# sysctl hw.busdma
hw.busdma.zone0.alignment: 4096
hw.busdma.zone0.lowaddr: 0xbfffffff
hw.busdma.zone0.total_deferred: 0
hw.busdma.zone0.total_bounced: 762568
hw.busdma.zone0.active_bpages: 12
hw.busdma.zone0.reserved_bpages: 0
hw.busdma.zone0.free_bpages: 824
hw.busdma.zone0.total_bpages: 836
hw.busdma.total_bpages: 836

sys/dev/acpica/acpi.c
527

FYI: sys/arm64/arm64/busdma_machdep.c 's common_bus_dma_tag_create has:

common->lowaddr = trunc_page((vm_paddr_t)lowaddr) + (PAGE_SIZE - 1);
common->highaddr = trunc_page((vm_paddr_t)highaddr) + (PAGE_SIZE - 1);

and so forces reference to the last byte of the page identified by lowaddr (and highaddr). Thus lowaddr needs to identify the last page that can avoid bouncing (or earlier).
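
The effect of that rounding on a 3 GiB boundary can be checked with a small stand-alone sketch. The macro and function names are made up, and a 4 KiB page size is assumed (consistent with the alignment of 4096 in the sysctl output).

```c
#include <stdint.h>

#define SK_PAGE_SIZE	UINT64_C(4096)	/* assumed 4 KiB pages */

/*
 * Same arithmetic as trunc_page(lowaddr) + (PAGE_SIZE - 1): round down
 * to the start of the page, then point at the last byte of that page.
 */
uint64_t
effective_lowaddr(uint64_t lowaddr)
{
	return ((lowaddr & ~(SK_PAGE_SIZE - 1)) + (SK_PAGE_SIZE - 1));
}
```

Passing 0xc0000000 gives 0xc0000fff, so the first page at 3 GiB is treated as reachable. Passing 0xc0000000 - 1 gives 0xbfffffff, which is the zone0 lowaddr in the sysctl output above. The same arithmetic turns 0x3c000000 into the 0x3c000fff seen earlier for zone2.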

arm64, arm, powerpc, riscv, mips, and x86 all use the trunc_page and PAGE_SIZE-1 computation. I do not remember a hint of that from the documentation for bus_dma_tag_create.

It looks to me like limits.lowaddr is still one too large, and so identifies the start of the wrong page instead of the end of the correct page by the code's overall criteria. As I remember, I still use limits.lowaddr-1 where I originally indicated so that huge copy-then-diff operations work on the RPi4B. (Though I've not been directly testing that for some time. It has been almost a year since I added the "-1" notes.)