Page MenuHomeFreeBSD

to be compatible with an IOMMU LinuxKPI should use bus_dma(9)
ClosedPublic

Authored by tychon on Apr 8 2019, 3:04 PM.
Tags
None
Referenced Files
Unknown Object (File)
Feb 23 2024, 10:28 AM
Unknown Object (File)
Feb 13 2024, 9:35 PM
Unknown Object (File)
Jan 25 2024, 6:41 AM
Unknown Object (File)
Jan 24 2024, 3:08 PM
Unknown Object (File)
Dec 20 2023, 7:11 AM
Unknown Object (File)
Dec 13 2023, 10:22 PM
Unknown Object (File)
Dec 11 2023, 3:07 PM
Unknown Object (File)
Dec 4 2023, 8:23 PM

Details

Summary

Currently the LinuxKPI uses physical addresses to create mappings for the DMA related interfaces. These physical addresses are not true DMA addresses and hence can't be used to program a device's DMA engine in conjunction with an IOMMU. To properly supply DMA addresses for use with an IOMMU, the LinuxKPI should use the bus_dma(9) interfaces to setup and teardown the mappings.

The linux API provides an implementation of allocate/free plus map/unmap large memory regions, map/unmap a single memory region, map/unmap a scatterlist, and a dma_pool interface for a cache of smaller memory regions.

For the most part the linux allocation/mapping setup translates nicely into bus_dmamap_create() and _bus_dmamap_load_phys(). The invalidation/teardown presents more of a challenge as both bus_dmamap_unload() and bus_dmamap_destroy() need to be called for a given DMA address. To find the bus_dmamap_t associated with a given DMA address a path-compressed radix trie is used with the DMA address functioning as a key. The tree is populated with entries (struct linux_dma_obj) containing the dmamap, the DMA address and in the case of a dma_pool entry the virtual address too. A UMA cache is used for the tree's entries as well as the tree’s internal nodes.

The dma_pool supplies both allocated and mapped memory regions with characteristics specified when the pool is created. For efficiency a cache-only UMA zone is used. The items in the cache contain a DMA-able memory region and a DMA pctrie entry. The zones import and release routines allocate and free the DMA-memory with bus_dmamem_alloc() and bus_dmamem_free() as well the associated DMA tree entries (struct linux_dma_obj).

Diff Detail

Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

sys/compat/linuxkpi/common/src/linux_pci.c
148

Linux errors are negative. I think this should be:
return (-error);

Could you add some comments and update summary to explain the design ?

sys/compat/linuxkpi/common/include/linux/dma-mapping.h
158

Please use or do not use {} symmetrically around both branches.

While trying to explain the design I realized that the UMA "linux_dma objects" zone doesn't need to be per-device. I can make that global.

tychon marked 2 inline comments as done.

No locking is provided internally by the path-compressed radix trie implementation. It worked shockingly well without until it didn't. Adding locking to make the LinuxKPI DMA routines MT-Safe.

sys/compat/linuxkpi/common/include/linux/dma-mapping.h
105

This looks somewhat risky. I mean, this function does not look like the proper place to allocate tag.

112

What if driver called dma_set_mask() more than once for the same device, wouldn't we leak tags ? From my reading of the text HOWTOs in Linux kernel tree, this is allowed and even recommended practice.

sys/compat/linuxkpi/common/include/linux/pci.h
222

Why is this data lives in pci_dev and not in device ?
Is there anything in the patch which prevents it working for non-pci devices ?

sys/compat/linuxkpi/common/src/linux_pci.c
136

You create the tag with NULL lockfunc. I believe it means that all loads and allocations must specify NOWAIT then, otherwise bounce busdma becomes unhappy when it really bounces and has to delay allocation into swi thread. And we cannot operate in callback anyway.

477

Perhaps assert that nseg == 1 ?

539

If sgl is de-facto physically contiguous, this load loop would create per-segment mapping regardless of the optimization possibility. Linux guide notes the optimization of gluing adjacent sg segments, so might be translate linux sg into freebsd sg and then load it instead of doing that manually ?

598

Why loading in ctor, instead at import time ?

600

Assert nseg == 1 ?

634

See other comment, WAITOK seems to not work.

Addressed a few code review comments. There is a bit more work to do. Glancing at the feedback, I forgot 'assert that nseg == 1'. Plus I've got the optimization of gluing adjacent sg segments in the works too.

tychon added inline comments.
sys/compat/linuxkpi/common/include/linux/dma-mapping.h
105

I moved things around. I also made it possible to call linux_dma_tag_init() repeatedly without leaking tags.

sys/compat/linuxkpi/common/include/linux/pci.h
222

No there actually isn't anything that prevents it working for non-pci devices. Searching the sources I couldn't find that use cause. Regardless I hoisted dma_priv to struct device from struct pci_dev to support that possibility.

sys/compat/linuxkpi/common/src/linux_pci.c
136

I'm not all that familiar with deferred loads but I'm now passing busdma_lock_mutex and a mutex (which as I write this I realized I forgot to initialize -- onto the TO-DO list) so I think this should now work.

598

If the load occurs at import time it may go into the IOMMU before it's intended and remain there longer than intended too. By doing the load in the ctor and the invalidate in the dtor the IOMMU reflects the pool allocation/free calls.

sys/compat/linuxkpi/common/include/linux/dma-mapping.h
95

This is weird formatting.

sys/compat/linuxkpi/common/src/linux_pci.c
122

Perhaps compare old and new mask, and only destroy old tag if they are non-equal ?

136

It is much more complicated, unfortunately. Waitable busdma requests require callbacks. The callbacks are called from the swi context and only in callback you get the segments prepared for loading into device' dma engine.

I am not even sure if some handshake would work, where original requester sleeps until callback wakes him up and somehow marshals the segments back. IMO it is enough to use non-sleepable allocs for now,

tychon marked 3 inline comments as done.

Address further code review feedback: use non-sleepable allocs, fix weird formatting and add KASSERT(negs == 1).

@johalun : I'd like to have you test these patches with drm-next before going upstream. Can you do that?

And just a heads-up these patches uncovered an issue with the cache-only zone destructor trying to destroy a non-existent keg. That's being worked in D19835.

sys/compat/linuxkpi/common/src/linux_pci.c
468

should high be min(mask, BUS_SPACE_MAXADDR_32BIT) ? I am not sure.

499

It is still waitable.

539

This is still not handled, right ? I suggest at least adding a comment noting this issue.

565

And this one.

Incorporate a few more review comments: add missing BUS_DMA_NOWAIT flags to _bus_dmamap_load_phys() and optimize linux_dma_map_sg_attrs() to coalesce physically contiguous scatter list entries.

kib@: thanks for your comments! I think I've addressed all the outstanding ones.

tychon added inline comments.
sys/compat/linuxkpi/common/src/linux_pci.c
468

I'm not sure either. I think GFP_DMA32 is an older flag which pre-dates the mask code.

This mimics the existing code so I'm apt to leave it alone. Minimally it shouldn't be a regression.

This revision is now accepted and ready to land.Apr 12 2019, 11:49 AM

@johalun : I'd like to have you test these patches with drm-next before going upstream. Can you do that?

I'm in the middle of relocating so it would have to be in the later part of next week.

Tested on my Haswell laptop with drm-v5.0, everything works (both with DMAR on and off), this line is new in dmesg:

vgapci0: dmar0 pci0:0:2:0 rid 10 domain 0 mgaw 48 agaw 48 re-mapped

so it seems like the GPU is IOMMU'd. (full dmesg)

Also tested on an AMD Ryzen + Vega system, no regressions. (No IOMMU there because no one wrote a dmar equivalent for AMD IOMMU…)

btw, amdgpu touches dma_mask in one place, had to do this to fix build:

--- i/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ w/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -366,7 +366,7 @@ void amdgpu_amdkfd_get_local_mem_info(struct kgd_dev *kgd,
                                      struct kfd_local_mem_info *mem_info)
 {
        struct amdgpu_device *adev = (struct amdgpu_device *)kgd;
-       uint64_t address_mask = adev->dev->dma_mask ? ~*adev->dev->dma_mask :
+       uint64_t address_mask = adev->dev->dma_priv ? ~*((uint64_t*)adev->dev->dma_priv) :
                                             ~((1ULL << 32) - 1);
        resource_size_t aper_limit = adev->gmc.aper_base + adev->gmc.aper_size;
In D19845#427477, @greg_unrelenting.technology wrote:

Tested on my Haswell laptop with drm-v5.0, everything works (both with DMAR on and off), this line is new in dmesg:

vgapci0: dmar0 pci0:0:2:0 rid 10 domain 0 mgaw 48 agaw 48 re-mapped

so it seems like the GPU is IOMMU'd. (full dmesg)

Indeed that device is using the IOMMU in re-map mode -- the most strict/stressful case. Thanks for testing that!

In D19845#427478, @greg_unrelenting.technology wrote:

Also tested on an AMD Ryzen + Vega system, no regressions. (No IOMMU there because no one wrote a dmar equivalent for AMD IOMMU…)

btw, amdgpu touches dma_mask in one place, had to do this to fix build:

--- i/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
+++ w/drivers/gpu/drm/amd/amdgpu/amdgpu_amdkfd.c
@@ -366,7 +366,7 @@ void amdgpu_amdkfd_get_local_mem_info(struct kgd_dev *kgd,
                                      struct kfd_local_mem_info *mem_info)
 {
        struct amdgpu_device *adev = (struct amdgpu_device *)kgd;
-       uint64_t address_mask = adev->dev->dma_mask ? ~*adev->dev->dma_mask :
+       uint64_t address_mask = adev->dev->dma_priv ? ~*((uint64_t*)adev->dev->dma_priv) :
                                             ~((1ULL << 32) - 1);
        resource_size_t aper_limit = adev->gmc.aper_base + adev->gmc.aper_size;

That's also another great datapoint. I don't believe the linux API has a get-mask function so I'm not sure how, even if we actually wanted to, avoid necessitating that code change.

Fix most trivial of trivial whitespace issues. I just want to avoid the tool chain complaining about any divergences so I'm updating the diff.

This revision now requires review to proceed.Apr 16 2019, 8:03 PM

When committing and MFC-ing this patch you'll need to bump the __FreeBSD_version in sys/sys/param.h, because you are changing fundamental structures and external kernel modules needs to be rebuilt.

sys/compat/linuxkpi/common/include/linux/scatterlist.h
45 ↗(On Diff #56264)

Did you forget to remove the old "length" field?

sys/compat/linuxkpi/common/src/linux_pci.c
499

This function have shared resources on the BUSDMA tag, if I'm not mistaken.

Please serialize all calls to _bus_dmamap_load_phys().

Bump __FreeBSD_version and serialize required bus_dma(9) calls.

tychon added inline comments.
sys/compat/linuxkpi/common/include/linux/scatterlist.h
45 ↗(On Diff #56264)

No, the length is used an 'input' parameter. The 'dma_' versions are the 'output'. It appears to be a weird interface.

sys/compat/linuxkpi/common/src/linux_pci.c
499

That was an interesting find! Looking at bounce and dmar, indeed several of those API functions aren't reentrant with the same bus_dma tag. Seems like many drivers get this for free -- probably quite accidentally -- while locking rings and things like that I've added prerequisite locking.

Some more i915 GPU testing (w/o the latest update here): after using Firefox (opengl layers, xwayland) for some time, GPU resets start happening

drmn0: Resetting chip for stuck wait on rcs0
drmn0: Resetting chip for stuck wait on rcs0
drmn0: Resetting chip for stuck wait on rcs0
…
DMAR0: Fault Overflow
DMAR0: vgapci0: pci:0:2:0 sid 10 fault acc 0 adt 0x0 reason 0x5 addr 2e09000
DMAR0: Fault Overflow
DMAR0: vgapci0: pci:0:2:0 sid 10 fault acc 0 adt 0x0 reason 0x5 addr 2e09000

and eventually the whole system freezes if I don't quit the compositor / switch to vt console.

In D19845#428768, @greg_unrelenting.technology wrote:

Some more i915 GPU testing (w/o the latest update here): after using Firefox (opengl layers, xwayland) for some time, GPU resets start happening

drmn0: Resetting chip for stuck wait on rcs0
drmn0: Resetting chip for stuck wait on rcs0
drmn0: Resetting chip for stuck wait on rcs0
…
DMAR0: Fault Overflow
DMAR0: vgapci0: pci:0:2:0 sid 10 fault acc 0 adt 0x0 reason 0x5 addr 2e09000
DMAR0: Fault Overflow
DMAR0: vgapci0: pci:0:2:0 sid 10 fault acc 0 adt 0x0 reason 0x5 addr 2e09000

and eventually the whole system freezes if I don't quit the compositor / switch to vt console.

Looks like a symptom of non-translatable physical address. I've encountered drivers which need additional work outside of the scope of this effort. Perhaps this is the case there as I can't any more cases in the Linux KPI where a physical address is substituted for a DMA one.

Also, I assume this is in remap mode. Does it work in identify map mode hw.busdma.default="bounce"? Unless there is an API which escaped, if it works in hw.dmar.enable="0" it's not a regression from before :-/

Looks like a symptom of non-translatable physical address. I've encountered drivers which need additional work outside of the scope of this effort. Perhaps this is the case there as I can't any more cases in the Linux KPI where a physical address is substituted for a DMA one.

Also, I assume this is in remap mode. Does it work in identify map mode hw.busdma.default="bounce"? Unless there is an API which escaped, if it works in hw.dmar.enable="0" it's not a regression from before :-/

Yeah, no regressions, just something to try to fix eventually… What does bounce do? Should that be done without remap mode?

In D19845#428768, @greg_unrelenting.technology wrote:

Some more i915 GPU testing (w/o the latest update here): after using Firefox (opengl layers, xwayland) for some time, GPU resets start happening

drmn0: Resetting chip for stuck wait on rcs0
drmn0: Resetting chip for stuck wait on rcs0
drmn0: Resetting chip for stuck wait on rcs0
…
DMAR0: Fault Overflow
DMAR0: vgapci0: pci:0:2:0 sid 10 fault acc 0 adt 0x0 reason 0x5 addr 2e09000
DMAR0: Fault Overflow
DMAR0: vgapci0: pci:0:2:0 sid 10 fault acc 0 adt 0x0 reason 0x5 addr 2e09000

and eventually the whole system freezes if I don't quit the compositor / switch to vt console.

Looks like a symptom of non-translatable physical address. I've encountered drivers which need additional work outside of the scope of this effort. Perhaps this is the case there as I can't any more cases in the Linux KPI where a physical address is substituted for a DMA one.

Also, I assume this is in remap mode. Does it work in identify map mode hw.busdma.default="bounce"? Unless there is an API which escaped, if it works in hw.dmar.enable="0" it's not a regression from before :-/

I am absolutely non-surprised that Intel GPU does not work with the DMAR enabled. First, from the time when I ported drm2, I remember quite vividly that the driver directly manages dma translations by the calls to DMAR intel driver, for many things. I expect this to not change, because standard DMA KPI is somewhat limited for the driver purposes.

Second, Intel chipsets typically have two DMAR units, one dedicated for the graphics, and another serving south bridge and everything under it. The GPU DMAR is the constant source of chipset erratas, so unless all required workarounds are enabled in the code, sometimes up the the level of disabling GPU DMAR at all, it can be problematic. For instance, I remember that for SandyBridge, RC6 was incompatible with GPU DMAR in hw.

@johalun: can you try again with the dma_lock changes?

Update: the GPU hangs are *not* caused by IOMMU remapping. Possibly a bug in drm-v5.0… I'll try the updated version of this patch with stable drm later

This revision is now accepted and ready to land.Apr 24 2019, 3:58 PM
This revision was automatically updated to reflect the committed changes.