Page MenuHomeFreeBSD

PCI function address aliasing support
AbandonedPublic

Authored by br on Tue, Sep 8, 3:38 PM.

Details

Reviewers
andrew
kib
Group Reviewers
PCI
Summary

This encodes a 8-bit function address as a bit number in dma_aliases bit string, which is used to program a permissible function source address for DMA in an IOMMU system.

For example Marvel SATA controller uses function number 0 for non-data transfers, and 1 for DMA. In this case both has to be programmed in IOMMU to allow access.

Test Plan

Tested on Neoverse N1 (N1SDP) with Marvel SATA.

Diff Detail

Lint
Lint Skipped
Unit
Unit Tests Skipped

Event Timeline

br requested review of this revision.Tue, Sep 8, 3:38 PM
br created this revision.

Why is it in the pci code rather then iommu? I could imagine broken non-pci devices could also need a DMA alias.

br added a comment.Thu, Sep 10, 4:24 PM

Why is it in the pci code rather then iommu? I could imagine broken non-pci devices could also need a DMA alias.

In non-pci that would be not a function alias, but simply another SID, or a set of SIDs (because only PCI splits the RID/SID to bus:device:function) . And we can't have an alias for a bus or device address, because that would be a different IOMMU domain.
This is obtained from Linux as is. In linux they convert this PCI aliases to the list of ids in the local IOMMU system (SIDs in case of SMMU). And I think this is a right approach.
So the "function address" is property of PCI only. There could be multiple sources that all together build the list of SIDs, and PCI is simply one of them.
So I think in !PCI we should have a different way of aliasing, the way how aliases constructed, etc

br added a reviewer: kib.Thu, Sep 10, 4:37 PM
kib added a comment.Thu, Sep 10, 11:46 PM

Well, this is not the right approach. I am guilty of the buswide_ctx() of course, but this one goes much further.

Proper solution is to finish my patch to support DMAR (IOMMU ?) domains, where single domain with its page table serves some number of contexts. Then multi-function devices that must share translations should be put into common domain instead of creating per-function domains.

br added a comment.Sat, Sep 12, 4:21 PM
In D26360#587123, @kib wrote:

Well, this is not the right approach. I am guilty of the buswide_ctx() of course, but this one goes much further.

Proper solution is to finish my patch to support DMAR (IOMMU ?) domains, where single domain with its page table serves some number of contexts. Then multi-function devices that must share translations should be put into common domain instead of creating per-function domains.

We don't create a per-function domain. This is just to setup a Stream Table Entry (STE) alias in the 2-level walk table.
Marvel SATA issues non-dma requests from SID 0x600, and DMA from 0x601. So SMMU simply could not find a valid STE in the position of 0x601 in the 2-level walk table.
The solution is just to copy the same STE to a required position in the walk table and everything works (this is what Linux does as well).

kib added a comment.Wed, Sep 16, 3:58 PM
In D26360#587442, @br wrote:
In D26360#587123, @kib wrote:

Well, this is not the right approach. I am guilty of the buswide_ctx() of course, but this one goes much further.

Proper solution is to finish my patch to support DMAR (IOMMU ?) domains, where single domain with its page table serves some number of contexts. Then multi-function devices that must share translations should be put into common domain instead of creating per-function domains.

We don't create a per-function domain. This is just to setup a Stream Table Entry (STE) alias in the 2-level walk table.
Marvel SATA issues non-dma requests from SID 0x600, and DMA from 0x601. So SMMU simply could not find a valid STE in the position of 0x601 in the 2-level walk table.
The solution is just to copy the same STE to a required position in the walk table and everything works (this is what Linux does as well).

In essence, you just reworded what I wrote.

Let me be more explicit: I believe such cases should be handled by creating domains which contain more than one context.

br added a comment.Thu, Sep 17, 11:06 AM
In D26360#588498, @kib wrote:
In D26360#587442, @br wrote:
In D26360#587123, @kib wrote:

Well, this is not the right approach. I am guilty of the buswide_ctx() of course, but this one goes much further.

Proper solution is to finish my patch to support DMAR (IOMMU ?) domains, where single domain with its page table serves some number of contexts. Then multi-function devices that must share translations should be put into common domain instead of creating per-function domains.

We don't create a per-function domain. This is just to setup a Stream Table Entry (STE) alias in the 2-level walk table.
Marvel SATA issues non-dma requests from SID 0x600, and DMA from 0x601. So SMMU simply could not find a valid STE in the position of 0x601 in the 2-level walk table.
The solution is just to copy the same STE to a required position in the walk table and everything works (this is what Linux does as well).

In essence, you just reworded what I wrote.

Let me be more explicit: I believe such cases should be handled by creating domains which contain more than one context.

Do you mean that each function alias should be in a separate iommu context?

kib added a comment.Thu, Sep 17, 5:16 PM
In D26360#588657, @br wrote:
In D26360#588498, @kib wrote:
In D26360#587442, @br wrote:
In D26360#587123, @kib wrote:

Well, this is not the right approach. I am guilty of the buswide_ctx() of course, but this one goes much further.

Proper solution is to finish my patch to support DMAR (IOMMU ?) domains, where single domain with its page table serves some number of contexts. Then multi-function devices that must share translations should be put into common domain instead of creating per-function domains.

We don't create a per-function domain. This is just to setup a Stream Table Entry (STE) alias in the 2-level walk table.
Marvel SATA issues non-dma requests from SID 0x600, and DMA from 0x601. So SMMU simply could not find a valid STE in the position of 0x601 in the 2-level walk table.
The solution is just to copy the same STE to a required position in the walk table and everything works (this is what Linux does as well).

In essence, you just reworded what I wrote.

Let me be more explicit: I believe such cases should be handled by creating domains which contain more than one context.

Do you mean that each function alias should be in a separate iommu context?

There are no aliases in PCIe spec or DMAR. Yes, DMAR spec makes 1:1 correspondence between context and RID. Domain can group several contexts that share page tables, making all RIDs from domain using identical translations. I believe AMD IOMMU is same.

br added a comment.Fri, Sep 18, 11:33 AM
In D26360#588838, @kib wrote:
In D26360#588657, @br wrote:
In D26360#588498, @kib wrote:
In D26360#587442, @br wrote:
In D26360#587123, @kib wrote:

Well, this is not the right approach. I am guilty of the buswide_ctx() of course, but this one goes much further.

Proper solution is to finish my patch to support DMAR (IOMMU ?) domains, where single domain with its page table serves some number of contexts. Then multi-function devices that must share translations should be put into common domain instead of creating per-function domains.

We don't create a per-function domain. This is just to setup a Stream Table Entry (STE) alias in the 2-level walk table.
Marvel SATA issues non-dma requests from SID 0x600, and DMA from 0x601. So SMMU simply could not find a valid STE in the position of 0x601 in the 2-level walk table.
The solution is just to copy the same STE to a required position in the walk table and everything works (this is what Linux does as well).

In essence, you just reworded what I wrote.

Let me be more explicit: I believe such cases should be handled by creating domains which contain more than one context.

Do you mean that each function alias should be in a separate iommu context?

There are no aliases in PCIe spec or DMAR. Yes, DMAR spec makes 1:1 correspondence between context and RID. Domain can group several contexts that share page tables, making all RIDs from domain using identical translations. I believe AMD IOMMU is same.

But the AHCI driver request only one tag ?
So it creates a controller-wide DMA tag, and uses it to make a child tags for data transfers/etc.
We have only one request from it to acpi_iommu_get_dma_tag().

jhb added a subscriber: jhb.Fri, Sep 18, 3:53 PM

I don't think this belongs in PCI either, but I think we need a way for a driver to get a reference to the IOMMU backing a tag it has created (bus_dma_tag_get_iommu_domain) or some such and then we need a way to add other devices to that domain. I don't think Ruslan has shared the associated Linux bug, but it was actually found on Intel with DMAR using this Marvell controller (and is just as broken when using SMMU on Arm) that initiates some DMA request from function 1. However, there is no function 1 in lspci output, which means Marvell is just setting random bits in PCI-e headers for devices that don't exist (so you can't shut them up by clearing their busmaster_en bit, you can't FLR them, you could never use them for PCI passthrough. Apparently Marvell engineers don't have a good grasp of English and/or the base PCI (not even PCI-e) specifications). This means we don't have a real device_t for the function 1. That might mean either having a 'iommu_domain_add_device' that takes a raw PCI RID, or using a quirk to create fake device_t's. The fake device_t is probably a lot more work, so having a way to add an explicit PCI RID is the simplest way to go. This could then be done as a driver quirk in achi(4) for relevant Marvell controllers.

Also, you should almost never use spin mutexes. Most spin mutexes in Linux should be MTX_DEF in FreeBSD.

Reference to the Linux bug: https://bugzilla.kernel.org/show_bug.cgi?id=42679

kib added a comment.Fri, Sep 18, 8:08 PM

Yes, I was aware of some AHCI controllers mis-handling RIDs. It is the duty of the driver to ensure that proper domain is created which includes all relevant contexts (not that we have an KPI for this right now).

But wait, why cannot you use iommu_set_buswide_ctx() there ?

br added a comment.Mon, Sep 21, 10:34 AM
In D26360#589183, @kib wrote:

Yes, I was aware of some AHCI controllers mis-handling RIDs. It is the duty of the driver to ensure that proper domain is created which includes all relevant contexts (not that we have an KPI for this right now).

But wait, why cannot you use iommu_set_buswide_ctx() there ?

How?

Here are RID table for the PCI devices on our N1SDP board:
0x800 XHCI
0x700 Realtek ethernet adapter
0x600 AHCI

So all the devices are on bus 0, but different slots (6, 7, 8).

kib added a comment.EditedWed, Sep 23, 11:39 PM
In D26360#589847, @br wrote:
In D26360#589183, @kib wrote:

Yes, I was aware of some AHCI controllers mis-handling RIDs. It is the duty of the driver to ensure that proper domain is created which includes all relevant contexts (not that we have an KPI for this right now).

But wait, why cannot you use iommu_set_buswide_ctx() there ?

How?

Here are RID table for the PCI devices on our N1SDP board:
0x800 XHCI
0x700 Realtek ethernet adapter
0x600 AHCI

So all the devices are on bus 0, but different slots (6, 7, 8).

And they are PCIe devices ?
PCIe devices must have a parent, that is switch/upstream port, to represent the link. Real PCIe is not bus, but rather a network of point-to-point links, and each link can only have one device down (or switch).

The only cases I knew, where multiple slots are filled by unrelated functionality were:

  • Intel north bridge pseudo PCIe devices used for configuration of uncore. They are not PCIe at all, just Intel decided that giving access to config registers is simpler (?) through PCIe MCFG mapping
  • Intel PCH hubs, which have a lot of stuff integrated and reported as slots on single bus.

But your devices are definitely not from some integrated chip, I hardly believe that Marvell AHCI and Realtek ethernet are put together on one IC. They must be physically separate chips with dedicated links to each. No ? Then I am very curious what it is.

Added later:
and in fact 6/7/8 are the bus numbers, not slots. Without sillyness like ARI, pci addresses have 3 bits for slots, and 5 bits for functions, which occupies total 1 byte for s:f part. So e.g. 0x600 is really 6:0:0 bus 6, slot 0, func 0.

In other words, buswide function should be enough.

br abandoned this revision.Thu, Sep 24, 2:01 PM

Thanks kib@: buswide quirk works, I have updated the main diff:
https://reviews.freebsd.org/D24618