
Fix regression issues after r346645 in the LinuxKPI
AbandonedPublic

Authored by hselasky on Apr 29 2019, 4:16 PM.

Details

Reviewers
slavash
johalun
zeising
tychon
kib
Group Reviewers
x11
Summary

Fix regression issues after r346645 in the LinuxKPI

Allow loading the same DMA address multiple times without any prior unload.

Sponsored by: Mellanox Technologies

Diff Detail

Repository
rS FreeBSD src repository - subversion

Event Timeline

There are a very large number of changes, so older changes are hidden.
sys/compat/linuxkpi/common/src/linux_pci.c
531

The compat layer should only keep the largest mapping, refcounted. The mapping is only removed when the last reference goes away.

Rebase patch and fix multiple issues:

  1. SG-list must be mapped AS-IS w/o any optimisations.
  2. sg_dma_len() must be equal to sg->length.
  3. Allow loading same address twice.
This revision now requires review to proceed. Apr 30 2019, 2:56 PM
hselasky edited the summary of this revision. (Show Details)

Ideally #2 and #3 would be discrete commits, but I see the value in having a single place to point folks to.

This revision is now accepted and ready to land. Apr 30 2019, 3:48 PM

I've been running this for about 8 hours on a mostly idle laptop with Firefox running; so far so good. I would like to give the reporters on the mailing list some time to test as well, since the hang is very random (I might have been lucky). I'll keep running the patch though.

Tested with rping loop:

for i in $(seq 1 100); do for j in $(seq 23 65535); do rping -s -C 1 -S $j; done;done

sys/compat/linuxkpi/common/src/linux_pci.c
502

Why not write

struct linux_dma_obj *obj, *old;
533

You still do not compare the length of old and new requests.

You still do not ref-count the duplicated entry.

This patch, together with r346984M and drm-current-kmod-4.16.g20190430, works fine with Firefox and a heavy-load 3D game.

Would it be prudent to decouple #2 and #3? The fix for #1 raises some interesting implementation questions, and the underlying issue may ultimately be better fixed in the driver anyway.

I've had further success stories graphics wise, and my laptop is still running, so from a graphics perspective, this is good to go.

@kib: I was thinking it would be better, in a followup commit, to add a debug knob to print the backtrace when double mappings happen or when the unmap cannot find the DMA address, and to trace this down in the ibcore code. It is apparently a bug. Right now DRM-next and IBCORE work with this patch, and I think the current behaviour of killing old mappings at the same address is fine.
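
For illustration only, a minimal sketch of what such a knob could look like. The sysctl name, the helper, and the assumption that the compat.linuxkpi node already exists are all hypothetical; this is not part of the posted patch.

	/* Hypothetical debug knob: all names are illustrative only. */
	#include <sys/param.h>
	#include <sys/systm.h>
	#include <sys/kdb.h>
	#include <sys/sysctl.h>

	SYSCTL_DECL(_compat_linuxkpi);	/* assumes the existing compat.linuxkpi node */

	static int linux_dma_trace_collisions = 0;
	SYSCTL_INT(_compat_linuxkpi, OID_AUTO, dma_trace_collisions, CTLFLAG_RWTUN,
	    &linux_dma_trace_collisions, 0,
	    "Print a backtrace when a DMA address is mapped twice");

	/*
	 * Would be called from linux_dma_map_phys() when the pctrie lookup
	 * finds an existing entry.  Needs a KDB-enabled kernel to actually
	 * print the backtrace.
	 */
	static void
	linux_dma_trace_collision(vm_paddr_t dma_addr)
	{
		if (linux_dma_trace_collisions != 0) {
			printf("linuxkpi: duplicate DMA mapping of %#jx\n",
			    (uintmax_t)dma_addr);
			kdb_backtrace();
		}
	}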

@kib: I was thinking it would be better, in a followup commit, to add a debug knob to print the backtrace when double mappings happen or when the unmap cannot find the DMA address, and to trace this down in the ibcore code. It is apparently a bug. Right now DRM-next and IBCORE work with this patch, and I think the current behaviour of killing old mappings at the same address is fine.

I disagree completely. Suppose that two RDMA clients mmap the same shared memory, and then use the same buffer for transfers.

@kib: Using the same memory location like this is not allowed, or we should not allow it.

Because scatter-gather lists are involved, we cannot assume that two memory mappings starting at the same physical address are identical just by comparing the size of the object (!)

Technically the right approach is to fail double DMA maps by returning 0 on the second attempt. Right now that will fail ibcore ....

@kib: Using the same memory location like this is not allowed, or we should not allow it.

Why?

Because scatter-gather lists are involved, we cannot assume that two memory mappings starting at the same physical address are identical just by comparing the size of the object (!)

Then we should provide a method to identify and distinguish them.

Technically the right approach is to fail double DMA maps by returning 0 on the second attempt. Right now that will fail ibcore ....

No, it is the wrong approach, and you already know this.

In D20097#433280, @kib wrote:

@kib: Using the same memory location like this is not allowed, or we should not allow it.

Why?

Because scatter-gather lists are involved, we cannot assume that two memory mappings starting at the same physical address are identical just by comparing the size of the object (!)

Then we should provide a method to identify and distinguish them.

For SG lists the tracking is clearly wrong right now. LinuxKPI should store each segment mapping in the trie, instead of only doing that for the first one.

Hm, in fact I backtrack and wonder why you need to store the scatterlist mapping in the trie at all. The trie is used to look up a mapping by its bus address. For a scatterlist, unmap is passed the original scatterlist, which contains all the information, so there is no need to store any side-band data at all. Or, if needed, additional fields can be added to the scatterlist (I realize that you want to store the dma_map there).

Technically the right approach is to fail double DMA maps by returning 0 on the second attempt. Right now that will fail ibcore ....

No, it is the wrong approach, and you already know this.

@kib

Hm, in fact I backtrack and wonder why you need to store the scatterlist mapping in the trie at all.

You are right! I can store the DMA map pointer in the scatterlist itself. I'll update the patch.
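
A minimal sketch of that idea, assuming a new "dma_map" member is added to the LinuxKPI struct scatterlist; the load/unload details are abbreviated and the updated patch may differ:

	/*
	 * Sketch only: the SG map path stores the busdma map in the first
	 * scatterlist entry, and the unmap path reads it back from there,
	 * so the pctrie is not involved for SG lists at all.
	 */

	/* in the LinuxKPI dma_map_sg() path: */
	if (bus_dmamap_create(priv->dmat, 0, &sgl->dma_map) != 0)
		return (0);
	/* ... _bus_dmamap_load_phys() each segment through sgl->dma_map ... */

	/* in the LinuxKPI dma_unmap_sg() path: */
	bus_dmamap_unload(priv->dmat, sgl->dma_map);
	bus_dmamap_destroy(priv->dmat, sgl->dma_map);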

hselasky added inline comments.
sys/compat/linuxkpi/common/src/linux_pci.c
502

I prefer one declaration per line.

533

Fixed in next version.

In D20097#433274, @kib wrote:

@kib: I was thinking it would be better, in a followup commit, to add a debug knob to print the backtrace when double mappings happen or when the unmap cannot find the DMA address, and to trace this down in the ibcore code. It is apparently a bug. Right now DRM-next and IBCORE work with this patch, and I think the current behaviour of killing old mappings at the same address is fine.

I disagree completely. Suppose that two RDMA clients mmap the same shared memory, and then use the same buffer for transfers.

I don't see how multiple mappings aren't a bug. The Linux API doesn't do any ref-counting, so there the first unmap would wipe out both. If there is sharing of an underlying resource, that needs to be coordinated at a higher level; this isn't the place. Tolerating it to provide some cover is one thing, and I'm not sure I even agree that is the right approach. IMHO fixing the driver is.

hselasky marked 2 inline comments as done.

Address concerns from kib@.

This revision now requires review to proceed. May 2 2019, 10:55 AM

Shorten some long lines. Add one more unlikely().

I don't see how multiple mappings aren't a bug. The Linux API doesn't do any ref-counting, so there the first unmap would wipe out both. If there is sharing of an underlying resource, that needs to be coordinated at a higher level; this isn't the place. Tolerating it to provide some cover is one thing, and I'm not sure I even agree that is the right approach. IMHO fixing the driver is.

Double-mappings should not be bugs, just as they are fine for our native busdma KPI. Consider:

  • for the bounce implementation without actual bouncing, bus address == phys address and nothing happens on either map or unmap
  • for the bounce implementation with actual bouncing, the second map allocates another set of bounce pages, so the bus address is different
  • for DMAR, a guest mapping is created anew for each map request, so again it is fine

One of my points is that the physical address is user-controllable in some situations, so the KPI must handle duplicates.
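
For concreteness, a sketch of what the native KPI permits (error handling elided; dmat, phys and len are assumed from the surrounding context, and the range is assumed to fit in one segment — this is an illustration, not proposed code):

	bus_dma_segment_t seg_a, seg_b;
	bus_dmamap_t map_a, map_b;
	int nseg_a = -1, nseg_b = -1;

	bus_dmamap_create(dmat, 0, &map_a);
	bus_dmamap_create(dmat, 0, &map_b);

	/* Two loads of the same physical range are legal. */
	_bus_dmamap_load_phys(dmat, map_a, phys, len, BUS_DMA_NOWAIT, &seg_a, &nseg_a);
	_bus_dmamap_load_phys(dmat, map_b, phys, len, BUS_DMA_NOWAIT, &seg_b, &nseg_b);

	/*
	 * Without bouncing, seg_a.ds_addr == seg_b.ds_addr == phys; with
	 * bouncing or DMAR each load gets its own bus address.  Each map
	 * is unloaded and destroyed independently.
	 */
	bus_dmamap_unload(dmat, map_a);
	bus_dmamap_destroy(dmat, map_a);
	bus_dmamap_unload(dmat, map_b);
	bus_dmamap_destroy(dmat, map_b);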

sys/compat/linuxkpi/common/src/linux_pci.c
485

mem == NULL

589

Assert that dma_refcount >= 1. Explicitly compare with 0.

619

I would be happier if we put some canary value into an unallocated dma_map, and asserted that we only call bus_dmamap_unload/destroy on a proper dma_map.

Overall it looks fine now; please fix the small nits I pointed out, mostly the natural assertions that can improve the code.

hselasky added inline comments.
sys/compat/linuxkpi/common/src/linux_pci.c
589

Refcounts are unsigned and start at zero, so asserting doesn't make sense.

I've renamed the variable to dma_share_count to make this clearer.

619

I'm not sure if we can cover all the cases for initializing S/G lists.

hselasky marked 2 inline comments as done.

Response to @kib's questions. No functional change.

Convert share count to a refcount and assert valid value. Suggested by @kib
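
Roughly, the refcount(9) pattern in question — a sketch only, following the member names used in this review rather than the final diff; needs <sys/refcount.h>:

	/* first load of a physical address: */
	refcount_init(&obj->dma_refcount, 1);
	LINUX_DMA_PCTRIE_INSERT(&priv->ptree, obj);

	/* duplicate load that hits in the pctrie: */
	refcount_acquire(&old->dma_refcount);

	/* unmap: tear down only when the last reference goes away */
	if (refcount_release(&obj->dma_refcount)) {
		LINUX_DMA_PCTRIE_REMOVE(&priv->ptree, obj->dma_addr);
		bus_dmamap_unload(priv->dmat, obj->dma_map);
		bus_dmamap_destroy(priv->dmat, obj->dma_map);
		uma_zfree(linux_dma_obj_zone, obj);
	}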

In D20097#433442, @kib wrote:

I don't see how multiple mappings aren't a bug. The Linux API doesn't do any ref-counting, so there the first unmap would wipe out both. If there is sharing of an underlying resource, that needs to be coordinated at a higher level; this isn't the place. Tolerating it to provide some cover is one thing, and I'm not sure I even agree that is the right approach. IMHO fixing the driver is.

Double-mappings should not be bugs, just as they are fine for our native busdma KPI. Consider:

  • for the bounce implementation without actual bouncing, bus address == phys address and nothing happens on either map or unmap
  • for the bounce implementation with actual bouncing, the second map allocates another set of bounce pages, so the bus address is different
  • for DMAR, a guest mapping is created anew for each map request, so again it is fine

One of my points is that the physical address is user-controllable in some situations, so the KPI must handle duplicates.

I can see how you may end up calling map a few times with the same address: e.g. a NIC sending the same packet to multiple hosts, or something like that, so you basically end up calling dma_map a few times with the same physical address.

For bounce-with-bounce and DMAR the DMA address will be different and you are fine. It's the bounce-without-bounce case that appears to have gotten us into trouble. Too bad you need to call _bus_dmamap_load_phys to find out you are in this case and need to ref-count to keep your dmamap allocations consistent.

In the case where the physical address equals the DMA address and the pctrie lookup succeeds, the complexity of checking lengths and swapping a new mapping for the old isn't needed. If it were needed, you'd have a nasty race where the old mapping was invalidated while something was mid-flight.

With that in mind, this should be sufficient:

	old = LINUX_DMA_PCTRIE_LOOKUP(&priv->ptree, obj->dma_addr);
	if (unlikely(old != NULL)) {
		old->dma_refcount++;
		bus_dmamap_unload(priv->dmat, obj->dma_map);
		bus_dmamap_destroy(priv->dmat, obj->dma_map);			
		DMA_PRIV_UNLOCK(priv);
		uma_zfree(linux_dma_obj_zone, obj);
		return (old->dma_addr);
	}

and avoids a cycle of LINUX_DMA_PCTRIE_REMOVE(), LINUX_DMA_PCTRIE_INSERT() along with a call to uma_zfree() while holding the DMA lock.

In the case where the physical address equals the DMA address and the pctrie lookup succeeds, the complexity of checking lengths and swapping a new mapping for the old isn't needed. If it were needed, you'd have a nasty race where the old mapping was invalidated while something was mid-flight.

With that in mind, this should be sufficient:

	old = LINUX_DMA_PCTRIE_LOOKUP(&priv->ptree, obj->dma_addr);
	if (unlikely(old != NULL)) {
		old->dma_refcount++;
		bus_dmamap_unload(priv->dmat, obj->dma_map);
		bus_dmamap_destroy(priv->dmat, obj->dma_map);			
		DMA_PRIV_UNLOCK(priv);
		uma_zfree(linux_dma_obj_zone, obj);
		return (old->dma_addr);
	}

and avoids a cycle of LINUX_DMA_PCTRIE_REMOVE(), LINUX_DMA_PCTRIE_INSERT() along with a call to uma_zfree() while holding the DMA lock.

Why do you suggest that the length check is not needed? Putting the discussion of a possible race aside, why must the lengths of two loads be the same?

sys/compat/linuxkpi/common/src/linux_pci.c
535

old->dma_refcount == 0 should be an assert.

In D20097#433503, @kib wrote:

Why do you suggest that the length check is not needed? Putting the discussion of a possible race aside, why must the lengths of two loads be the same?

It's a bit of an exaggeration to say the length doesn't matter, and multiple mappings can indeed be of different lengths. What I should say is that if the length actually did have any impact, then this code wouldn't work, or at least would have some race conditions.

Thinking about this further, the overhead of the added LINUX_DMA_PCTRIE_LOOKUP() imposed on DMAR and bounce-with-bounce, and the overhead of bus_dmamap_create(), _bus_dmamap_load_phys(), bus_dmamap_unload() and bus_dmamap_destroy() imposed on bounce-without-bounce, aren't necessary.

If you can agree with the addition of:

	int
	bus_dma_could_bounce(bus_dma_tag_t dmat)
	{
		return (dmat->common.impl != &bus_dma_bounce_impl || dmat->bounce_flags & BUS_DMA_COULD_BOUNCE);
	}

or something like that, then the remainder can be greatly simplified to something like:

	linux_dma_map_phys()
	{
	...
		if (!bus_dma_could_bounce(priv->dmat)) {
			return (phys);
		}
	…
	}
	linux_dma_unmap()
	{
		…
		if (!bus_dma_could_bounce(priv->dmat)) {
			return;
		}
		...
	}

and the point about lengths, races and maintaining reference counters goes away entirely.

What I'm not a fan of is the current approach, which does more than necessary but leaves open the possibility of not doing enough should it actually become necessary, so why have a false sense of security?

I've been running the new version of the patch for about 2 hours without any graphics issues. I'll let the computer run through the night.

In D20097#433503, @kib wrote:

Why do you suggest that the length check is not needed? Putting the discussion of a possible race aside, why must the lengths of two loads be the same?

It's a bit of an exaggeration to say the length doesn't matter, and multiple mappings can indeed be of different lengths. What I should say is that if the length actually did have any impact, then this code wouldn't work, or at least would have some race conditions.

Thinking about this further, the overhead of the added LINUX_DMA_PCTRIE_LOOKUP() imposed on DMAR and bounce-with-bounce, and the overhead of bus_dmamap_create(), _bus_dmamap_load_phys(), bus_dmamap_unload() and bus_dmamap_destroy() imposed on bounce-without-bounce, aren't necessary.

If you can agree with the addition of:

	int
	bus_dma_could_bounce(bus_dma_tag_t dmat)
	{
		return (dmat->common.impl != &bus_dma_bounce_impl || dmat->bounce_flags & BUS_DMA_COULD_BOUNCE);
	}

or something like that, then the remainder can be greatly simplified to something like:

	linux_dma_map_phys()
	{
	...
		if (!bus_dma_could_bounce(priv->dmat)) {
			return (phys);
		}
	…
	}
	linux_dma_unmap()
	{
		…
		if (!bus_dma_could_bounce(priv->dmat)) {
			return;
		}
		...
	}

and the point about lengths, races and maintaining reference counters goes away entirely.

What I'm not a fan of is the current approach, which does more than necessary but leaves open the possibility of not doing enough should it actually become necessary, so why have a false sense of security?

Even though addresses might not bounce, they might be translated. I'm not sure how common that is.

--HPS

In D20097#433503, @kib wrote:

Why do you suggest that the length check is not needed? Putting the discussion of a possible race aside, why must the lengths of two loads be the same?

It's a bit of an exaggeration to say the length doesn't matter, and multiple mappings can indeed be of different lengths. What I should say is that if the length actually did have any impact, then this code wouldn't work, or at least would have some race conditions.

Thinking about this further, the overhead of the added LINUX_DMA_PCTRIE_LOOKUP() imposed on DMAR and bounce-with-bounce, and the overhead of bus_dmamap_create(), _bus_dmamap_load_phys(), bus_dmamap_unload() and bus_dmamap_destroy() imposed on bounce-without-bounce, aren't necessary.

If you can agree with the addition of:

	int
	bus_dma_could_bounce(bus_dma_tag_t dmat)
	{
		return (dmat->common.impl != &bus_dma_bounce_impl || dmat->bounce_flags & BUS_DMA_COULD_BOUNCE);
	}

This is fine in principle, but its usage below is not correct.

or something like that, then the remainder can be greatly simplified to something like:

	linux_dma_map_phys()
	{
	...
		if (!bus_dma_could_bounce(priv->dmat)) {
			return (phys);
		}
	…
	}
	linux_dma_unmap()
	{
		…
		if (!bus_dma_could_bounce(priv->dmat)) {
			return;
		}
		...
	}

and the point about lengths, races and maintaining reference counters goes away entirely.

No, it is not, because the presence of BUS_DMA_COULD_BOUNCE does not mean that busdma always bounces. For instance, if the flag was set because the mask is limited, but the allocated pages all fall below the maxaddr, busdma_bounce.c is optimized enough to note that and avoid copying. In other words, even if bus_dma_could_bounce() returned true, that does not imply that we cannot get two loads of the same phys address with identical bus addresses.

What I'm not a fan of is the current approach, which does more than necessary but leaves open the possibility of not doing enough should it actually become necessary, so why have a false sense of security?

What do you mean by 'not enough'? Are you able to provide a (hypothetical) situation where Hans' current patch would break?

This revision is now accepted and ready to land. May 3 2019, 9:33 AM

This is still running fine from a graphics perspective.
DRM has been broken for some time now; what's needed to get this in?

This is still running fine from a graphics perspective.
DRM has been broken for some time now; what's needed to get this in?

Decouple #2 and #3 from #1. They are ready. #1 is still being discussed. It seems possible to simplify it. kib@ raised some valid points, and I'd like to respond. I need a little more time to think about it.

hselasky edited the summary of this revision. (Show Details)
hselasky edited the test plan for this revision. (Show Details)

Update patch after recent commits.

This revision now requires review to proceed. May 4 2019, 9:53 AM

Joining a few threads on this back together, below is a fleshed-out version (minus arm64) of what I am thinking. It removes some penalty (the additional LOOKUP) when DMAR is enabled and significantly reduces overhead in the bounce-without-bounce case.

diff --git a/sys/compat/linuxkpi/common/src/linux_pci.c b/sys/compat/linuxkpi/common/src/linux_pci.c
index e135298b0b9..8877d1a6956 100644
--- a/sys/compat/linuxkpi/common/src/linux_pci.c
+++ b/sys/compat/linuxkpi/common/src/linux_pci.c
@@ -502,6 +502,15 @@ linux_dma_map_phys(struct device *dev, vm_paddr_t phys, size_t len)
 
 	priv = dev->dma_priv;
 
+	/*
+	 * If the resultant mapping will be entirely 1:1 with the
+	 * physical address, short-circuit the remainder of the
+	 * bus_dma API.  This avoids tracking collisions in the pctrie
+	 * with the additional benefit of reducing overhead.
+	 */
+	if (bus_dma_id_mapped(priv->dmat, phys, len))
+		return (phys);
+
 	obj = uma_zalloc(linux_dma_obj_zone, 0);
 
 	DMA_PRIV_LOCK(priv);
diff --git a/sys/x86/include/bus_dma.h b/sys/x86/include/bus_dma.h
index f8a75682c6b..b4da787e9cf 100644
--- a/sys/x86/include/bus_dma.h
+++ b/sys/x86/include/bus_dma.h
@@ -35,6 +35,18 @@
 
 #include <x86/busdma_impl.h>
 
+/*
+ * Is DMA address 1:1 mapping of physical address
+ */
+static inline bool
+bus_dma_id_mapped(bus_dma_tag_t dmat, vm_paddr_t buf, bus_size_t buflen)
+{
+	struct bus_dma_tag_common *tc;
+
+	tc = (struct bus_dma_tag_common *)dmat;
+	return (tc->impl->id_mapped(dmat, buf, buflen));
+}
+
 /*
  * Allocate a handle for mapping from kva/uva/physical
  * address space into bus device space.
diff --git a/sys/x86/include/busdma_impl.h b/sys/x86/include/busdma_impl.h
index 7b1dbeb0341..97c335f9a79 100644
--- a/sys/x86/include/busdma_impl.h
+++ b/sys/x86/include/busdma_impl.h
@@ -62,6 +62,7 @@ struct bus_dma_impl {
 	    void *lockfuncarg, bus_dma_tag_t *dmat);
 	int (*tag_destroy)(bus_dma_tag_t dmat);
 	int (*tag_set_domain)(bus_dma_tag_t);
+	bool (*id_mapped)(bus_dma_tag_t, vm_paddr_t, bus_size_t);
 	int (*map_create)(bus_dma_tag_t dmat, int flags, bus_dmamap_t *mapp);
 	int (*map_destroy)(bus_dma_tag_t dmat, bus_dmamap_t map);
 	int (*mem_alloc)(bus_dma_tag_t dmat, void** vaddr, int flags,
diff --git a/sys/x86/iommu/busdma_dmar.c b/sys/x86/iommu/busdma_dmar.c
index 913f5b2af39..7cd68c5d26f 100644
--- a/sys/x86/iommu/busdma_dmar.c
+++ b/sys/x86/iommu/busdma_dmar.c
@@ -365,6 +365,13 @@ dmar_bus_dma_tag_destroy(bus_dma_tag_t dmat1)
 	return (error);
 }
 
+static bool
+dmar_bus_dma_id_mapped(bus_dma_tag_t dmat, vm_paddr_t buf, bus_size_t buflen)
+{
+
+	return (false);
+}
+
 static int
 dmar_bus_dmamap_create(bus_dma_tag_t dmat, int flags, bus_dmamap_t *mapp)
 {
@@ -857,6 +864,7 @@ struct bus_dma_impl bus_dma_dmar_impl = {
 	.tag_create = dmar_bus_dma_tag_create,
 	.tag_destroy = dmar_bus_dma_tag_destroy,
 	.tag_set_domain = dmar_bus_dma_tag_set_domain,
+	.id_mapped = dmar_bus_dma_id_mapped,
 	.map_create = dmar_bus_dmamap_create,
 	.map_destroy = dmar_bus_dmamap_destroy,
 	.mem_alloc = dmar_bus_dmamem_alloc,
diff --git a/sys/x86/x86/busdma_bounce.c b/sys/x86/x86/busdma_bounce.c
index edc90621edf..3d7dc7fad79 100644
--- a/sys/x86/x86/busdma_bounce.c
+++ b/sys/x86/x86/busdma_bounce.c
@@ -141,6 +141,8 @@ static int reserve_bounce_pages(bus_dma_tag_t dmat, bus_dmamap_t map,
 static bus_addr_t add_bounce_page(bus_dma_tag_t dmat, bus_dmamap_t map,
     vm_offset_t vaddr, vm_paddr_t addr1, vm_paddr_t addr2, bus_size_t size);
 static void free_bounce_page(bus_dma_tag_t dmat, struct bounce_page *bpage);
+static int _bus_dmamap_pagesneeded(bus_dma_tag_t dmat, vm_paddr_t buf,
+    bus_size_t buflen);
 static void _bus_dmamap_count_pages(bus_dma_tag_t dmat, bus_dmamap_t map,
     pmap_t pmap, void *buf, bus_size_t buflen, int flags);
 static void _bus_dmamap_count_phys(bus_dma_tag_t dmat, bus_dmamap_t map,
@@ -223,6 +225,19 @@ bounce_bus_dma_tag_create(bus_dma_tag_t parent, bus_size_t alignment,
 	return (error);
 }
 
+static bool
+bounce_bus_dma_id_mapped(bus_dma_tag_t dmat, vm_paddr_t buf, bus_size_t buflen)
+{
+	int pagesneeded;
+
+	if ((dmat->bounce_flags & BUS_DMA_COULD_BOUNCE) == 0)
+		return (true);
+
+	pagesneeded = _bus_dmamap_pagesneeded(dmat, buf, buflen);
+
+	return (pagesneeded == 0);
+}
+
 /*
  * Update the domain for the tag.  We may need to reallocate the zone and
  * bounce pages.
@@ -501,29 +516,38 @@ bounce_bus_dmamem_free(bus_dma_tag_t dmat, void *vaddr, bus_dmamap_t map)
 	    dmat->bounce_flags);
 }
 
-static void
-_bus_dmamap_count_phys(bus_dma_tag_t dmat, bus_dmamap_t map, vm_paddr_t buf,
-    bus_size_t buflen, int flags)
+static int
+_bus_dmamap_pagesneeded(bus_dma_tag_t dmat, vm_paddr_t buf, bus_size_t buflen)
 {
 	vm_paddr_t curaddr;
 	bus_size_t sgsize;
+	int pagesneeded = 0;
 
-	if (map != &nobounce_dmamap && map->pagesneeded == 0) {
-		/*
-		 * Count the number of bounce pages
-		 * needed in order to complete this transfer
-		 */
-		curaddr = buf;
-		while (buflen != 0) {
-			sgsize = MIN(buflen, dmat->common.maxsegsz);
-			if (bus_dma_run_filter(&dmat->common, curaddr)) {
-				sgsize = MIN(sgsize,
-				    PAGE_SIZE - (curaddr & PAGE_MASK));
-				map->pagesneeded++;
-			}
-			curaddr += sgsize;
-			buflen -= sgsize;
+	/*
+	 * Count the number of bounce pages needed in order to
+	 * complete this transfer
+	 */
+	curaddr = buf;
+	while (buflen != 0) {
+		sgsize = MIN(buflen, dmat->common.maxsegsz);
+		if (bus_dma_run_filter(&dmat->common, curaddr)) {
+			sgsize = MIN(sgsize,
+			    PAGE_SIZE - (curaddr & PAGE_MASK));
+			pagesneeded++;
 		}
+		curaddr += sgsize;
+		buflen -= sgsize;
+	}
+
+	return (pagesneeded);
+}
+
+static void
+_bus_dmamap_count_phys(bus_dma_tag_t dmat, bus_dmamap_t map, vm_paddr_t buf,
+    bus_size_t buflen, int flags)
+{
+	if (map != &nobounce_dmamap && map->pagesneeded == 0) {
+		map->pagesneeded = _bus_dmamap_pagesneeded(dmat, buf, buflen);
 		CTR1(KTR_BUSDMA, "pagesneeded= %d\n", map->pagesneeded);
 	}
 }
@@ -1305,6 +1329,7 @@ struct bus_dma_impl bus_dma_bounce_impl = {
 	.tag_create = bounce_bus_dma_tag_create,
 	.tag_destroy = bounce_bus_dma_tag_destroy,
 	.tag_set_domain = bounce_bus_dma_tag_set_domain,
+	.id_mapped = bounce_bus_dma_id_mapped,
 	.map_create = bounce_bus_dmamap_create,
 	.map_destroy = bounce_bus_dmamap_destroy,
 	.mem_alloc = bounce_bus_dmamem_alloc,
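
Not part of the diff above, but for completeness, a sketch of how the unmap side might mirror the same short-circuit. The function body and locking are assumptions based on the snippets quoted in this review; the real handling belongs to the follow-up change.

	void
	linux_dma_unmap(struct device *dev, dma_addr_t dma_addr, size_t len)
	{
		struct linux_dma_priv *priv;
		struct linux_dma_obj *obj;

		priv = dev->dma_priv;

		/*
		 * An identity-mapped address was returned directly by
		 * linux_dma_map_phys() and never inserted into the pctrie,
		 * so there is nothing to look up or free.
		 */
		if (bus_dma_id_mapped(priv->dmat, dma_addr, len))
			return;

		DMA_PRIV_LOCK(priv);
		obj = LINUX_DMA_PCTRIE_LOOKUP(&priv->ptree, dma_addr);
		if (obj == NULL) {
			DMA_PRIV_UNLOCK(priv);
			return;
		}
		LINUX_DMA_PCTRIE_REMOVE(&priv->ptree, dma_addr);
		bus_dmamap_unload(priv->dmat, obj->dma_map);
		bus_dmamap_destroy(priv->dmat, obj->dma_map);
		DMA_PRIV_UNLOCK(priv);

		uma_zfree(linux_dma_obj_zone, obj);
	}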

Joining a few threads on this back together, below is a fleshed-out version (minus arm64) of what I am thinking. It removes some penalty (the additional LOOKUP) when DMAR is enabled and significantly reduces overhead in the bounce-without-bounce case.

Could you create a new review and put the patch there, please?

In D20097#434692, @kib wrote:

Joining a few threads on this back together, below is a fleshed-out version (minus arm64) of what I am thinking. It removes some penalty (the additional LOOKUP) when DMAR is enabled and significantly reduces overhead in the bounce-without-bounce case.

Could you create a new review and put the patch there, please?

+1

And we will take it for a spin to catch any RoCE (RDMA) regression.

What is remaining in this review?

What is remaining in this review?

This review has been superseded by D20181.