
x86/dma_bounce: rework _bus_dmamap_load_ma implementation
ClosedPublic

Authored by royger on Oct 3 2014, 2:47 PM.

Details

Summary

The implementation of bus_dmamap_load_ma_triv currently calls
_bus_dmamap_load_phys on each page that is part of the passed-in buffer.
Since each page is treated as an individual buffer, the resulting behaviour
differs from that of _bus_dmamap_load_buffer. This breaks
certain drivers, like Xen blkfront.

As an example, if an unmapped buffer of size 4096 that starts at offset
13 into the first page is passed to the current _bus_dmamap_load_ma
implementation (so the ma array contains two pages), the result is that two
segments are created, one with a size of 4083 and the other with a size of 13
(because two independent calls to _bus_dmamap_load_phys are performed, one
for each physical page). If the same is done with a mapped buffer by
calling _bus_dmamap_load_buffer, the result is that only one segment is
created, with a size of 4096.
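
To make the arithmetic of that example concrete, here is a small standalone sketch (purely illustrative userland C, not the kernel code; it only models the per-page splitting, not the real _bus_dmamap_load_phys logic):

#include <stdio.h>

#define PAGE_SIZE 4096

/*
 * Model of the current per-page behaviour: every page of the ma array is
 * handed to _bus_dmamap_load_phys as if it were an independent buffer, so
 * a segment can never cross a page boundary of the original buffer.
 */
static void
split_per_page(size_t buflen, size_t first_page_offset)
{
    size_t off = first_page_offset;

    while (buflen > 0) {
        size_t seglen = PAGE_SIZE - off;  /* room left in this page */
        if (seglen > buflen)
            seglen = buflen;
        printf("segment of %zu bytes\n", seglen);
        buflen -= seglen;
        off = 0;                          /* later pages start at offset 0 */
    }
}

int
main(void)
{
    /* 4096-byte unmapped buffer starting at offset 13 into the first page. */
    split_per_page(4096, 13);             /* prints 4083, then 13 */
    return 0;
}

With the proposed behaviour, the same input would produce a single 4096-byte segment, matching _bus_dmamap_load_buffer.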

This patch removes the usage of bus_dmamap_load_ma_triv in the x86 bounce buffer
code and implements _bus_dmamap_load_ma so that its behaviour is the same
as that of the mapped version (_bus_dmamap_load_buffer). This patch only modifies
the x86 bounce buffer code, but IMHO it should be ported to all other arches
for consistency.

Diff Detail

Repository
rS FreeBSD src repository - subversion

Event Timeline

royger retitled this revision to x86/dma_bounce: rework _bus_dmamap_load_ma implementation.
royger updated this object.
royger edited the test plan for this revision. (Show Details)
royger added reviewers: gibbs, kib.

Is my reading of your description correct that the problem is due to unaligned buffers passed from some usermode app?

This was one of the stumbling blocks when I wrote the DMAR busdma backend as well (another was with action when filter declares that the passed address is bad). IMO, it is not correct or useful to re-align the arbitrary buffer by bouncing; the original offset should be kept. It is not only my opinion, note the BUS_DMA_KEEP_PG_OFFSET option added by hps for USB controllers.

Also, the existing physio consumers, e.g. newfs or fsck_ffs, pass correctly aligned buffers.

I cannot insist, but my opinion on the issue is that physio() should return EINVAL for unaligned buffers instead.

Yes, the problem is because usermode apps pass unaligned buffers.

I've created a patch that rejects buffers not aligned to sector size, see:

https://people.freebsd.org/~royger/prevent_unaligned_buffers.patch

Unfortunately this breaks newfs at least, because the do_sbwrite function in sbin/newfs/mkfs.c tries to write an unaligned buffer to disk (and, what's worse, it doesn't check the return value for errors, so newfs still thinks everything went all right). IMHO, even if we fix everything in src to pass correctly aligned buffers, such a change could cause massive fallout in third-party apps. On the other hand, if we impose this restriction (aligned buffers only), it could simplify the bounce code.

Disallowing unaligned buffers from userland would be a huge POLA violation. If we want to go down this route, I would be okay with detecting unaligned buffers upon entry to the kernel (in the syscall handlers or a common top-level API all of these syscalls invoke) and thunking them there. We'd want to track these alignment faults and make them visible via sysctl and tools like ktrace.

sys/x86/x86/busdma_bounce.c
82–85 ↗(On Diff #1874)

These need better variable names. Perhaps just make it an array of two SG elements.

569–570 ↗(On Diff #1874)

Don't we consume a page entry on every iteration of this loop? Seems like page_index should be incremented unconditionally.

770–842 ↗(On Diff #1874)

This is a lot of code duplication to make this work. I think we need to use a different paradigm in order to make this easier to code/use. For example, if the "buffer to bounce" were described always by a memdesc (sys/memdesc.h) then each bounce entry would just need to record the starting offset into the memdesc that matches that bounce entry. Routines for iterating across a memdesc, splitting them, copying into/out of them would then be available across architectures.
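
Purely as an illustration of that idea (the structure and field names below are made up for this sketch, not taken from sys/memdesc.h or busdma_bounce.c), a bounce entry might then only need to record:

#include <stddef.h>

struct memdesc;                      /* opaque description of the original buffer */

/* Hypothetical bounce entry under the memdesc-based scheme. */
struct bounce_entry_sketch {
    struct memdesc *md;              /* which buffer this bounce page belongs to */
    size_t          offset;          /* starting offset into that memdesc */
    size_t          length;          /* bytes covered by this bounce page */
    void           *bounce_kva;      /* the bounce page itself */
};

Generic routines for iterating across a memdesc, splitting it, and copying into/out of it would then do the rest, as described above.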

I only bring this up because I had to recently make some changes to ZFS which required the bus dma code to understand virtual SG lists. So far, I've only updated the drivers needed by Spectra's product. But this was extremely painful/error prone. GEOM, scsi_da/ada, md, nvme, and probably others I've missed, all do this in slightly different ways. It's time to just write this stuff once and convert everyone to use it.

Thanks for the review, please see the replies inline.

sys/x86/x86/busdma_bounce.c
82–85 ↗(On Diff #1874)

Right, making it a two element array of bus_dma_segment_t looks better.

569–570 ↗(On Diff #1874)

Although this should be the most common case, I don't think we should take it for granted. If, for example, maxsegsz is 512 bytes, we could iterate several times over the same input page.

770–842 ↗(On Diff #1874)

I'm not sure I'm following you here, mainly because I don't have that much experience with storage drivers. AFAICT you suggest having a single bounce_bus_dmamap_load that takes a memdesc as an argument, so it would replace all the bounce_bus_dmamap_load_* functions out there?

I agree that a lot of code can be shared between the current implementation of bounce_bus_dmamap_load_buffer and the proposed implementation of bounce_bus_dmamap_load_ma, and I wouldn't mind unifying them.

Regarding the second part, don't we already have bus_dmamap_load_mem to deal with virtual SG lists in the busdma code?

royger edited edge metadata.

Hello,

In this version I've done some major rework of the busdma API. I've tried to
glue together all the _bus_dmamap_load_* functions into a single one that
gets passed a memdesc.

This memdesc can only be of type MEMDESC_VLIST or MEMDESC_PLIST, so
everything is translated in the generic MI code (subr_bus_dma.c), and
arch-dependent handlers only have to deal with these two types.

This is an early prototype that only works on amd64 (and breaks DMAR
and other arches); I've tested it successfully on my boxes so far.

Do you have a newer version you'd like me to test as well?

Changes:

  • Switch back to the original approach. While I agree that using memdesc would probably allow for a better and cleaner implementation, I don't think I will have time to implement it in the near future, and unmapped I/O is quite important for blkfront performance when running on VMs with a high number of vCPUs.
  • Rebased the patch on top of HEAD.
  • Use bus_dmamap_load_ma_triv to bounce physical pages if BUS_DMA_KEEP_PG_OFFSET is specified. This is needed in order to keep the offset.
  • Use an array to store the original pages.

@robak: if you want to try it, I've created a branch on my git repo that
contains this patch, plus a patch that enables unmapped io for blkfront:

http://xenbits.xen.org/gitweb/?p=people/royger/freebsd.git;a=shortlog;h=refs/heads/unmapped_io

IIRC you had some builders running on Xen; could you try to measure the
performance difference with and without this patch applied?

Thanks, Roger.

kib edited edge metadata.
kib added inline comments.
sys/x86/x86/busdma_bounce.c
1177 ↗(On Diff #9283)

Assert that addr2 is page-aligned? It currently follows from how addr2 is calculated, but that might no longer hold when somebody else inherits the code.
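
For illustration, such an assertion could look roughly like the following (assuming this sits in the function where addr2 is computed; the message text is just a placeholder):

KASSERT((addr2 & PAGE_MASK) == 0,
    ("addr2 %#jx is not page aligned", (uintmax_t)addr2));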

This revision is now accepted and ready to land. Oct 11 2015, 9:04 AM

I like the memdesc idea, or at least something like it. I also agree that it's too tall an order to do everywhere right now, especially if Xen needs this right away.

There are some busdma cleanups I want to make once my support-unmapped-buffers-everywhere crusade is over. Some of that involves getting rid of duplicate code, and I think the memdesc approach would fit in nicely with that. But I'm about to change jobs, so who knows when that's going to happen.

jah edited edge metadata.

@royger Sorry for asking, but how can I test it? Do you have a patch that could be applied on top of -CURRENT sources?

Thinking about this some more, I wonder if it would be better to have add_bounce_page() do coalescing in much the same way as _bus_dmamap_addseg() already does. You'd still have the 2-page array in struct bounce_page, but add_bounce_page() would go back to just taking one address. It would look at the tail of map->bpages (if that exists). If that last bounce page can still fit the new segment with the right alignment, then its datacount will be increased and datapage[1] will be set if necessary. Otherwise, a new bounce page will be pulled from the bz queue.
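
A rough standalone model of that coalescing decision (illustrative userland C with made-up names; the alignment check and the datapage[1] bookkeeping are omitted for brevity):

#include <stdbool.h>
#include <stdio.h>

#define PAGE_SIZE 4096

/* Stand-in for the tail entry of map->bpages. */
struct bounce_tail {
    size_t datacount;    /* bytes already staged in this bounce page */
};

/*
 * Extend the tail bounce page if the new data still fits in it; otherwise
 * the caller would pull a fresh bounce page from the bounce zone.
 */
static bool
coalesce(struct bounce_tail *tail, size_t seglen)
{
    if (tail == NULL || tail->datacount + seglen > PAGE_SIZE)
        return false;
    tail->datacount += seglen;
    return true;
}

int
main(void)
{
    struct bounce_tail tail = { .datacount = 0 };

    /* Eight bounced 512-byte segments share one bounce page instead of eight. */
    for (int i = 0; i < 8; i++)
        printf("segment %d coalesced: %s\n", i,
            coalesce(&tail, 512) ? "yes" : "no");
    printf("bytes staged in the single bounce page: %zu\n", tail.datacount);
    return 0;
}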

With that scheme, you wouldn't need the dedicated load_ma or count_ma functions; existing load_phys and load_buffer would "do the right thing". We'd also make more efficient use of bounce pages when maxsegsz is much smaller than a page. Right now, if maxsegsz is, say, 512 and those 512-byte segments get bounced, each one will waste a whole bounce page. Of course, I doubt that's a common use case.

A downside would be that you'd probably need to duplicate some of the coalescing logic in count_phys and count_pages to avoid over-requesting bounce pages.

Just a thought I had; there are probably holes in that scheme that I haven't thought of.

One of the patches fails to apply for me:

[root@pd]❯❯❯ svn patch /tmp/roy1                                                                                                                           src/svn/head:289313
C         sys/x86/x86/busdma_bounce.c
>         applied hunk @@ -526,6 +526,51 @@ with offset 1
>         applied hunk @@ -700,7 +745,7 @@ with offset 1
>         applied hunk @@ -719,6 +764,90 @@ with offset 1
>         applied hunk @@ -762,6 +891,7 @@ with offset 1
>         applied hunk @@ -777,17 +907,35 @@ with offset 1
>         applied hunk @@ -797,17 +945,35 @@ with offset 1
>         applied hunk @@ -971,7 +1137,7 @@ with offset 1
>         applied hunk @@ -1001,12 +1167,15 @@ with offset 1
>         applied hunk @@ -1078,7 +1247,7 @@ with offset 1
>         rejected hunk @@ -631,7 +676,7 @@
Summary of conflicts:
  Text conflicts: 1
[root@pd]❯❯❯
In D888#80533, @jah wrote:

Thinking about this some more, I wonder if it would be better to have add_bounce_page() do coalescing in much the same way as _bus_dmamap_addseg() already does. You'd still have the 2-page array in struct bounce_page, but add_bounce_page() would go back to just taking one address. It would look at the tail of map->bpages (if that exists). If that last bounce page can still fit the new segment with the right alignment, then its datacount will be increased and datapage[1] will be set if necessary. Otherwise, a new bounce page will be pulled from the bz queue.

With that scheme, you wouldn't need the dedicated load_ma or count_ma functions; existing load_phys and load_buffer would "do the right thing". We'd also make more efficient use of bounce pages when maxsegsz is much smaller than a page. Right now, if maxsegsz is, say, 512 and those 512-byte segments get bounced, each one will waste a whole bounce page. Of course, I doubt that's a common use case.

A downside would be that you'd probably need to duplicate some of the coalescing logic in count_phys and count_pages to avoid over-requesting bounce pages.

Just a thought I had; there are probably holes in that scheme that I haven't thought of.

The solution you are proposing seems fine, but I don't think I will have time to look into it for a couple of weeks; do you mind if I commit this now so we can get unmapped I/O for blkfront?

In D888#82509, @royger wrote:

The solution you are proposing seems fine, but I don't think I will have time to look into it for a couple of weeks; do you mind if I commit this now so we can get unmapped I/O for blkfront?

Yes, my acceptance of the review still stands.

This revision was automatically updated to reflect the committed changes.