
busdma: Avoid leaking bounce pages when destroying DMA tags
Accepted · Public

Authored by markj on Nov 11 2024, 10:44 PM.

Details

Summary

busdma tags may have an attached "bounce zone" which manages bounce page
allocations. However, the bounce zone does not get cleaned up when the owning
busdma tag is destroyed. Fix the leak:

  • Add a destroy_bounce_zone() implementation and modify MD busdma code to use it.
  • Make sure that bounce_zone_list accesses are locked by the bounce mutex; previously there was no explicit serialization (though the bus_topo lock probably made this relatively safe).
  • Add refcounting to bounce zones so that busdma tags can safely share them (a rough sketch of the combined scheme follows the summary).

PR: 278569
MFC after: 1 month
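
Below is a minimal sketch of the lookup/teardown scheme the summary describes, assuming a per-zone reference count; the helper and field names (bz_matches_constraints(), refcount, links, free_bounce_zone_resources()) are illustrative and not necessarily those used in the patch itself.

static struct bounce_zone *
acquire_bounce_zone(bus_dma_tag_t dmat)
{
	struct bounce_zone *bz;

	mtx_lock(&bounce_lock);
	STAILQ_FOREACH(bz, &bounce_zone_list, links) {
		if (bz_matches_constraints(bz, dmat)) {	/* hypothetical */
			bz->refcount++;
			mtx_unlock(&bounce_lock);
			return (bz);
		}
	}
	mtx_unlock(&bounce_lock);
	return (NULL);	/* caller allocates a fresh zone with refcount == 1 */
}

static void
release_bounce_zone(struct bounce_zone *bz)
{
	mtx_lock(&bounce_lock);
	if (--bz->refcount > 0) {
		mtx_unlock(&bounce_lock);
		return;
	}
	/* Last reference: unlink the zone, then tear it down unlocked. */
	STAILQ_REMOVE(&bounce_zone_list, bz, bounce_zone, links);
	mtx_unlock(&bounce_lock);
	free_bounce_zone_resources(bz);	/* hypothetical */
}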

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

This revision is now accepted and ready to land. (Nov 12 2024, 12:06 AM)

Thanks for handling this one. I had looked at it previously, and the one thing I was uncertain about was the interaction with busdma_thread()/deferred loads. It appears to me there is one unlocked reference to bz->deferred_time which could race with bus_dmamap_destroy(), although it would be exceedingly rare. Can you comment on this?

I think this race is not new with this patch. That loop can also race with bus_dmamap_destroy() and bus_dma_tag_destroy().

More generally, I believe it's only a problem if consumers misuse the busdma interface. A map gets added to the bounce zone's bounce_map_waitinglist when a bus_dmamap_load() fails to reserve enough bounce pages (presumably because another consumer is using them), and BUS_DMA_WAITOK was specified. Later, when a bus_dmamap_unload() call releases some bounce pages owned by the same zone, the busdma layer can try again to reserve pages; this happens in the busdma thread so that the callback is executed in a safe context.

To trigger the race, a consumer would have to call bus_dmamap_load(callback, BUS_DMA_WAITOK), get EINPROGRESS, and then destroy the map and tag (releasing the last reference on the bounce zone) before the callback is invoked. I'd argue that it's the consumer's responsibility to make sure that this doesn't happen, though the documentation doesn't explicitly state this. If so, I believe there is no bug here.
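
For concreteness, a sketch of that misuse pattern; the softc fields and the callback are hypothetical, and error handling is omitted:

	error = bus_dmamap_load(sc->dmat, sc->map, sc->buf, sc->buflen,
	    mydev_load_cb, sc, BUS_DMA_WAITOK);
	if (error == EINPROGRESS) {
		/*
		 * The load was deferred: mydev_load_cb will run later from
		 * the busdma thread once bounce pages become available.
		 * Destroying the map and tag now drops the last reference on
		 * the bounce zone while the deferred load still points at it.
		 */
		bus_dmamap_destroy(sc->dmat, sc->map);	/* too early */
		bus_dma_tag_destroy(sc->dmat);		/* too early */
	}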

Agreed; thank you!

FWIW, I do think the original design intentionally never destroyed these things: the thought was that you wouldn't be unloading the drivers that used them if your system used bounce pages at all. That assumption may no longer hold, though, and this seems fine to me.

sys/kern/subr_busdma_bounce.c
209–212

This might now start multiple kthreads, which would be a problem (I think). That is, you might destroy all the tags referencing a zone, draining this list back to empty, then create a new tag that needs a zone, and you will create another kthread. Instead, this probably needs another sentinel, maybe just a global static bool kthread_started; or the like.
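
Roughly what I have in mind, assuming the thread is created with kproc_create(); the helper name and the exact creation arguments are illustrative:

static bool busdma_kthread_started;

static void
start_busdma_kthread_once(void)
{
	struct proc *p;

	mtx_lock(&bounce_lock);
	if (busdma_kthread_started) {
		mtx_unlock(&bounce_lock);
		return;
	}
	busdma_kthread_started = true;
	mtx_unlock(&bounce_lock);

	/* kproc_create() may sleep, so call it after dropping the mutex. */
	if (kproc_create(busdma_thread, NULL, &p, 0, 0, "busdma") != 0)
		panic("could not create busdma kthread");
}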

220–221

Note that in this case we don't actually clean up the zone but leave it lying around, and we also fail to start the kthread if needed. We'd be better off just panicking here I think.

290

If you do the STAILQ_REMOVE from bounce_zone_list first, you can drop the lock sooner. The lock isn't really needed for any of the other cleanup you are doing.
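
Sketched (the links field and the M_DEVBUF malloc type are assumptions about the surrounding code):

	mtx_lock(&bounce_lock);
	STAILQ_REMOVE(&bounce_zone_list, bz, bounce_zone, links);
	mtx_unlock(&bounce_lock);

	/*
	 * The rest of the teardown (sysctl nodes, the free page list, the
	 * zone itself) only touches the now-unreachable zone, so it can run
	 * without the bounce lock.
	 */
	free(bz, M_DEVBUF);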

sys/kern/subr_busdma_bounce.c
176

This was racy before in that it could create multiple zones with the same constraints.

577

Given the lock here, I guess it's ok if there are multiple kthreads, but kind of pointless to have them.

markj marked 3 inline comments as done.

Handle John's comments.

This revision now requires review to proceed. (Nov 15 2024, 2:40 AM)
This revision is now accepted and ready to land. (Nov 15 2024, 5:02 PM)
sys/x86/x86/busdma_bounce.c
332

So, it turns out that this patch doesn't fully fix the problem of leaking bounce pages. The root of the problem is here: every time a map is created, we preallocate at least one page for its use, no matter how many free pages are already in the zone. In particular, we don't know how many other maps/tags are sharing the zone, so we don't know exactly how many pages we might actually need.

When destroying the map/tag, we don't free any pages from the zone, so a loop that creates a tag and map requiring bounce pages and then destroys them, without destroying the zone, will leak pages.

In particular, my patch fixes the reproducer from bugzilla PR 278569, but the reproducer could be modified slightly to reintroduce the leak (e.g., by having a second busdma tag which holds a reference on the bounce zone).
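
For concreteness, a sketch of such a modified reproducer (the tag parameters, the dev handle, and error handling are illustrative; assume a machine with memory above 4 GB so that the 32-bit lowaddr forces bouncing):

	bus_dma_tag_t keeper, tag;
	bus_dmamap_t keepmap, map;

	/* "keeper" and its map hold a reference on the bounce zone throughout. */
	bus_dma_tag_create(bus_get_dma_tag(dev), 1, 0, BUS_SPACE_MAXADDR_32BIT,
	    BUS_SPACE_MAXADDR, NULL, NULL, PAGE_SIZE, 1, PAGE_SIZE, 0, NULL,
	    NULL, &keeper);
	bus_dmamap_create(keeper, 0, &keepmap);

	for (int i = 0; i < 10000; i++) {
		bus_dma_tag_create(bus_get_dma_tag(dev), 1, 0,
		    BUS_SPACE_MAXADDR_32BIT, BUS_SPACE_MAXADDR, NULL, NULL,
		    PAGE_SIZE, 1, PAGE_SIZE, 0, NULL, NULL, &tag);
		bus_dmamap_create(tag, 0, &map);   /* preallocates >= 1 bounce page */
		bus_dmamap_destroy(tag, map);      /* pages stay in the zone */
		bus_dma_tag_destroy(tag);          /* zone survives: keeper references it */
	}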

I wonder if the solution might be to simply get rid of this MIN_ALLOC_COMP mechanism. In particular, this flag is stored in the busdma tag, which might have many DMA maps associated with it, so it only applies to the first map. Do we really need it?

sys/x86/x86/busdma_bounce.c
332

Hmmm, I think it's the pages = MAX(pages, 1) that gets you into trouble?

It looks like the purpose of the BUS_DMA_MIN_ALLOC_COMP flag is to avoid double-allocating for the first map: if we already allocated bounce pages when creating the tag via BUS_DMA_ALLOCNOW, the flag makes the first map creation skip the allocation.
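
Paraphrasing that mechanism (the helper name, field names, and error handling are assumptions, not the actual busdma_bounce.c source):

	if ((flags & BUS_DMA_ALLOCNOW) != 0) {
		pages = MAX(atop(newtag->common.maxsize), 1);
		if (alloc_bounce_pages(newtag, pages) < pages)
			error = ENOMEM;
		else
			newtag->bounce_flags |= BUS_DMA_MIN_ALLOC_COMP;
	}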

I wonder if the || really should be &&. The current || is due to a revert:

commit eae22c443014034ab08894be5cb03940869ec977
Author: Svatopluk Kraus <skra@FreeBSD.org>
Date:   Mon Nov 23 11:19:00 2015 +0000

    Revert r291142.
    
    The not quite consistent logic for bounce pages allocation is utilizited
    by re(4) interface which can hang now.
    
    Approved by:    kib (mentor)

Notes:
    svn path=/head/; revision=291193

That reverted commit seems to be fixing this exact edge case, and I'm not sure why it needed to be reverted, TBH:

commit 6fa7734d6fbbec1e34bfee33427969ac9a92ff80
Author: Svatopluk Kraus <skra@FreeBSD.org>
Date:   Sat Nov 21 19:55:01 2015 +0000

    Fix BUS_DMA_MIN_ALLOC_COMP flag logic. When bus_dmamap_t map is being
    created for bus_dma_tag_t tag, bounce pages should be allocated
    only if needed.
    
    Before the fix, they were allocated always if BUS_DMA_COULD_BOUNCE flag
    was set but BUS_DMA_MIN_ALLOC_COMP not. As bounce pages are never freed,
    it could cause memory exhaustion when a lot of such tags together with
    their maps were created.
    
    Note that there could be more maps in one tag by current design.
    However BUS_DMA_MIN_ALLOC_COMP flag is tag's flag. It's set after
    bounce pages are allocated. Thus, they are allocated only for first
    tag's map which needs them.
    
    Approved by:    kib (mentor)

Notes:
    svn path=/head/; revision=291142

I wonder what is up with re(4). (And why does a network driver need bounce pages!!!!)

If re(4)'s issue is that it is using bounce pages for RX buffers and needs more of them, then I think it is too broken to use bus_dma for its RX buffers. Instead, it just needs to break down and use contigmalloc or the like and copy packets into mbufs that it passes up the network stack. Bounce buffers for RX mbufs are the same thing in terms of memory-copy cost and memory overhead, but round-tripping through bus_dma to achieve that would add terrible overhead. I can't really imagine that re(4) is so broken, though? It would be good to understand the case where re(4) needs bounce pages. If it is for transmit, then maybe we just need a way for re(4) to explicitly ask for at least one bounce page per map via a new flag on the tag, rather than breaking all tags for everyone else.

sys/x86/x86/busdma_bounce.c
332

Yes, it's pages = MAX(pages, 1) that causes problems, sorry for not being clear.

And, apparently it was a bug report from me that caused the revert. I had a system with two re(4) interfaces, but different devices. The problematic one was an RTL8169, which of course has since died. Looking at if_re.c, I suspect that the difference is that for non-PCIe devices, the driver sets lowaddr = BUS_SPACE_MAXADDR_32BIT for its root busdma tag, so it'd have BUS_DMA_COULD_BOUNCE set. (And this particular system had 8 or 16GB of RAM, so it might indeed end up bouncing.)

After the revert, skra@ sent me a follow-up patch which apparently fixed the problem but never got committed for some reason. The condition for deciding whether to allocate bounce pages effectively becomes:

if ((dmat->bounce_flags & BUS_DMA_MIN_ALLOC_COMP) == 0 ||
     (bz->map_count > 0 && bz->total_bpages < maxpages))

and I believe that would also fix the leak.

sys/x86/x86/busdma_bounce.c
332

Isn't the code you quoted above the current code? Did you paste the wrong thing from the follow-up patch?

sys/x86/x86/busdma_bounce.c
332

Sigh, yes, sorry. It should be:

if (((dmat->bounce_flags & BUS_DMA_MIN_ALLOC_COMP) == 0 ||
     bz->map_count > 0) && bz->total_bpages < maxpages)

sys/x86/x86/busdma_bounce.c
332

So the effect is to always honor bz->total_bpages < maxpages? It also seems to mean you don't need the pages = MAX(pages, 1) line after that, since pages = MIN(maxpages - bz->total_bpages, pages); will now always ensure pages is non-zero.
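
As a worked example of that claim (the numbers are illustrative): with maxpages = 32 and bz->total_bpages = 31, the bz->total_bpages < maxpages guard still passes, and pages = MIN(maxpages - bz->total_bpages, pages) = MIN(1, pages), which is at least 1 for any positive incoming pages; the subsequent pages = MAX(pages, 1) therefore no longer changes anything.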

sys/x86/x86/busdma_bounce.c
332

Yes. I think I see why the commit was reverted: it changed the check for "do I reserve bounce pages?" to:

if ((dmat->bounce_flags & BUS_DMA_MIN_ALLOC_COMP) == 0 &&
    bz->map_count > 0 && bz->total_bpages < maxpages)

but that's too strict. If we didn't prealloc bounce pages at tag creation time, and if this is the first busdma mapping attached to the bounce zone, then we won't reserve pages since bz->map_count == 0.

I posted https://reviews.freebsd.org/D48182 to fix this.

342

Yet another bug: this increment needs to be atomic, but there's nothing serializing it aside from the bus topology lock (assuming that busdma maps are created from DEVICE_ATTACH, which is usually but not always true).
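
One way to address this, sketched assuming the counter in question is bz->map_count declared as a u_int, and that a relaxed atomic is sufficient here:

	/* In the map-creation path, instead of a plain bz->map_count++: */
	atomic_add_int(&bz->map_count, 1);

	/* And the matching decrement when the map is destroyed: */
	atomic_subtract_int(&bz->map_count, 1);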