
mpr: Don't consume entire DMA bounce buffer memory
Accepted · Public

Authored by tpearson_raptorengineering.com on Thu, Feb 5, 5:50 PM.

Details

Reviewers
jhibbits
slm
ziaee
imp
chs
Group Reviewers
manpages
Summary

The buffer DMA tag was set to 32 bit maximum bus space, or roughly
4GB of memory. Not only is this excessive, but it also consumes
the entire bounce buffer region in low memory, starving any other
DMA-capable devices that attempt to attach after mpr0 attaches.

Add a tunable for the DMA buffer size, and set it to 64 * MAXPHYS as
a sane default.

Tested to function normally on a Raptor Computing Systems Blackbird
with an LSI SAS3008 controller and Seagate 500GB SATA drive.

Signed-off-by: Timothy Pearson <tpearson@raptorengineering.com>
Sponsored-by: Raptor Computing Systems, LLC

Diff Detail

Repository
rG FreeBSD src repository
Lint: Skipped
Unit Tests: Skipped

Event Timeline

Example boot, hotplug, and hot removal of a disk device with this patch applied. Note that without this patch the driver won't load, since ~2GB of bounce buffers are allocated and the DMA template tag request fails.

mpr0: <Avago Technologies (LSI) SAS3008> mem 0x80140000-0x8014ffff,0x80100000-0x8013ffff irq 1040376 at device 0.0 numa-domain 0 on pci4
pci4: child mpr0 requested type 3 for rid 0x10, but the BAR says it is an ioport
mpr0: attempting to allocate 1 MSI-X vectors (96 supported)
mpr0: using IRQ 35637240 for MSI-X
mpr0: Firmware: 16.00.01.00, Driver: 23.00.00.00-fbsd
mpr0: IOCCapabilities: 7a85c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,MSIXIndex,HostDisc,FastPath,RDPQArray>
mpr0: Found device <0,No Device> <3.0Gbps> handle<0x0009> enclosureHandle<0x0000> slot 0
(probe0:mpr0:0:0:0): Down reving Protocol Version from 4 to 0?
pass1 at mpr0 bus 0 scbus0 target 0 lun 0
da0 at mpr0 bus 0 scbus0 target 0 lun 0
mpr0: mprsas_prepare_remove: Sending reset for target ID 0
da0 at mpr0 bus 0 scbus0 target 0 lun 0
pass1 at mpr0 bus 0 scbus0 target 0 lun 0
(pass1:mpr0:0:0:0): Periph destroyed
(da0:mpr0:0:0:0): Periph destroyed
pass1 at mpr0 bus 0 scbus0 target 0 lun 0
da0 at mpr0 bus 0 scbus0 target 0 lun 0
(da0:mpr0:0:0:0): CACHE PAGE TOO SHORT data len 222 desc len 222
(da0:mpr0:0:0:0): Mode page 8 missing, disabling SYNCHRONIZE CACHE
mpr0: No pending commands: starting remove_device for target 0 handle 0x0009
mpr0: clearing target 0 handle 0x0009

So 16MB is too small if this is shared across all transactions. We run with MAXPHYS at 2MB and often have many 2MB transactions in flight for the same drive, and 24 drives in a system...

Understood. What would a sane value be here? 4GB is way too large (as an example, the allocation we actually get on a server that doesn't have a ton of other cards installed is "only" ~1.8GB), would 128MB or 512MB be workable?


Ideally, we'd make it be a tuneable that would default to, say 64 * MAXPHYS. We could tune that higher. Or failing that, a kernel option we could compile to a higher value. 64MB likely would be plenty for most people, even many folks that tune the system or have a fair number of drives. We'd likely set it to 256MB, but I'd have to go look to see how much we have in flight at any time.

Updated to move this to a tunable. From what I can see MAXPHYS is only 128k, and is not available to drivers as a constant. As a result, we actually end up only allocating 8MB by default; if I missed something just let me know and I'll update the MAXPHYS to match...

The man subsystem has a lot of lore, if you tag manpages when your patch changes a manual, we'll help you with it! It's easier for us if you use -U9999 so we can get the full context in phabricator.

share/man/man4/mpr.4
228–232
To set the maximum size of the DMA buffer in bytes, set the
.Va dev.mpr.X.dma_buffer_size
tunable in
.Xr loader.conf 5 .

Sysctls and tunables are always tagged with Va (except when appearing in SYNOPSIS), that way you can search the manual for them with apropos Va=here.is.the.tunable.
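For example, an illustrative loader.conf entry using the tunable name from the diff (the unit number 0 and the 64 MiB value are placeholders, not recommendations):

```
# /boot/loader.conf
dev.mpr.0.dma_buffer_size="67108864"    # 64 MiB
```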

Update manpage and fix tunable registration

The man subsystem has a lot of lore, if you tag manpages when your patch changes a manual, we'll help you with it! It's easier for us if you use -U9999 so we can get the full context in phabricator.

Great to know, thanks! The syntax is indeed quite arcane; I did the best I could to start with...

manual changes LGTM.

Digging deeper I see this manual needs its SYSCTLS and TUNABLES refactored. I'll do that after this is done so nobody has to rebase.

share/man/man4/mpr.4
233

Oops, sorry, dma_buffer_size on this line should also be:

The default
.Va dma_buffer_size
value is...

for rendering consistency.

This revision is now accepted and ready to land. Thu, Feb 5, 7:26 PM

Updated to move this to a tunable. From what I can see MAXPHYS is only 128k, and is not available to drivers as a constant. As a result, we actually end up only allocating 8MB by default; if I missed something just let me know and I'll update the MAXPHYS to match...

On LP64 platforms (such as POWER9), MAXPHYS is 1MB (see sys/sys/_maxphys.h). But could you instead just make it 64 * maxphys (the tuned variable)?

sys/dev/mpr/mpr.c
1892

You can make this CTLFLAG_RDTUN, then you don't need to add that TUNABLE_INT_FETCH up at line 1824, as the sysctl machinery will do the work for you. Just move this up earlier in the mpr_attach() function, somewhere before the mpr_iocfacts_allocate() call.
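As a sketch of what the suggested change might look like (not the committed diff; sysctl_ctx, sysctl_tree, and the dma_buffer_size field are assumed names patterned on the existing mpr sysctl registrations):

```c
/*
 * Sketch only.  With CTLFLAG_RDTUN the sysctl machinery fetches the
 * dev.mpr.X.dma_buffer_size tunable from the loader environment
 * itself, so the separate TUNABLE_INT_FETCH call goes away.
 */
SYSCTL_ADD_INT(sysctl_ctx, SYSCTL_CHILDREN(sysctl_tree), OID_AUTO,
    "dma_buffer_size", CTLFLAG_RDTUN, &sc->dma_buffer_size, 0,
    "Maximum size of the DMA buffer in bytes");
```

As the comment above notes, this registration would need to move earlier in mpr_attach(), before the mpr_iocfacts_allocate() call, so the value is in place when the tag is created.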

sys/dev/mpr/mpr.c
1892

I'm aware of CTLFLAG_RDTUN, but didn't want to change style from the rest of this file. Which approach should I take, make a one-off different sysctl or follow the existing style?

On LP64 platforms (such as POWER9), MAXPHYS is 1MB (see sys/sys/_maxphys.h). But could you instead just make it 64 * maxphys (the tuned variable)?

Whoops! I'd missed that header when I was trying to avoid redefining MAXPHYS locally. Will do!

This revision now requires review to proceed. Fri, Feb 6, 7:47 PM
jhibbits added a reviewer: imp.

Looks fine to me, but I want Warner's opinion, too.

sys/dev/mpr/mpr.c
1892

Ah, without the extra context (git diff -U99999) I missed the TUNABLE_FETCH() of the same sysctl in the above block.

IMO that other block needs to be replaced with the CTLFLAG_RDTUN, for all the sysctls, but that's not your problem here.

This revision is now accepted and ready to land. Sun, Feb 8, 7:14 PM

I shared this with Chuck Silvers and he had some concerns I thought I understood, but that I'm having trouble articulating, so I'm adding him to the review rather than holding things up by puzzling through them and maybe getting it wrong.

In D55132#1261784, @imp wrote:

I shared this with Chuck Silvers and he had some concerns I thought I understood, but that I'm having trouble articulating, so I'm adding him to the review rather than holding things up by puzzling through them and maybe getting it wrong.

Would definitely like to hear those concerns -- without this patch, the entire low memory space is consumed by DMA bounce buffers, and all other devices (AHCI, network, etc.) are starved, meaning we generally fail to boot on any OpenPOWER hardware where an LSI SAS card is physically installed (!).

The question I have is why any bounce buffer memory is being allocated at all by the creation of this tag. The code that sets up the template for creating buffer_dmat does not specify any constraints that would need bounce-buffering. Only low/highaddr and alignment can cause bounce-buffering to be needed, and the template does not set any of those. So the constraint that leads to the allocation of bounce-buffer memory must be inherited from some parent device. The only one I see that looks possibly relevant for your system is the one in opalpci_attach(), but that is supposedly only for POWER8 systems and your Blackbird system is supposed to be a POWER9 system. So I think it would help to find where the constraint that is triggering bounce-buffering is coming from.

I tried loading the mpr driver after boot as a module on an amd64 system here, and confirmed that on this platform no bounce-buffer memory is allocated during driver attach even with the existing code. I would be surprised if a POWER9 system really had a DMA limitation like this. Hopefully we'll find that the constraint is not actually needed and we can just remove it from whatever code is adding it, and then the mpr driver would behave on your system the same way it does on amd64, not allocating any bounce memory at all during driver attach.

In D55132#1263488, @chs wrote:

I tried loading the mpr driver after boot as a module on an amd64 system here, and confirmed that on this platform no bounce-buffer memory is allocated during driver attach even with the existing code. I would be surprised if a POWER9 system really had a DMA limitation like this. Hopefully we'll find that the constraint is not actually needed and we can just remove it from whatever code is adding it, and then the mpr driver would behave on your system the same way it does on amd64, not allocating any bounce memory at all during driver attach.

So the constraint we have is that the upper half of the 32-bit address space is consumed by other hardware functions (notably PCIe MMIO), which on a true 64-bit system shouldn't be much of a limitation. While things sort of "work" on POWER9 with the DMA tag restriction removed, this is by chance, and depending on the hardware configuration of the system these regions will conflict, causing a system hang (full lockup, no activity) as soon as DMA starts to an overlapping device. As a result, I'm running a kernel with the DMA tag turned back on, which resolves that problem but introduces the one I'm seeing here.

What I don't understand is why x86 works while we don't. A 4GB region isn't going to fit in 4GB low memory with any other allocations, even on x86, so I assume it's going into 64-bit memory somehow? On our platform, from what I can see the DMA tag doesn't lock out 64-bit memory space, and if I'm reading mpr correctly we can go up to 33 bit addressing, so I'm a bit unclear as to why we're not being punted above 4GB vs. allocating a low memory bounce buffer.

Note for reference I'm also running with D54745 and D55084 applied. It's entirely possible I missed something important in one of those patches.

I just tried the stock kernel again without either of those revisions (so no DMA tag created), and the bounce buffer fails to allocate, causing mpr0 to not load:

mpr0: <Avago Technologies (LSI) SAS3008> mem 0x80140000-0x8014ffff,0x80100000-0x8013ffff irq 1040376 at device 0.0 numa-domain 0 on pci4
mpr0: Firmware: 16.00.01.00, Driver: 23.00.00.00-fbsd
mpr0: IOCCapabilities: 7a85c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,MSIXIndex,HostDisc,FastPath,RDPQArray>
mpr0: IOC Facts allocation failed with error 12
device_attach: mpr0 attach returned 12

The issue must be deeper within the DMA subsystem; it's not as simple as the DMA tag in opalpci_attach(). I'll keep digging, but any thoughts are welcome here. Thanks!

OK, so tracing the code flows here I'm guessing this works on x86 only because IOMMU support is enabled on that platform. I would expect we would see the same problem on arm64, riscv64, and any other platforms without IOMMU support enabled. Also it looks like this is one of the only drivers in tree to use BUS_DMA_TEMPLATE_FILL.

Is there any way to test the LSI cards on arm64 to confirm my suspicion? While we do have intentions to enable the POWER9 IOMMU, this will take some time, and we want to be able to provide server resources to FreeBSD which use the LSI controllers vs. forcing the use of NVMe or similar on our platform in the interim.

IOMMU support exists for amd64, but it is not enabled by default. No bounce buffer memory is allocated by mpr attach with IOMMU support disabled, and mpr works fine without any IOMMU.

I didn't realize that you had other patches under review that you were using together with this one. With your change to enable the creation of the DMA tag in opalpci_attach() for POWER9 as well as POWER8, it makes sense that this is where the address constraint would be coming from. However, since you still get bounce buffer memory allocated by mpr even without that patch, there must be something else that is also adding a constraint that is triggering bounce buffering. Are you sure that that DMA tag in opalpci_attach() is not being created for your platform in the existing code? If the constraint isn't coming from there then we still need to figure out where it's coming from.

The MMIO address range should not need to be explicitly excluded from any bus_dma tag, because according to the memory map that you showed in D54745 there is no normal RAM in that range that could be involved in DMA, so no bounce buffer memory allocated because of the constraint excluding that range would ever be used. However, since the change that added that tag in opalpci_attach() via D17601 reportedly did fix some problem on the POWER8 system it was made for, there has to be some other device that does set up DMA via bus_dma to some address in that range. I have no idea what that other driver would be, though. In the other review you described the contents of this range below 4GB as a "32-bit MMIO window", but I thought that would be things like device registers, which don't really make sense to access via DMA from a device using bus_dma. Could you explain what is in this address range that does make sense to access via DMA from a device?