Page MenuHomeFreeBSD

Use memory locality info provided in the FDT.
AcceptedPublic

Authored by markj on Jul 13 2018, 7:12 PM.
Tags
None
Referenced Files
Unknown Object (File)
Dec 24 2023, 9:04 AM
Unknown Object (File)
Dec 23 2023, 1:39 AM
Unknown Object (File)
Nov 19 2023, 2:24 PM
Unknown Object (File)
Nov 8 2023, 12:50 PM
Unknown Object (File)
Nov 6 2023, 6:52 PM
Unknown Object (File)
Nov 6 2023, 1:35 PM
Unknown Object (File)
Nov 6 2023, 12:47 PM
Unknown Object (File)
Nov 6 2023, 5:57 AM
Subscribers

Details

Reviewers
andrew
jhb
imp
alc
manu
mhorne
Group Reviewers
arm64
Summary

On the ThunderX, the FDT provides NUMA locality info in the device nodes
corresponding to ranges of physical memory. This change follows the
example of the x86 SRAT parsing code and uses a SYSINIT to search for
this info and populate the mem_info table before the VM is initialized.

I refactored fdt_get_mem_regions() to avoid duplicating code. In so
doing, I changed it to iterate over all children of the root node, since
multiple memory nodes may exist. (This is akin to ofw_mem_regions() in
the powerpc platform code.) On the ThunderX we use the memory map from
EFI to populate phys_avail, so this problem wasn't noticed.

I don't attempt anything like renumber_domains() in the x86 code, since
there are multiple devices with a numa-node-id property, and we don't
perform that compaction elsewhere.

Test Plan

On my ThunderX I get:

[root@markj ~]# sysctl vm.phys_segs
vm.phys_segs:
SEGMENT 0:

start:     0xd00000
end:       0x1000000
domain:    0
free list: 0xffff000000ccc840

SEGMENT 1:

start:     0x1000000
end:       0xfffec000
domain:    0
free list: 0xffff000000ccc600

SEGMENT 2:

start:     0x100000000
end:       0xfff100000
domain:    0
free list: 0xffff000000ccc600

SEGMENT 3:

start:     0x10000400000
end:       0x10f16f80000
domain:    1
free list: 0xffff000000ccca80

SEGMENT 4:

start:     0x10feae21000
end:       0x10ffe74d000
domain:    1
free list: 0xffff000000ccca80

SEGMENT 5:

start:     0x10ffffff000
end:       0x11000000000
domain:    1
free list: 0xffff000000ccca80

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 36929
Build 33818: arc lint + arc unit

Event Timeline

markj edited the test plan for this revision. (Show Details)
markj added reviewers: andrew, jhb, imp, alc.
sys/dev/fdt/fdt_common.c
653

VM_PHYSSEG_MAX + 1? Isn't there code that looks for an empty entry as a terminator?

Does this architecture actually need VM_FREELIST_ISADMA?

In D16260#345214, @alc wrote:

Does this architecture actually need VM_FREELIST_ISADMA?

I'm not sure, but I suspect not.

sys/dev/fdt/fdt_common.c
653

Looking at vm_phys_domain_match() and vm_phys_create_seg() (the only routines that look at the affinity table), I can't see why we need the extra entry. srat.c has it though, hmm.

In D16260#345214, @alc wrote:

Does this architecture actually need VM_FREELIST_ISADMA?

From my reading of the code I don't think so, but I'm unsure it's intended purpose. I would expect riscv also doesn't need it.

In D16260#345214, @alc wrote:

Does this architecture actually need VM_FREELIST_ISADMA?

From my reading of the code I don't think so, but I'm unsure it's intended purpose.

It ensures that the first 16MB of physical memory is segregated from the rest and won't be allocated by the generic page allocator unless nothing else is available. Allocations by drivers that need memory in that range are thus more likely to succeed.

I've been thinking about replacing fdt_get_mem_region with fdt_foreach_mem_region to iterate over all memory regions and run a callback on each. It might pay to do this for get_mem_regions so we don't need to add a new fixed size struct fdt_mem_region array.

sys/dev/fdt/fdt_common.c
582

You can use root = OF_peer(0); to find the root.

In D16260#345214, @alc wrote:

Does this architecture actually need VM_FREELIST_ISADMA?

From my reading of the code I don't think so, but I'm unsure it's intended purpose.

It ensures that the first 16MB of physical memory is segregated from the rest and won't be allocated by the generic page allocator unless nothing else is available. Allocations by drivers that need memory in that range are thus more likely to succeed.

There is not guarantee on arm64 there will be physical memory in the first 16MB so those drivers are unlikely to work on arm64 without an IOMMU.

In D16260#345214, @alc wrote:

Does this architecture actually need VM_FREELIST_ISADMA?

From my reading of the code I don't think so, but I'm unsure it's intended purpose. I would expect riscv also doesn't need it.

VM_FREELIST_ISADMA is designed for the case of the floppy controller (or really old sound cards) that need DMA buffers in the first 16MB due to architectural limits. This limit is hard wired in vm_phys.c to be 16MB.

arm64 and riscv categorically do not need this. I'm surprised sparc64 does, but I see it has a fdc device.

They might need VM_FREELIST_DMA32, however, if they support more than 4GB and have devices that require pages in the first 4GB to do DMA.

On a related note: Why have both VM_FREELIST_ISADMA and VM_FREELIST_LOWMEM? They are redundant. Only MIPS defines LOWMEM, while others define ISADMA. Why not collapse the two cases and have the MD includes define VM_LOWMEM_BOUNDARY?

sparc64 does not need ISADMA either. There's actually no fdc driver that needs it.

sys/dev/fdt/fdt_common.c
653

Are you sure? Isn't the outer loop in vm_phys_domain_match()

/*
 * Check for any memory that overlaps low, high.
 */
for (i = 0; mem_affinity[i].end != 0; i++)

looking for a terminator?

markj marked an inline comment as done.

Add a terminator in the mem_affinity table.

sys/dev/fdt/fdt_common.c
653

Thanks, not sure how I missed that. :(

markj marked an inline comment as done.

Use OF_peer(0);

In D16260#345608, @imp wrote:
In D16260#345214, @alc wrote:

Does this architecture actually need VM_FREELIST_ISADMA?

From my reading of the code I don't think so, but I'm unsure it's intended purpose. I would expect riscv also doesn't need it.

VM_FREELIST_ISADMA is designed for the case of the floppy controller (or really old sound cards) that need DMA buffers in the first 16MB due to architectural limits. This limit is hard wired in vm_phys.c to be 16MB.

It also helps with gadgets that want physical memory within the first 1 MB. For example, I recall interacting with gallatin@ over such a gadget two or three years ago.

I've been thinking about replacing fdt_get_mem_region with fdt_foreach_mem_region to iterate over all memory regions and run a callback on each. It might pay to do this for get_mem_regions so we don't need to add a new fixed size struct fdt_mem_region array.

After looking at ofw_cpu_early_foreach(), I had considered that and mostly implemented it before going this route. To me a callback-based approach seems less suitable here since we're just filling out an array. In particular, there's some state that needs to be passed between consecutive invocations of the callback (the current array index), and it felt a bit ugly to implement both fdt_get_mem_regions() and the domain enumeration code that way, since both effectively just fill out arrays.

  • Remove errant modification to fdt_get_reserved_regions().

I had forgotten about this. Any objection to the patch, modulo my comment? @mhorne reports that it fixes booting on a multi-socket ThunderX when booting from an FDT. Otherwise, because the gicv3 driver's bus_get_domain() method returns the "true" NUMA domain ID, we attempt allocations from different NUMA domains with malloc_domainset(), but the VM system only knows about a single domain.

sys/dev/fdt/fdt_common.c
658

This should use vm_phys_register_domains(). That function didn't exist when this patch was written.

Use vm_phys_register_domains().

sys/dev/fdt/fdt_common.c
646

You will also need to propagate the results from fdtmr to mr.

Actually fill out the mem_region array.

sys/dev/fdt/fdt_common.c
647

This won't work for the case where memsize == NULL, which is true for !mips callers.

659

Is there any possibility of mischief from running both this and parse_pxm_tables? It does seem possible (but unlikely) that ACPI locality info could be overwritten by valid but less complete info from the DT.

markj added inline comments.
sys/dev/fdt/fdt_common.c
659

Yes, that seems possible. Might be that we should have an arm64 sysinit which calls fdt_get_mem_domains() and registers the info with vm_phys only if arm64_bus_method == ARM64_BUS_FDT.

Export fdt_get_mem_domains() and use it from initarm() instead of
a sysinit.

Handle a null memsize pointer correctly.

The latest version of the patch works on ThunderX, and fdt_get_mem_regions() works as expected on riscv.

The parsing code looks good to me.

This revision is now accepted and ready to land.Feb 12 2021, 5:23 PM
sys/dev/fdt/fdt_common.c
581

I don't think that this is correct.
Not all device_type memory should be taken into account here, some fpga related stuff have device_type set to memory but those should not be parsed by this code.
See sys/contrib/device-tree/Bindings/fpga/fpga-region.txt
I might have missed other stuff like that too.
The previous code that parse only /memory seems to be correct according to the device tree spec v0.3 (https://github.com/devicetree-org/devicetree-specification/releases/tag/v0.3)

sys/dev/fdt/fdt_common.c
602

Shouldn't regp be incremented?