Page MenuHomeFreeBSD

vm_reserv: Sparsify the vm_reserv_array when VM_PHYSSEG_SPARSE
ClosedPublic

Authored by scottph on Aug 20 2020, 6:52 AM.

Details

Summary

On an Ampere Altra system, the physical memory is populated
sparsely within the physical address space, with only about 0.4%
of physical addresses backed by RAM in the range [0, last_pa].

This is causing the vm_reserv_array to be over-sized by a few
orders of magnitude, wasting roughly 5 GiB on a system with
256 GiB of RAM.

The sparse allocation of vm_reserv_array is controlled by defining
VM_PHYSSEG_SPARSE, with the dense allocation still remaining for
platforms with VM_PHYSSEG_DENSE.

MFC after: 1 week
Sponsored by: Ampere Computing, Inc.

Diff Detail

Repository
rS FreeBSD src repository
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Instead of modifying vm_phys.c to have some knowledge of superpage reservations, could we instead allocation a reservation array per vm_phys_seg, at least in the VM_PHYSSEG_SPARSE case? I think that would keep the layering clean, and make it easy to solve the issue I pointed out in vm_reserv_from_page().

sys/vm/vm_reserv.c
1 ↗(On Diff #76018)

Stray newline.

503 ↗(On Diff #76018)

So here we are going to loop over the phys_seg array, but the vm_page already contains the segment index for the segment to which it belongs.

Instead of modifying vm_phys.c to have some knowledge of superpage reservations, could we instead allocation a reservation array per vm_phys_seg, at least in the VM_PHYSSEG_SPARSE case? I think that would keep the layering clean, and make it easy to solve the issue I pointed out in vm_reserv_from_page().

I reorganized the change so vm_phys.h now has a first_reserv field in vm_phys_seg, and all other logic is within vm_reserv.c

Thanks, I like this change much better.

sys/vm/vm_phys.h
72 ↗(On Diff #76185)

Will this compile if VM_NRESERVLEVEL == 0?

sys/vm/vm_reserv.c
237 ↗(On Diff #76185)

Style: the opening brace should be on its own line.

1442 ↗(On Diff #76185)

This looks incorrect since vm_phys_segs[] and phys_avail[] are basically the same set. phys_avail[] is populated earlier during boot and used to initialize vm_phys_segs[].

Looking at vm_page_startup(), I can kind of see why it was written this way: the vm_phys_segs[] array is not fully populated at this point. I think we should just be looping over phys_avail[] here.

sys/vm/vm_reserv.c
248 ↗(On Diff #76185)

Use #%jx instead of 0x%jx

1431 ↗(On Diff #76185)

int again

sys/vm/vm_reserv.c
503 ↗(On Diff #76018)

Even though the iteration has been eliminated, I'd rather not see VM_PHYSSEG_DENSE architectures pessimized by the extra complexity. Just like the vm_page_array, which method we use should be selected at compile time by VM_PHYSSEG_DENSE versus VM_PHYSSEG_SPARSE.

237 ↗(On Diff #76185)

This function is defined as static, but I don't see a caller in this file.

I'm curious to see the output from "sysctl vm.phys_segs" on this machine.

In D26130#581703, @alc wrote:

I'm curious to see the output from "sysctl vm.phys_segs" on this machine.

Posted here: https://bsd.to/k5Uw/raw

Kostik, Mark, do you know of any machines with more segments?

In D26130#581705, @alc wrote:

Kostik, Mark, do you know of any machines with more segments ?

It is not a lot of segments. On my relatively low-end desktop with 32G, I have 9 segments.

Very sparse structures are common for amd64 NVDIMMs, which were the reason for pmap_large_map() KPI. But they normally do not count in the usable memory segments.

In D26130#581720, @kib wrote:
In D26130#581705, @alc wrote:

Kostik, Mark, do you know of any machines with more segments ?

It is not a lot of segments. On my relatively low-end desktop with 32G, I have 9 segments.

I have several systems with 10 segments. I'm sure I have seen machines with more than that, but I don't have access to any at the moment.

In D26130#581720, @kib wrote:
In D26130#581705, @alc wrote:

Kostik, Mark, do you know of any machines with more segments ?

It is not a lot of segments. On my relatively low-end desktop with 32G, I have 9 segments.

I have several systems with 10 segments. I'm sure I have seen machines with more than that, but I don't have access to any at the moment.

And on my Apollo Lake home router, I have 10 segs.

One of the factors there is perhaps early boot allocs, which might fragment it.

I have suggested to dougm@ that he evaluate reordering the array as a treap where the priority is based on the size of the segment.

scottph retitled this revision from vm_reserv: Sparsify the vm_reserv_array to vm_reserv: Sparsify the vm_reserv_array when VM_PHYSSEG_SPARSE.
scottph edited the summary of this revision. (Show Details)
scottph added inline comments.
sys/vm/vm_phys.h
72 ↗(On Diff #76185)

Will this compile if VM_NRESERVLEVEL == 0?

72 ↗(On Diff #76185)

It did compile, it looks like vm_reserv_t is typedef'd as a pointer to an opaque struct unconditionally. Fixed here to only be included when VM_NRESERVLEVEL > 0

sys/vm/vm_reserv.c
503 ↗(On Diff #76018)

I've predicated the sparse paths on VM_PHYSSEG_SPARSE

237 ↗(On Diff #76185)

It was vestigial, deleted

1442 ↗(On Diff #76185)

It looks like sys/powerpc/aim/mmu_radix.c and sys/arm/arm/pmap-v6.c can vm_phys_add_seg() an entry into vm_phys_segs before it gets properly populated, if I understand correctly.

Other than the potential overallocation in vm_reserv_startup(), I think that this is fine.

sys/vm/vm_reserv.c
509 ↗(On Diff #76235)

VM_PAGE_TO_PHYS(m) instead of m->phys_addr

1089 ↗(On Diff #76235)

I would suggest initializing seg->first_reserv under VM_PHYSSEG_DENSE even though we don't currently use it.

1442 ↗(On Diff #76185)

The worst case scenario here is that we allocate twice as many vm_reserv structures as we need under VM_PHYSSEG_SPARSE.

scottph marked 3 inline comments as done.
scottph added inline comments.
sys/vm/vm_reserv.c
1442 ↗(On Diff #76185)

Looking more into vm_page.c, specifically lines 650-670 where size is computed with phys_avail and vm_phys_segs, and then the loop on lines 758-793 where (after vm_phys_segs has been fully populated) only memory that came from phys_avail is added to free lists, it appears that the purpose of machdep code putting memory directly in to vm_phys_segs is that it wants it to be represented by a vm_page, but not added to a free list.

I believe the intention is that phys_avail and vm_phys_segs are disjoint at this point in initialization, but there don't seem to be any asserts to ensure that this is the case.

markj added inline comments.
sys/vm/vm_reserv.c
1442 ↗(On Diff #76185)

After re-reading more carefully I think you're right.

This revision is now accepted and ready to land.Aug 27 2020, 4:44 PM
alc added inline comments.
sys/vm/vm_reserv.c
1433 ↗(On Diff #76265)

Similarly to pagecount in vm_page_startup(), this should just be u_long.

(A VM object can be larger than the physical memory size of some machines. On those machines vm_pindex_t is necessary to represent either a page's position within the VM object or the size of the VM object. Here, however, we are dealing with physical memory, and u_long is sufficient to represent the number of physical pages on any given architecture. (That said, in a lot of places we are still using u_int.))

1442 ↗(On Diff #76185)

There are a few situations where we need to have a struct vm_page corresponding to a physical page allocated early in the boot process even though the page will never be added to the free lists, for example, kernel page table pages underlying kernel mappings that might be promoted to a superpage. This is why pmap-v6.c is calling vm_phys_add_seg().

kib added inline comments.
sys/vm/vm_reserv.c
1433 ↗(On Diff #76265)

Here, however, we are dealing with physical memory, and u_long is sufficient to represent the number of physical pages on any given architecture. (That said, in a lot of places we are still using u_int.))

In fact not, recent SDM stopped limiting bits for phys addresses in PAE page tables in 32bit mode. Currently Intel limits it to 40 bit of page index, which is larger then u_long can hold.

sys/vm/vm_reserv.c
1433 ↗(On Diff #76265)

Interesting, I hadn't seen that yet.