arm64: support the ATTR_CONTIGUOUS L3C page size in pagesizes[]
ClosedPublic

Authored by alc on Jun 28 2024, 4:04 AM.
Tags
None
Referenced Files
F107739562: D45766.diff
Fri, Jan 17, 9:47 PM

Details

Summary

Update pagesizes[] to include the ATTR_CONTIGUOUS L3C page size, which is 64KB when the base page size is 4KB and 2MB when the base page size is 16KB.

Add support for L3C pages to shm_create_largepage().

Add support for creating L3C page mappings to pmap_enter(psind=1).

Add support for reporting L3C page mappings to mincore(2) and procstat(8), for example,

978     0x7b0b27200000     0x7b0b27398000 r--  280  768  26  10 CNS-- vn /lib/libcrypto.so.30
978     0x7b0b27398000     0x7b0b273a8000 ---    0    0   0   0 CN--- gd 
978     0x7b0b273a8000     0x7b0b275b2000 r-x  386  768  26  10 CNS-- vn /lib/libcrypto.so.30
978     0x7b0b275b2000     0x7b0b275c1000 ---    0    0   0   0 CN--- gd 
978     0x7b0b275c1000     0x7b0b27613000 r--   82    0   2   0 CN--- vn /lib/libcrypto.so.30
978     0x7b0b27613000     0x7b0b27622000 ---    0    0   0   0 CN--- gd 
978     0x7b0b27622000     0x7b0b2762a000 rw-    8    0   2   1 CN--- vn /lib/libcrypto.so.30
978     0x7b0b2762a000     0x7b0b2762c000 rw-    2    2   2   0 CN--- sw

Update vm_fault_soft_fast() and vm_fault_populate() to handle multiple superpage sizes. Consequently, some L3C promotions are converted to L3C mappings created by pmap_enter(psind=1).

Declare arm64 as supporting two superpage reservation sizes, and simulate two superpage reservation sizes, updating the vm page's psind field. (The next patch will replace this simulation. This patch is already big enough.)

Co-authored-by: Eliot Solomon <ehs3@rice.edu>

Event Timeline

alc requested review of this revision. Jun 28 2024, 4:04 AM

Fix PMAP_ENTER_LARGEPAGE

libexec/rtld-elf/map_object.c
213

I do not understand this change. Why do we need it?

In particular, if p_offset is not zero, why do we care whether it is far enough to cover a superpage?

sys/kern/imgact_elf.c
1364

Why is this a reasonable check? If MAXPAGESIZES > VM_NRESERVLEVEL, the formula selects the max pagesizes index for which there are reservations. But what if, e.g., MAXPAGESIZES == VM_NRESERVLEVEL on some arch? Why fall back to the smallest page alignment?

sys/kern/uipc_shm.c
1594

Again, I do not understand the condition. shm largepages should not depend on the reservation levels the kernel provides. It is purely about pmap knowing how to construct PTE for given page size.

For instance, on arm64, which can stop the page table walk at any level, nothing except the need for some pmap code prevents us from implementing largepages that use PTEs at higher levels of the page tables. The same holds for amd64 if it ever starts providing 512G PTEs (at level 4).

libexec/rtld-elf/map_object.c
213

Consider, for example, libc.so, which has a file size less than 2MB. However, the guard size exceeds 2MB because of libc.so's bss section. rtld_round_page(segs[0]->p_filesz) >= pagesizes[1] (64KB) will be true. If that were the only requirement (aside from npagesizes > 1), then we would pointlessly 2MB align the guard even though the file size is too small to later allocate a 2MB reservation for caching the file contents. (In fact, we saw this happening before the new requirement was introduced.) In contrast, libcrypto.so meets both the old and new requirements, and as shown in the summary, we allocate a 2MB reservation and map parts of the R/O data and code with 64KB pages.

P.S. As part of the next big patch that introduces real 2-level reservations, we will add another condition that requests pagesizes[1] (64KB) alignment for files that are less than pagesizes[2] but greater than pagesizes[1]. So, libc.so has 64KB reservations allocated and 64KB page mappings.

sys/kern/imgact_elf.c
1364

MAXPAGESIZES always includes pagesizes[0], but VM_NRESERVLEVEL does not. So, they should never be equal in any sane configuration. Specifically, that would imply that we have a reservation size that is larger than the largest page size. That said, we could eliminate MAXPAGESIZES > VM_NRESERVLEVEL, and instead of pagesizes[VM_NRESERVLEVEL] write pagesizes[MIN(MAXPAGESIZES - 1, VM_NRESERVLEVEL)].

sys/kern/uipc_shm.c
1594

At first, I found this code puzzling too. In the end, we simply sought to maintain the current behavior: MAP_ALIGNED_SUPER relies on pmap_align_superpage(), which only implements alignment for page sizes that match reservation sizes, which previously was only 2MB on amd64 and arm64 (or 32MB on arm64 if the base page size is 16KB). In other words, on neither amd64 nor arm64, does pmap_align_superpage() implement alignment for 1GB pages, which are supported by shm_create_largepage() on both architectures. I inferred that this is why the current code rejects the use of MAP_ALIGNED_SUPER if shm_lp_psind is not 1, i.e., what used to be 2MB on arm64. With this change, we still reject the use of MAP_ALIGNED_SUPER in conjunction with 1GB pages, but not 64KB or 2MB.

P.S. If no explicit alignment directive, e.g., MAP_ALIGNED_SUPER, is passed to shm_mmap_large(), then it automatically requests alignment based on shm_lp_psind.

libexec/rtld-elf/map_object.c
213

Let me add that I expect p_offset for nsegs to be non-zero.

libexec/rtld-elf/map_object.c
213

I still do not understand it. First, you ignore the bss-like segments, which might be larger than p_filesz. Second, I suspect that what you want there is to compare the total mapping size against pagesizes[XXX], but then it is just mapsize?

sys/kern/imgact_elf.c
1364

Yes please. For me, the last formula looks reasonable.

sys/kern/uipc_shm.c
1594

I see, thank you for the explanation.

libexec/rtld-elf/map_object.c
213

In fact, excluding bss is exactly what I am aiming to do. As I described above, using libc.so as an example, the current code (without this change) will pointlessly 2MB align the start of libc.so. Using mapsize in this change would do the same, because mapsize includes the bss size.

Without this change, here is an example of what was happening:

604     0x7bbc00400000     0x7bbc00487000 r--   95  392  70  28 CN--- vn /lib/libc.so.7
604     0x7bbc00487000     0x7bbc00496000 ---    0    0   0   0 CN--- gd 
604     0x7bbc00496000     0x7bbc005ca000 r-x  268  392  70  28 CN--- vn /lib/libc.so.7
604     0x7bbc005ca000     0x7bbc005d9000 ---    0    0   0   0 CN--- gd 
604     0x7bbc005d9000     0x7bbc005e3000 r--   10    0   5   0 CN--- vn /lib/libc.so.7
604     0x7bbc005e3000     0x7bbc005f2000 ---    0    0   0   0 CN--- gd 
604     0x7bbc005f2000     0x7bbc005f9000 rw-    7    0   1   0 C---- vn /lib/libc.so.7
604     0x7bbc005f9000     0x7bbc0081b000 rw-   18    0   1   0 C---- sw

The 2MB alignment here accomplishes nothing, except to waste address space and page table memory, because the underlying VM objects are all less than 2MB in size. The decision to 2MB align here should really only be based on file size, not the overall mapsize.

Moreover, the 2MB alignment here doesn't help in terms of creating more superpages under the bss either.

libexec/rtld-elf/map_object.c
213

Let me be more precise in making my last point: In general, 2MB alignment for the overall mapping, i.e., the guard, does not guarantee that the bss will be optimally aligned for potential 2MB superpages. The outcome with respect to potential 2MB superpages for the bss is going to be a matter of happenstance. For a given bss start and size, a non-aligned 2MB overall mapping might be lucky, and an aligned 2MB overall mapping unlucky. In other words, we can't really achieve a deterministically better result for the bss based on 2MB alignment for the overall mapping.

sys/vm/vm_map.c
1998–2005

@kib Could you please explain the reasoning behind these values? I'm wondering if that reasoning leads to a smaller value for 64KB pages, rather than duplicating the value for 2MB pages.

libexec/rtld-elf/map_object.c
213

I still do not understand the formula. For instance, p_offset is the file offset, and not VA offset, of the first byte of the segment. This is why I talked about mapsize.

Could you, please, add a comment explaining the purpose of the check? Then, perhaps we can work something out from that.

sys/vm/vm_map.c
1998–2005

The values were chosen to provide a reasonably compact layout for mappings, while still providing the number of random bits in the addresses that satisfied some ASLR verification checks.

I do not remember what the programs to measure the entropy from the mapping addresses were, but I am almost sure that it was emaste@ who ran them.

Simplify pmap_enter_largepage()'s handling of L3C pages.

_Static_assert that MAXPAGESIZES is greater than VM_NRESERVLEVEL so as to simplify some if statements.

Maintain the same values for aslr_pages_rnd_*[] when VM_NRESERVLEVEL == 1, such as amd64.

libexec/rtld-elf/map_object.c
213

The objective is simple: To not 2MB align the start of libc.so's mappings, because doing so is pointless. In general, 2MB alignment for libraries whose code and initialized data segments are together less than 2MB will not deterministically yield the possibility of more superpage mappings. It is simply a waste of address space. On the other hand, I do want the start of libcrypto.so's mappings to be 2MB aligned, because as shown in the summary with this change we will now get some 64KB page mappings on both the code and read-only data. Both libc.so and libcrypto.so have mapsizes greater than 2MB. However, only libcrypto.so comes from a file that is greater than 2MB in size, and thus is eligible to have a 2MB reservation allocated to it. That is why I am examining the file offset and size for the last segment.

Previously, we would not have 2MB aligned either libc.so or libcrypto.so because the size of the initial read-only data segment was not greater than 2MB. Now, pagesizes[1] is only 64KB, not 2MB. So, without the new condition based on the file, we would pass MAP_ALIGNED_SUPER for both, and since they both have mapsizes greater than 2MB, both would be 2MB aligned, libc.so pointlessly so.

That said, this exact change to rtld is transitional. When we have the next big patch that introduces "real" 64KB reservations to vm_reserv.c, we do want libc.so to be aligned, but only 64KB aligned. I'm testing that version of the rtld change together with the rest of this patch. If that testing doesn't yield any inexplicable results, I will introduce it here. Maybe it will be easier to understand.

sys/kern/imgact_elf.c
1364

After thinking this over some more, I'm going to argue for simply having a _Static_assert that MAXPAGESIZES is greater than VM_NRESERVLEVEL.

sys/kern/uipc_shm.c
1594

I'm going to add a comment here.

sys/vm/vm_map.c
1998–2005

@emaste What were these ASLR verification tests?

sys/vm/vm_map.c
1998–2005

I'd hesitate to call them verification tests :)

The original tool of interest here was paxtest, ported to FreeBSD in https://github.com/opntr/paxtest-freebsd. It provides a reported number of bits of entropy, although it's going to be inaccurate if the distribution is not uniform.

There's also https://personales.upv.es/iripoll/aslr_main.html from the folks behind ASLR-NG but AFAIK the code is not available.

libexec/rtld-elf/map_object.c
213

I suggest handling it differently then. Consider calculating a segment with the largest p_filesz when we loop over the segments, before creating the placeholder mapping. After the max p_filesz is known, code can correctly detect whether it would benefit from superpage promotions.

Make vm_map_find() a bit smarter about how much extra space to search for when performing ASLR and either VMFS_SUPER_SPACE or VMFS_OPTIMAL_SPACE is specified.

Add a couple comments.

alc edited the summary of this revision.

Remove what was a transitional (and confusing) rtld change. The proper change will be included in the next big patch that introduces real two-level reservation support. The removed change only sought to avoid the pointless 2MB alignment of libraries, like libc.so, that are too small to benefit from 2MB alignment.

Simplify the vm_fault_soft_fast() and vm_fault_populate() changes.

@kib Since I removed the rtld change, your last comment on that portion of the change no longer appears inline with the patch, so let me address it here.

You wrote, "Consider calculating a segment with the largest p_filesz when we loop over the segments, before creating the placeholder mapping. After the max p_filesz is known, code can correctly detect whether it would benefit from superpage promotions."

We need to handle the first versus subsequent segments differently. For the subsequent segments, we also need to take into consideration p_offset and p_align. The p_filesz might equal or exceed a particular superpage size, but because of p_offset the virtual and physical alignment won't allow a superpage mapping to be created. Consider, for example, a segment with p_filesz == 2M but p_offset is not a multiple of 2M. More generally, for the subsequent segments, we can rarely create (super)page mappings larger than p_align because the virtual and physical alignments don't match. This is due to ELF's file space saving feature of sharing (and thus mapping twice in virtual memory) the portion of the file, i.e., cached physical page, wherein the boundary between the segments resides. Moreover, the misalignment increases with each new segment. In contrast, the first segment can support page sizes greater than p_align. For example, on arm64, where p_align is 64KB, for clang we wind up with some 2MB page mappings within the initial R/O segment, but the subsequent text segment only winds up with 64KB page mappings, even though there is at least one fully populated 2MB reservation/physical page. (We could, of course, "fix" this by linking clang with maxpagesize 2MB.)

Finally, rtld needs to stop using MAP_ALIGNED_SUPER for the placeholder mapping, because it leads to overalignment, e.g., 2MB when 64KB is all we can benefit from. We need to start using MAP_ALIGNED() instead, based on the page size that we compute from looping over the segments.

Are there any other comments or questions about this patch?

kib added inline comments.
sys/arm64/arm64/pmap.c
4979

Perhaps, for each explicit psind == n comparison, this code should assert that pagesizes[n] is of the expected size.

Imagine somebody added yet another intermediate page size and did not correct this if() series.

This revision is now accepted and ready to land. Jul 13 2024, 4:00 AM

Add requested KASSERT()s.

This revision now requires review to proceed. Jul 13 2024, 8:08 AM
alc marked an inline comment as done. Jul 13 2024, 8:08 AM
This revision is now accepted and ready to land. Jul 14 2024, 12:26 AM