Make the default image base on AArch64 and i386 superpage-aligned
Closed, Public

Authored by alc on Jul 21 2018, 5:38 PM.

Details

Summary

The title says it all. Essentially, I copied this from amd64, and tweaked the comments.

Test Plan

Mark, can you please verify that this works as intended?

Event Timeline

markj added reviewers: dim, emaste.

I'll test this and will try to get it committed upstream (I have a couple other issues to raise with upstream lld anyway). I'll defer to dim@ and emaste@ on whether it's ok to update our local import in the meantime.

This revision is now accepted and ready to land. Jul 21 2018, 5:42 PM

Mark, as an aside, you'll note that nearby DefaultMaxPageSize is being initialized to 64 KB. This corresponds to the 64 KB page size implemented via ATTR_CONTIGUOUS that we discussed yesterday.

I'll test this and will try to get it committed upstream (I have a couple other issues to raise with upstream lld anyway). I'll defer to dim@ and emaste@ on whether it's ok to update our local import in the meantime.

Thanks.

I'm totally fine with this, after you've tried it out. I assume there will be no ill effects, only some (theoretical?) performance gain. :)

Note that upstream will probably ask for some sort of test case.

This seems to do the trick:

[root@markj /usr/src]# readelf -l $(which clang)

Elf file type is EXEC (Executable file)
Entry point 0x12d0000
There are 9 program headers, starting at offset 64

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flg    Align
  PHDR           0x0000000000000040 0x0000000000200040 0x0000000000200040
                 0x00000000000001f8 0x00000000000001f8  R      0x8
  LOAD           0x0000000000000000 0x0000000000200000 0x0000000000200000
                 0x00000000010c7ed8 0x00000000010c7ed8  R      0x10000
  LOAD           0x00000000010d0000 0x00000000012d0000 0x00000000012d0000
                 0x0000000002814404 0x0000000002814404  R E    0x10000
  LOAD           0x00000000038f0000 0x0000000003af0000 0x0000000003af0000
                 0x0000000000012530 0x0000000000296959  RW     0x10000
  TLS            0x0000000003900000 0x0000000003b00000 0x0000000003b00000
                 0x0000000000001800 0x0000000000001820  R      0x10
  GNU_RELRO      0x0000000003900000 0x0000000003b00000 0x0000000003b00000
                 0x0000000000002530 0x0000000000002530  R      0x1
  GNU_EH_FRAME   0x00000000010b9248 0x00000000012b9248 0x00000000012b9248
                 0x0000000000002cdc 0x0000000000002cdc  R      0x1
  GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000000  RW     0
  NOTE           0x0000000000000238 0x0000000000200238 0x0000000000200238
                 0x0000000000000030 0x0000000000000030  R      0x4

 Section to Segment mapping:
  Segment Sections...
   00
   01     .note.tag .rodata .gcc_except_table .eh_frame_hdr .eh_frame
   02     .text .init .fini
   03     .data .tdata .tbss .ctors .dtors .jcr .fini_array .init_array .data.rel.ro .got .bss
   04     .tdata .tbss
   05     .tdata .tbss .ctors .dtors .jcr .fini_array .init_array .data.rel.ro .got
   06     .eh_frame_hdr
   07
   08     .note.tag

I can't easily benchmark at the moment since one of the recent stand/ changes seems to
have made ThunderX hang after attempting to boot the kernel. Once I get that resolved
I'll try to get a measure of how much r336558 plus this change helps with build
performance.
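
As a rough cross-check of the readelf output above, the following sketch tests each LOAD segment's virtual address against a 2 MB superpage boundary. The 2 MB superpage size and the helper name `misalignment` are assumptions made for illustration; the addresses are copied from the program headers shown above.

```python
# Sketch: how far is each LOAD segment's start from a superpage boundary?
# Assumption: 2 MB superpages (AArch64 with a 4 KB translation granule).
SUPERPAGE = 2 * 1024 * 1024

# (p_vaddr, p_align) of the three LOAD headers in the readelf output above.
LOAD_SEGMENTS = [
    (0x0000000000200000, 0x10000),  # R   (read-only data)
    (0x00000000012d0000, 0x10000),  # R E (text)
    (0x0000000003af0000, 0x10000),  # RW  (data)
]


def misalignment(vaddr, size=SUPERPAGE):
    """Return the offset of vaddr past the previous size-aligned boundary."""
    return vaddr % size


for vaddr, align in LOAD_SEGMENTS:
    off = misalignment(vaddr)
    status = "superpage-aligned" if off == 0 else f"off by {off:#x}"
    print(f"vaddr {vaddr:#x} (p_align {align:#x}): {status}")
```

Only the first segment, which starts at the image base, lands on a 2 MB boundary; the later segments are aligned to the 64 KB `p_align` value only.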

Does anyone know for sure what the default image base is on armv7 and i386? My impression from glancing at the source code is that the default is 64 KB.

In D16385#351461, @alc wrote:

Does anyone know for sure what the default image base is on armv7 and i386? My impression from glancing at the source code is that the default is 64 KB.

Looking at executables on i386, it appears to be 0x10000, so 64KB.

The aforementioned ThunderX issue is resolved now; benchmarking this change is still on my todo list.

In D16385#351461, @alc wrote:

Does anyone know for sure what the default image base is on armv7 and i386? My impression from glancing at the source code is that the default is 64 KB.

Looking at executables on i386, it appears to be 0x10000, so 64KB.

The aforementioned ThunderX issue is resolved now; benchmarking this change is still on my todo list.

I'd love to see 3 data points: (1) misaligned text per the unpatched lld, (2) patched lld, and (3) patched lld where you run "clang -v; dd if=/usr/bin/clang of=/dev/null" before the build.

In D16385#351461, @alc wrote:

Does anyone know for sure what the default image base is on armv7 and i386? My impression from glancing at the source code is that the default is 64 KB.

Looking at executables on i386, it appears to be 0x10000, so 64KB.

If the switch to lld on i386 is changing the default image base, then we might as well be changing it to a superpage-aligned address.

In D16385#351802, @alc wrote:

If the switch to lld on i386 is changing the default image base, then we might as well be changing it to a superpage-aligned address.

It appears that right now we get 0x08000000 with -fuse-ld=bfd (2.17.50) and 0x00010000 with lld 6.0.0, and we ought to make the suggested change on i386.
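
To make the comparison concrete, here is a small sketch that checks the two i386 image bases quoted above against the i386 superpage sizes. The sizes used (2 MB with PAE, 4 MB without) and the helper name `is_aligned` are assumptions added for illustration, not something stated in this review.

```python
# Sketch: are the observed i386 image bases superpage-aligned?
# Assumption: i386 superpages are 2 MB with PAE and 4 MB without PAE.
SUPERPAGE_SIZES = {
    "2 MB (PAE)": 2 * 1024 * 1024,
    "4 MB (non-PAE)": 4 * 1024 * 1024,
}

# Image bases reported in the comment above.
IMAGE_BASES = {
    "ld.bfd 2.17.50": 0x08000000,
    "lld 6.0.0": 0x00010000,
}


def is_aligned(addr, size):
    """True if addr is a multiple of size."""
    return addr % size == 0


for linker, base in IMAGE_BASES.items():
    for label, size in SUPERPAGE_SIZES.items():
        verdict = "aligned" if is_aligned(base, size) else "not aligned"
        print(f"{linker}: base {base:#010x} is {verdict} to {label}")
```

Under these assumptions the old ld.bfd base is aligned for both superpage sizes, while the 64 KB lld base is aligned for neither.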

alc retitled this revision from Make the default image base on AArch64 superpage-aligned to Make the default image base on AArch64 and i386 superpage-aligned.

Extend the patch to i386.

This revision now requires review to proceed. Aug 3 2018, 7:05 PM

This is fine with me, and I have no objection to it going into FreeBSD first and @dim or @markj or me working on upstream.

This revision is now accepted and ready to land. Aug 3 2018, 7:13 PM

There is a new regression causing my thunderx to fail to mount root. :( I'm trying to track that down now, and will get to testing after that.

Just so that folks understand what exactly this change will and won't do, let me share an example from amd64 of clang compiling sqlite3.c:

64775           0x200000          0x1577000 r-- 2135 10288  42  14 CNS- vn /usr/bin/cpp
64775          0x1577000          0x3ff3000 r-x 8124 10288  42  14 CNS- vn /usr/bin/cpp
64775          0x3ff3000          0x4000000 rw-   13    0   1   0 C--- vn /usr/bin/cpp
64775          0x4000000          0x4282000 rw-  106  106   1   0 ---- df
64775        0x803ff3000        0x803ff4000 r--    1    1  60   0 ---- dv
64775        0x803ff4000        0x804000000 rw-   12   12   1   0 ---- df
64775        0x804000000        0x804800000 rw-  928  928   1   0 ---- df
64775        0x804800000        0x804f0d000 r-- 1805 1816   1   0 CNS- vn /local/HEAD/src/contrib/sqlite3/sqlite3.c
64775        0x804f0d000        0x804f15000 r--    8    8   7   0 CN-- vn /local/HEAD/src/sys/sys/cdefs.h
64775        0x804f15000        0x8051fc000 rw-  724  724   1   0 ---- df
64775        0x8051fc000        0x805201000 r--    5    5   7   0 CN-- vn /local/obj/local/HEAD/src/amd64.amd64/tmp/usr/include/stdio.h
64775        0x805201000        0x805545000 rw-  835  835   1   0 ---- df
64775        0x805545000        0x80554a000 r--    5    5   7   0 CN-- vn /local/obj/local/HEAD/src/amd64.amd64/tmp/usr/include/unistd.h
64775        0x80554a000        0x810251000 rw- 44163 44163   1   0 --S- df
64775     0x7fffdfffe000     0x7fffdffff000 ---    0    0   0   0 ---- --
64775     0x7fffdffff000     0x7ffffffdf000 ---    0    0   0   0 ---- --
64775     0x7ffffffdf000     0x7ffffffff000 rw-   19   19   1   0 ---D df
64775     0x7ffffffff000     0x800000000000 r-x    1    1  61   0 ---- ph

Even with the image base aligned, the 2 MB region in which lld's new-fangled R/O section ends and the actual executable section begins will never be mapped as a superpage, because of the mixed protections. The same holds for the 2 MB region in which the executable section ends and the R/W section begins.
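
The point about mixed protections can be illustrated by counting, for each of the three /usr/bin/cpp mappings listed above, how many 2 MB-aligned chunks lie entirely inside it. This is only a sketch: the helper `full_superpages` is hypothetical, and falling inside a single mapping is a necessary condition for superpage promotion, not a sufficient one.

```python
# Sketch: count the 2 MB-aligned chunks fully contained in each mapping.
SUPERPAGE = 2 * 1024 * 1024  # 2 MB superpage on amd64

# (start, end, protection) of the /usr/bin/cpp mappings listed above.
MAPPINGS = [
    (0x0200000, 0x1577000, "r--"),
    (0x1577000, 0x3ff3000, "r-x"),
    (0x3ff3000, 0x4000000, "rw-"),
]


def full_superpages(start, end, size=SUPERPAGE):
    """Number of size-aligned chunks lying entirely within [start, end)."""
    first = -(-start // size) * size  # round start up to a boundary
    last = (end // size) * size       # round end down to a boundary
    return max(0, (last - first) // size)


for start, end, prot in MAPPINGS:
    n = full_superpages(start, end)
    print(f"{prot} {start:#x}-{end:#x}: {n} candidate superpage(s)")
```

The chunks that straddle the r--/r-x and r-x/rw- boundaries are counted for neither neighbour, which is the loss described above.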

One of my group members at Rice has verified that lld is handling the maxpagesize option reasonably. Specifically, with maxpagesize set to 2 MB on the command line, lld will superpage align the start of the executable and R/W data sections. Moreover, these sections will be aligned within the executable file. Thus, hypothetically, we could map the entire R/O section and executable section, including the residual at the end of the section, as superpages by including the filler that lld placed between the sections as part of the mapping. I'm not advocating this as a default; it's just nice to know that it works.
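
A deliberately simplified model of that behaviour is sketched below: every segment start is rounded up to the maximum page size. This is not lld's actual layout algorithm. The function names are hypothetical, and the segment sizes are taken from the /usr/bin/cpp mappings shown earlier purely for illustration.

```python
# Simplified model (NOT lld's real algorithm): each segment start is rounded
# up to max_page_size, leaving filler between consecutive segments.

def round_up(addr, align):
    """Round addr up to the next multiple of align."""
    return (addr + align - 1) // align * align


def layout(image_base, sizes, max_page_size):
    """Return the start address of each segment, laid out back to back."""
    starts = []
    addr = image_base
    for size in sizes:
        addr = round_up(addr, max_page_size)
        starts.append(addr)
        addr += size
    return starts


# Sizes of the r--, r-x and rw- mappings of /usr/bin/cpp shown earlier.
SIZES = [0x1577000 - 0x200000, 0x3ff3000 - 0x1577000, 0x4000000 - 0x3ff3000]

for label, page in (("4 KB", 0x1000), ("2 MB", 0x200000)):
    starts = layout(0x200000, SIZES, page)
    print(label, [hex(s) for s in starts])
```

With a 4 KB maximum page size the model reproduces the observed starts; with 2 MB, the executable and R/W segments each begin on a superpage boundary, at the cost of filler between segments.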

As an aside, on AMD Ryzen, the boundary between the R/O and executable sections is less of an issue, because Ryzen automatically promotes, in hardware, eight 4 KB mappings into a single 32 KB mapping in the TLB. However, I haven't yet found out whether the instruction TLB implements this feature as well. And, on arm, we could start using 64 KB page mappings ...

This revision was automatically updated to reflect the committed changes.

Of course only now I notice that this changes these alignments unconditionally for all platforms, e.g. for Linux too, and that can't be right (at least not without some form of checks by Linux people). For us it doesn't matter, but upstreaming this has just become a little harder.

In D16385#352474, @dim wrote:

Of course only now I notice that this changes these alignments unconditionally for all platforms, e.g. for Linux too, and that can't be right (at least not without some form of checks by Linux people). For us it doesn't matter, but upstreaming this has just become a little harder.

amd64 already had the same change, done in the same way. I don't know who did it, and whether it was upstreamed.

The Linux people may not protest. On Linux, for some applications, people will mmap() anonymous, superpage-backed memory, literally bcopy the text section of the program into said memory, and then mremap() that memory in place of the original text mapping. Alternatively, if the hugetlbfs is configured, I believe that you can copy the executable file into the hugetlbfs and execute it from there. Both approaches still require superpage alignment by the linker to work.

In D16385#351470, @alc wrote:

I'd love to see 3 data points: (1) misaligned text per the unpatched lld, (2) patched lld, and (3) patched lld where you run "clang -v; dd if=/usr/bin/clang of=/dev/null" before the build.

I've finally started this (the issue fixed in r338538 took me some time). The difference between (2) and (3) is quite drastic. I tried 10 buildkernels for each, with a tmpfs objdir.

[root@markj /usr/src]# for i in $(seq 1 10); do time MAKEOBJDIRPREFIX=/mnt make -s -j96 buildkernel KERNCONF=GENERIC > /dev/null; done
                   
real    4m17.623s
user    257m50.633s
sys     29m43.637s

real    4m9.036s
user    256m53.274s
sys     26m29.234s

real    4m12.738s
user    258m30.383s
sys     25m58.623s

real    4m9.473s
user    256m45.523s
sys     26m31.764s

real    4m9.740s
user    255m52.250s
sys     27m17.324s

real    4m14.311s
user    256m34.307s
sys     26m14.742s

real    4m11.281s
user    256m58.522s
sys     26m15.626s

real    4m12.312s
user    257m8.723s
sys     26m31.318s

real    4m14.251s
user    256m14.241s
sys     27m26.014s

real    4m9.822s
user    258m22.915s
sys     25m58.800s 
[root@markj /usr/src]# dd if=/usr/bin/clang of=/dev/null
116758+1 records in
116758+1 records out
59780480 bytes transferred in 0.465699 secs (128367160 bytes/sec)
[root@markj /usr/src]# for i in $(seq 1 10); do time MAKEOBJDIRPREFIX=/mnt make -s -j96 buildkernel KERNCONF=GENERIC > /dev/null; done

real    3m12.018s
user    171m16.475s
sys     20m16.526s

real    3m16.206s
user    170m45.558s
sys     20m15.027s

real    3m13.500s
user    170m5.103s 
sys     21m30.634s

real    3m7.786s
user    172m31.562s
sys     19m56.916s

real    3m13.529s
user    170m32.468s
sys     20m5.821s

real    3m9.776s
user    169m6.229s
sys     20m30.748s

real    3m13.377s
user    171m50.037s
sys     19m48.087s


real    3m14.374s
user    169m30.466s
sys     20m50.513s

real    3m10.696s
user    170m12.373s
sys     20m9.038s

real    3m14.567s
user    169m14.456s
sys     21m9.324s

For (1) we have:

[root@markj /usr/src]# for i in $(seq 1 10); do time MAKEOBJDIRPREFIX=/mnt make -s -j96 buildkernel KERNCONF=GENERIC > /dev/null; done

real    4m30.259s
user    271m13.182s
sys     31m30.211s

real    4m23.461s
user    271m43.929s
sys     27m36.849s

real    4m24.995s
user    272m29.959s
sys     27m16.874s

real    4m20.690s
user    274m21.102s
sys     27m51.578s

real    4m29.365s
user    273m33.744s
sys     27m35.246s

real    4m29.337s
user    271m38.234s
sys     28m2.824s

real    4m24.275s
user    270m14.715s
sys     28m22.154s

real    4m27.625s
user    271m7.266s
sys     28m28.089s

real    4m22.187s
user    271m12.794s
sys     28m49.765s

real    4m18.496s
user    272m3.653s
sys     28m18.414s

In (1), we will have zero code segment superpage mappings. The reduction in run time between (1) and (2) suggests that there are some "naturally-occurring" superpage mappings once we have properly aligned the code. This is consistent with what I see on amd64. The effect is just larger. To close the gap between (2) and (3), we'll have to introduce heuristics that will do the extra page-ins, even though the pages are not a part of a normally accessed chunk of the address space.
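
To put rough numbers on the three configurations, the sketch below averages the `real` times from the transcripts above. Mapping the first transcript to (2) and the one after the `dd` to (3) is inferred from the surrounding comments, and the helper `to_seconds` is hypothetical.

```python
# Sketch: mean wall-clock time per configuration, from the "real" lines above.
from statistics import mean


def to_seconds(stamp):
    """Convert a time(1)-style stamp such as '4m17.623s' to seconds."""
    minutes, seconds = stamp.rstrip("s").split("m")
    return int(minutes) * 60 + float(seconds)


REAL_TIMES = {
    "(1) unpatched lld": [
        "4m30.259s", "4m23.461s", "4m24.995s", "4m20.690s", "4m29.365s",
        "4m29.337s", "4m24.275s", "4m27.625s", "4m22.187s", "4m18.496s",
    ],
    "(2) patched lld": [
        "4m17.623s", "4m9.036s", "4m12.738s", "4m9.473s", "4m9.740s",
        "4m14.311s", "4m11.281s", "4m12.312s", "4m14.251s", "4m9.822s",
    ],
    "(3) patched lld + dd": [
        "3m12.018s", "3m16.206s", "3m13.500s", "3m7.786s", "3m13.529s",
        "3m9.776s", "3m13.377s", "3m14.374s", "3m10.696s", "3m14.567s",
    ],
}

MEANS = {name: mean(to_seconds(t) for t in runs)
         for name, runs in REAL_TIMES.items()}

baseline = MEANS["(1) unpatched lld"]
for name, avg in MEANS.items():
    change = (avg - baseline) / baseline * 100
    print(f"{name}: mean real {avg:.1f} s ({change:+.1f}% vs. (1))")
```

Means over ten runs hide run-to-run variance, so treat the percentages as indicative only.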

Has anyone looked at armv7, where I believe lld is now the default? My impression is that the default image base there is 64 KB. If so, I would argue for updating it before we roll 12.0.