The title says it all. Essentially, I copied this from amd64, and tweaked the comments.
Mark, can you please verify that this works as intended?
I'll test this and will try to get it committed upstream (I have a couple other issues to raise with upstream lld anyway). I'll defer to dim@ and emaste@ on whether it's ok to update our local import in the meantime.
Mark, as an aside, you'll note that nearby DefaultMaxPageSize is being initialized to 64 KB. This corresponds to the 64 KB page size implemented via ATTR_CONTIGUOUS that we discussed yesterday.
I'm totally fine with this, after you've tried it out. I assume there will be no ill effects, only some (theoretical?) performance gain. :)
Note that upstream will probably ask for some sort of test case.
This seems to do the trick:
[root@markj /usr/src]# readelf -l $(which clang)

Elf file type is EXEC (Executable file)
Entry point 0x12d0000
There are 9 program headers, starting at offset 64

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz             Flg    Align
  PHDR           0x0000000000000040 0x0000000000200040 0x0000000000200040
                 0x00000000000001f8 0x00000000000001f8 R      0x8
  LOAD           0x0000000000000000 0x0000000000200000 0x0000000000200000
                 0x00000000010c7ed8 0x00000000010c7ed8 R      0x10000
  LOAD           0x00000000010d0000 0x00000000012d0000 0x00000000012d0000
                 0x0000000002814404 0x0000000002814404 R E    0x10000
  LOAD           0x00000000038f0000 0x0000000003af0000 0x0000000003af0000
                 0x0000000000012530 0x0000000000296959 RW     0x10000
  TLS            0x0000000003900000 0x0000000003b00000 0x0000000003b00000
                 0x0000000000001800 0x0000000000001820 R      0x10
  GNU_RELRO      0x0000000003900000 0x0000000003b00000 0x0000000003b00000
                 0x0000000000002530 0x0000000000002530 R      0x1
  GNU_EH_FRAME   0x00000000010b9248 0x00000000012b9248 0x00000000012b9248
                 0x0000000000002cdc 0x0000000000002cdc R      0x1
  GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000000 RW     0
  NOTE           0x0000000000000238 0x0000000000200238 0x0000000000200238
                 0x0000000000000030 0x0000000000000030 R      0x4

 Section to Segment mapping:
  Segment Sections...
   00
   01     .note.tag .rodata .gcc_except_table .eh_frame_hdr .eh_frame
   02     .text .init .fini
   03     .data .tdata .tbss .ctors .dtors .jcr .fini_array .init_array .data.rel.ro .got .bss
   04     .tdata .tbss
   05     .tdata .tbss .ctors .dtors .jcr .fini_array .init_array .data.rel.ro .got
   06     .eh_frame_hdr
   07
   08     .note.tag
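For anyone who wants to double-check the readelf output above without eyeballing hex, here is a minimal Python sketch. It only restates the three LOAD headers from that output and checks that each file offset and virtual address is a multiple of 64 KB, and that each segment starts at the previous segment's end rounded up to 64 KB. The helper names are made up for illustration, and the "round up the previous end" rule is an observation about this one binary, not a statement of how lld computes the layout.

```python
PAGE = 0x10000  # 64 KB, the Align value shown for every LOAD header

# (file offset, virtual address, FileSiz) copied from the LOAD headers above.
LOADS = [
    (0x00000000, 0x00200000, 0x010C7ED8),  # R
    (0x010D0000, 0x012D0000, 0x02814404),  # R E
    (0x038F0000, 0x03AF0000, 0x00012530),  # RW
]


def align_up(value, alignment):
    """Round value up to the next multiple of alignment (a power of two)."""
    return (value + alignment - 1) & ~(alignment - 1)


def check_alignment(loads, page=PAGE):
    """Return one boolean per segment: True if it is page aligned and starts
    where the previous segment's end, rounded up to a page, would put it."""
    results = []
    prev = None
    for offset, vaddr, filesz in loads:
        ok = offset % page == 0 and vaddr % page == 0
        if prev is not None:
            p_offset, p_vaddr, p_filesz = prev
            # FileSiz equals MemSiz for the first two segments, so FileSiz
            # is good enough for computing where the predecessor ends.
            ok = ok and offset == align_up(p_offset + p_filesz, page)
            ok = ok and vaddr == align_up(p_vaddr + p_filesz, page)
        results.append(ok)
        prev = (offset, vaddr, filesz)
    return results


if __name__ == "__main__":
    print(check_alignment(LOADS))
```

All three segments pass, which matches the reading that the text and data segments are 64 KB aligned both in memory and within the file.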
I can't easily benchmark at the moment since one of the recent stand/ changes seems to
have made ThunderX hang after attempting to boot the kernel. Once I get that resolved
I'll try to get a measure of how much r336558 plus this change helps with build
I'd love to see 3 data points: (1) misaligned text per the unpatched lld, (2) patched lld, and (3) patched lld where you run "clang -v; dd if=/usr/bin/clang of=/dev/null" before the build.
Just so that folks understand what exactly this change will and won't do, let me share an example from amd64 of clang compiling sqlite3.c:
64775 0x200000 0x1577000 r-- 2135 10288 42 14 CNS- vn /usr/bin/cpp
64775 0x1577000 0x3ff3000 r-x 8124 10288 42 14 CNS- vn /usr/bin/cpp
64775 0x3ff3000 0x4000000 rw- 13 0 1 0 C--- vn /usr/bin/cpp
64775 0x4000000 0x4282000 rw- 106 106 1 0 ---- df
64775 0x803ff3000 0x803ff4000 r-- 1 1 60 0 ---- dv
64775 0x803ff4000 0x804000000 rw- 12 12 1 0 ---- df
64775 0x804000000 0x804800000 rw- 928 928 1 0 ---- df
64775 0x804800000 0x804f0d000 r-- 1805 1816 1 0 CNS- vn /local/HEAD/src/contrib/sqlite3/sqlite3.c
64775 0x804f0d000 0x804f15000 r-- 8 8 7 0 CN-- vn /local/HEAD/src/sys/sys/cdefs.h
64775 0x804f15000 0x8051fc000 rw- 724 724 1 0 ---- df
64775 0x8051fc000 0x805201000 r-- 5 5 7 0 CN-- vn /local/obj/local/HEAD/src/amd64.amd64/tmp/usr/include/stdio.h
64775 0x805201000 0x805545000 rw- 835 835 1 0 ---- df
64775 0x805545000 0x80554a000 r-- 5 5 7 0 CN-- vn /local/obj/local/HEAD/src/amd64.amd64/tmp/usr/include/unistd.h
64775 0x80554a000 0x810251000 rw- 44163 44163 1 0 --S- df
64775 0x7fffdfffe000 0x7fffdffff000 --- 0 0 0 0 ---- --
64775 0x7fffdffff000 0x7ffffffdf000 --- 0 0 0 0 ---- --
64775 0x7ffffffdf000 0x7ffffffff000 rw- 19 19 1 0 ---D df
64775 0x7ffffffff000 0x800000000000 r-x 1 1 61 0 ---- ph
Even with the image base aligned, the 2 MB region in which lld's new-fangled R/O section ends and the actual executable section begins will never be mapped as a superpage because of the mixed protections. Likewise, the 2 MB region in which the executable section ends and the R/W section begins will never be mapped as a superpage.
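To make the arithmetic concrete, here is a small Python sketch that counts how many 2 MB-aligned chunks fall entirely inside each of the three /usr/bin/cpp mappings from the listing above. The function names are hypothetical, and this only checks alignment and uniform protection. Whether a given chunk is actually promoted also depends on things the sketch ignores, such as whether all of its pages are resident.

```python
SUPERPAGE = 2 * 1024 * 1024  # 2 MB superpage size on amd64

# (start, end, protection) for the three /usr/bin/cpp mappings listed above.
MAPPINGS = [
    (0x200000, 0x1577000, "r--"),
    (0x1577000, 0x3FF3000, "r-x"),
    (0x3FF3000, 0x4000000, "rw-"),
]


def aligned_chunks(start, end, size=SUPERPAGE):
    """Start addresses of the size-aligned chunks lying wholly in [start, end)."""
    first = (start + size - 1) // size * size  # round the start up
    last = end // size * size                  # round the end down
    return list(range(first, last, size))      # empty when first >= last


def eligible_per_mapping(mappings):
    """Map each protection string to its count of fully covered 2 MB chunks."""
    return {prot: len(aligned_chunks(start, end)) for start, end, prot in mappings}


if __name__ == "__main__":
    for prot, count in eligible_per_mapping(MAPPINGS).items():
        print(prot, count)
```

By this count the R/O mapping covers 9 whole chunks and the executable mapping 20, while the R/W mapping covers none. The chunks at 0x1400000 and 0x3e00000 are the two that straddle a protection boundary and are therefore excluded.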
One of my group members at Rice has verified that lld is handling the maxpagesize option reasonably. Specifically, with maxpagesize set to 2 MB on the command line, lld will superpage align the start of the executable and R/W data sections. Moreover, these sections will be aligned within the executable file. Thus, hypothetically, we could map the entire R/O section and executable section, including the residual at the end of the section, as superpages by including the filler that lld placed between the sections as part of the mapping. I'm not advocating this as a default; it's just nice to know that it works.
As an aside, on AMD Ryzen, the boundary between the R/O and executable sections is less of an issue, because Ryzen automatically promotes, in hardware, eight 4 KB mappings into a single 32 KB mapping in the TLB. However, I haven't yet found out whether the instruction TLB also implements this feature. And, on arm, we could start using 64 KB page mappings ...
Of course only now I notice that this changes these alignments unconditionally for all platforms, e.g. for Linux too, and that can't be right (at least not without some form of checks by Linux people). For us it doesn't matter, but upstreaming this has just become a little harder.
amd64 already had the same change, done in the same way. I don't know who did it, and whether it was upstreamed.
The Linux people may not protest. On Linux, for some applications, people will mmap() anonymous, superpage-backed memory, literally bcopy the text section of the program into said memory, and then mremap() that memory in place of the original text mapping. Alternatively, if the hugetlbfs is configured, I believe that you can copy the executable file into the hugetlbfs and execute it from there. Both approaches still require superpage alignment by the linker to work.
I've finally started this (the issue fixed in r338538 took me some time). The difference between (2) and (3) is quite drastic. I tried 10 buildkernels for each, with a tmpfs objdir.
[root@markj /usr/src]# for i in $(seq 1 10); do time MAKEOBJDIRPREFIX=/mnt make -s -j96 buildkernel KERNCONF=GENERIC > /dev/null; done
real 4m17.623s  user 257m50.633s  sys 29m43.637s
real 4m9.036s   user 256m53.274s  sys 26m29.234s
real 4m12.738s  user 258m30.383s  sys 25m58.623s
real 4m9.473s   user 256m45.523s  sys 26m31.764s
real 4m9.740s   user 255m52.250s  sys 27m17.324s
real 4m14.311s  user 256m34.307s  sys 26m14.742s
real 4m11.281s  user 256m58.522s  sys 26m15.626s
real 4m12.312s  user 257m8.723s   sys 26m31.318s
real 4m14.251s  user 256m14.241s  sys 27m26.014s
real 4m9.822s   user 258m22.915s  sys 25m58.800s

[root@markj /usr/src]# dd if=/usr/bin/clang of=/dev/null
116758+1 records in
116758+1 records out
59780480 bytes transferred in 0.465699 secs (128367160 bytes/sec)

[root@markj /usr/src]# for i in $(seq 1 10); do time MAKEOBJDIRPREFIX=/mnt make -s -j96 buildkernel KERNCONF=GENERIC > /dev/null; done
real 3m12.018s  user 171m16.475s  sys 20m16.526s
real 3m16.206s  user 170m45.558s  sys 20m15.027s
real 3m13.500s  user 170m5.103s   sys 21m30.634s
real 3m7.786s   user 172m31.562s  sys 19m56.916s
real 3m13.529s  user 170m32.468s  sys 20m5.821s
real 3m9.776s   user 169m6.229s   sys 20m30.748s
real 3m13.377s  user 171m50.037s  sys 19m48.087s
real 3m14.374s  user 169m30.466s  sys 20m50.513s
real 3m10.696s  user 170m12.373s  sys 20m9.038s
real 3m14.567s  user 169m14.456s  sys 21m9.324s
For (1) we have:
[root@markj /usr/src]# for i in $(seq 1 10); do time MAKEOBJDIRPREFIX=/mnt make -s -j96 buildkernel KERNCONF=GENERIC > /dev/null; done
real 4m30.259s  user 271m13.182s  sys 31m30.211s
real 4m23.461s  user 271m43.929s  sys 27m36.849s
real 4m24.995s  user 272m29.959s  sys 27m16.874s
real 4m20.690s  user 274m21.102s  sys 27m51.578s
real 4m29.365s  user 273m33.744s  sys 27m35.246s
real 4m29.337s  user 271m38.234s  sys 28m2.824s
real 4m24.275s  user 270m14.715s  sys 28m22.154s
real 4m27.625s  user 271m7.266s   sys 28m28.089s
real 4m22.187s  user 271m12.794s  sys 28m49.765s
real 4m18.496s  user 272m3.653s   sys 28m18.414s
In (1), we will have zero code segment superpage mappings. The reduction in run time between (1) and (2) suggests that there are some "naturally-occurring" superpage mappings once we have properly aligned the code. This is consistent with what I see on amd64. The effect is just larger. To close the gap between (2) and (3), we'll have to introduce heuristics that will do the extra page-ins, even though the pages are not a part of a normally accessed chunk of the address space.
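For a rough sense of scale, the following Python sketch averages the "real" times from the three sets of buildkernel runs posted above. The timings are copied from those runs. The labels and helper names are just for illustration, and a plain mean over ten runs is a simplification, not a proper statistical comparison.

```python
import re
from statistics import mean

# "real" wall-clock times copied from the three sets of ten runs above.
RUNS = {
    "(1) unpatched lld": (
        "4m30.259s 4m23.461s 4m24.995s 4m20.690s 4m29.365s "
        "4m29.337s 4m24.275s 4m27.625s 4m22.187s 4m18.496s"
    ),
    "(2) patched lld": (
        "4m17.623s 4m9.036s 4m12.738s 4m9.473s 4m9.740s "
        "4m14.311s 4m11.281s 4m12.312s 4m14.251s 4m9.822s"
    ),
    "(3) patched lld, after dd": (
        "3m12.018s 3m16.206s 3m13.500s 3m7.786s 3m13.529s "
        "3m9.776s 3m13.377s 3m14.374s 3m10.696s 3m14.567s"
    ),
}


def to_seconds(stamp):
    """Convert a time(1)-style value such as '4m17.623s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", stamp)
    if match is None:
        raise ValueError(f"unrecognized time value: {stamp!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


def mean_real_seconds(runs):
    """Return the mean wall-clock time in seconds for each labelled set."""
    return {
        label: mean(to_seconds(stamp) for stamp in stamps.split())
        for label, stamps in runs.items()
    }


if __name__ == "__main__":
    for label, seconds in mean_real_seconds(RUNS).items():
        print(f"{label}: {seconds:.1f} s")
```

On these numbers the means come out to roughly 265 s, 252 s, and 193 s, so alignment alone buys about 5%, and having the whole binary resident buys about another 24% on top of that.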
Has anyone looked at armv7, where I believe lld is now the default? My impression is that the default image base there is 64 KB. If so, I would argue for updating it before we roll 12.0.