Implement support for pmap_enter(psind == 1) for arm64.
ClosedPublic

Authored by markj on Jul 17 2018, 6:06 PM.

Diff Detail

Repository
rS FreeBSD src repository - subversion

Event Timeline

markj added reviewers: alc, kib, andrew.
markj added inline comments.
sys/arm64/arm64/pmap.c
2945 (On Diff #45421)

"parameter"

4959 (On Diff #45421)

This is a bug fix.

It would be nice to have similar superpage counters ("sysctl vm.pmap.pde") as x86.

Are you referring to pde.mappings? We already have the rest:

[root@markj ~]# sysctl vm.pmap.l2
vm.pmap.l2.promotions: 35360
vm.pmap.l2.p_failures: 0
vm.pmap.l2.demotions: 5488

I meant to add it but forgot. Will fix.
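
Such a counter can follow the pattern of the existing vm.pmap.l2 sysctls in sys/arm64/arm64/pmap.c. A minimal sketch, assuming the vm.pmap.l2 sysctl node is already declared in that file (the variable name pmap_l2_mappings is an assumption, mirroring amd64's pde.mappings):

static u_long pmap_l2_mappings;
SYSCTL_ULONG(_vm_pmap_l2, OID_AUTO, mappings, CTLFLAG_RD,
    &pmap_l2_mappings, 0, "2MB page mappings");

/* Bumped wherever a 2MB mapping is installed directly, e.g.: */
atomic_add_long(&pmap_l2_mappings, 1);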

Yes. As a point of comparison, on HEAD/amd64, I see

18551.592u 1062.479s 56:55.94 574.1%    52924+3525k 52553+83337io 15368pf+0w
Mon Jul 16 22:34:55 CDT 2018
vm.pmap.pde.promotions: 61849
vm.pmap.pde.p_failures: 8440
vm.pmap.pde.mappings: 186494
vm.pmap.pde.demotions: 4125
vm.reserv.reclaimed: 0
vm.reserv.partpopq:
DOMAIN    LEVEL     SIZE  NUMBER
     0,      -1, 245940K,    160

vm.reserv.fullpop: 118
vm.reserv.freed: 1242237
vm.reserv.broken: 0

after a "buildworld". If I "force" almost all of clang's code to be superpages, I see

17938.868u 992.991s 55:29.09 568.6%     52934+3528k 51116+82109io 14078pf+0w
Tue Jul 17 12:28:43 CDT 2018
vm.pmap.pde.promotions: 61938
vm.pmap.pde.p_failures: 8348
vm.pmap.pde.mappings: 2674410
vm.pmap.pde.demotions: 4037
vm.reserv.reclaimed: 0
vm.reserv.partpopq:
DOMAIN    LEVEL     SIZE  NUMBER
     0,      -1, 219356K,    130

vm.reserv.fullpop: 145
vm.reserv.freed: 1242232
vm.reserv.broken: 0

(Forcing superpages means "clang -v; dd if=/usr/bin/clang of=/dev/null")

I get:

vm.pmap.l2.promotions: 76312
vm.pmap.l2.p_failures: 0
vm.pmap.l2.mappings: 1603
vm.pmap.l2.demotions: 8526

and

vm.pmap.l2.promotions: 76205
vm.pmap.l2.p_failures: 23331119
vm.pmap.l2.mappings: 1778
vm.pmap.l2.demotions: 8855

respectively. I note that arm64 doesn't implement pmap_copy(). clang has the following program headers on this system:

[root@markj /usr/src]# readelf -l $(which clang)

Elf file type is EXEC (Executable file)
Entry point 0x10e0000
There are 9 program headers, starting at offset 64

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flg    Align
  PHDR           0x0000000000000040 0x0000000000010040 0x0000000000010040
                 0x00000000000001f8 0x00000000000001f8  R      0x8
  LOAD           0x0000000000000000 0x0000000000010000 0x0000000000010000
                 0x00000000010c7ed8 0x00000000010c7ed8  R      0x10000
  LOAD           0x00000000010d0000 0x00000000010e0000 0x00000000010e0000
                 0x0000000002813a44 0x0000000002813a44  R E    0x10000
  LOAD           0x00000000038f0000 0x0000000003900000 0x0000000003900000
                 0x0000000000012530 0x0000000000296959  RW     0x10000
  TLS            0x0000000003900000 0x0000000003910000 0x0000000003910000
                 0x0000000000001800 0x0000000000001820  R      0x10
  GNU_RELRO      0x0000000003900000 0x0000000003910000 0x0000000003910000
                 0x0000000000002530 0x0000000000002530  R      0x1
  GNU_EH_FRAME   0x00000000010b9248 0x00000000010c9248 0x00000000010c9248
                 0x0000000000002cdc 0x0000000000002cdc  R      0x1
  GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000000  RW     0
  NOTE           0x0000000000000238 0x0000000000010238 0x0000000000010238
                 0x0000000000000030 0x0000000000000030  R      0x4

lld has a curious default layout for arm64. A while back, I logged into the arm64 ref machine and found that a bash executable compiled in, I believe, February had a text segment starting at 4M. However, newer executables aren't so nicely aligned; in other words, the initial R/O section doesn't start at a 2M boundary. That's why you're not getting a high mapping count.

Find the following line in kern_exec.c and replace 0 by 16.

vm_object_color(object, 0);
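
After that change, the call would read as follows; a color of 16 base pages (64KB) matches the 0x10000 start of clang's first LOAD segment in the readelf output above:

vm_object_color(object, 16);
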
That did the trick. Less than a minute into the build I see:

vm.pmap.l2.promotions: 1026
vm.pmap.l2.p_failures: 1
vm.pmap.l2.mappings: 16794
vm.pmap.l2.demotions: 47
  • Add vm.pmap.l2.mappings.
  • Fix typo.

contrib/llvm/tools/lld/ELF/Arch/X86_64.cpp has

template <class ELFT> X86_64<ELFT>::X86_64() {
  GotBaseSymOff = -1;
  CopyRel = R_X86_64_COPY;
  GotRel = R_X86_64_GLOB_DAT;
  PltRel = R_X86_64_JUMP_SLOT;
  RelativeRel = R_X86_64_RELATIVE;
  IRelativeRel = R_X86_64_IRELATIVE;
  TlsGotRel = R_X86_64_TPOFF64;
  TlsModuleIndexRel = R_X86_64_DTPMOD64;
  TlsOffsetRel = R_X86_64_DTPOFF64;
  GotEntrySize = 8;
  GotPltEntrySize = 8;
  PltEntrySize = 16;
  PltHeaderSize = 16;
  TlsGdRelaxSkip = 2;
  TrapInstr = 0xcccccccc; // 0xcc = INT3

  // Align to the large page size (known as a superpage or huge page).
  // FreeBSD automatically promotes large, superpage-aligned allocations.
  DefaultImageBase = 0x200000;
}

The corresponding file for AArch64 doesn't have a DefaultImageBase definition.

Yufeng, see if you can post the clang/dhrystone png here. I suspect that Mark and Kostik will find it interesting.

Couldn't we use the load address of the image to colour the object such that we don't rely on the linker providing a 2MB-aligned address?

The patch we wrote allows the fault handler to fully populate and promote a reservation once the number of non-resident 64K-aligned chunks (aka "holes") in the reservation falls below some threshold. The test makes Clang compile Dhrystone 5000 times. The following figure shows how three hardware counters change as we increase the threshold (thus making the promotion policy more aggressive). The rightmost point on the x axis is where we apply the "dd if=/usr/bin/clang of=/dev/null" trick and end up getting 25 superpages out of clang's main executable region.

Clang_Dhrystone_ag_code_promo.png (477×771 px, 25 KB)
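
As a rough illustration of that policy (not the actual patch; vm_reserv_holes() and vm_reserv_fill() are made-up names):

/*
 * Hypothetical sketch of the threshold policy described above.
 */
static int promote_hole_threshold = 4;	/* non-resident 64KB chunks allowed */

static bool
vm_reserv_try_fill_and_promote(vm_reserv_t rv)
{
	/*
	 * If few enough 64KB-aligned chunks of the 2MB reservation are
	 * still non-resident, populate them so that the fault handler
	 * can promote the whole mapping.
	 */
	if (vm_reserv_holes(rv) > promote_hole_threshold)
		return (false);
	vm_reserv_fill(rv);
	return (true);
}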

In D16303#346166, @alc wrote:

The corresponding file for AArch64 doesn't have a DefaultImageBase definition.

Couldn't we use the load address of the image to colour the object such that we don't rely on the linker providing a 2MB-aligned address?

This is a chicken-or-the-egg problem. To determine the load address we have to read the headers from the file, but we have to allocate physical pages to hold the headers. But, for those physical pages to come from a reservation, we need to set the color before their allocation. The alternative is to do reallocation and copying. (Ick!)

Alternatively, we give up on a superpage mapping for the initial part of the file. Essentially, this is what the 0 -> 16 hack did. However, for clang, for example, there is a big promotable chunk of read-only data at the start of the file.

The current layout guarantees that we waste the first 16 PTEs in the L3 page table page and 64KB of virtual address space, whereas a 2MB-aligned layout would waste one PTE in the L2 page table page and 2MB of virtual address space. So I don't really see a disadvantage to the 2MB-aligned layout.
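
Spelling out the arithmetic with 4KB base pages: the 16 wasted L3 PTEs correspond to 16 x 4KB = 64KB of virtual address space, while the one wasted L2 PTE corresponds to a single 2MB region. Either cost is negligible in a 64-bit address space, and only the 2MB-aligned layout leaves the initial read-only segment promotable.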

alc added inline comments.
sys/arm64/arm64/pmap.c
1674 (On Diff #45438)

Placing this function before rather than after pmap_alloc_l3() reduces the diff between this pmap and amd64's.

2920 (On Diff #45438)

Similarly, placing pmap_enter_2mpage() and pmap_enter_l2() after pmap_enter() reduces the size of the diff.

This revision is now accepted and ready to land. Jul 20 2018, 12:22 AM
markj marked an inline comment as done.
This revision was not accepted when it landed; it landed in state Needs Review. Jul 20 2018, 4:37 PM
This revision was automatically updated to reflect the committed changes.

Diff'ing this pmap with amd64's also shows that it is missing reserve_pv_entries(). My recollection is that I had to add (and use) this function when I started implementing fine-grained PV list locking on amd64. And, arm64 has already replicated that fine-grained PV list locking ...

It does have one... I fixed a bug there in r335580.

Oops. I was also diff'ing arm/pmap-v6.c recently, and must have forgotten which diff I was looking at. :-)

Mark, are you familiar with what ATTR_CONTIGUOUS does?

In D16303#347475, @alc wrote:

Mark, are you familiar with what ATTR_CONTIGUOUS does?

Somewhat; I read about it while researching the format of armv8 page table entries. It seems to allow one to exploit the physical contiguity of a range of virtual pages so that the range requires only a single TLB entry, quite similar to superpages but at a finer granularity. I could imagine using the reservation system to detect cases where we could use that attribute.
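
Concretely, with a 4KB granule the hint covers 16 adjacent L3 entries (64KB), all of which must map a physically contiguous, 64KB-aligned range with identical attributes. A minimal sketch, where pmap_l3_set_contig() is a hypothetical helper and ATTR_CONTIGUOUS is the architectural contiguous bit (bit 52 of the descriptor):

#define	L3C_ENTRIES	16		/* 16 x 4KB = 64KB per TLB entry */
#define	ATTR_CONTIGUOUS	(1UL << 52)	/* architectural contiguous bit */

/* Hypothetical helper: mark a 64KB-aligned run of 16 L3 entries. */
static void
pmap_l3_set_contig(pt_entry_t *l3p)
{
	int i;

	/*
	 * Real code would need atomic stores and TLB invalidation;
	 * setting the bit on a live run whose entries differ risks
	 * caching a corrupt translation.
	 */
	for (i = 0; i < L3C_ENTRIES; i++)
		l3p[i] |= ATTR_CONTIGUOUS;
}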

Yes, exactly. I'm trying to figure out an incremental approach to implementing and testing use of ATTR_CONTIGUOUS. I speculate that step 1 is to teach pmap_{protect,remove}_l3() to recognize ATTR_CONTIGUOUS mappings and demote them, unless a range operation is being performed. I would not attempt to "coalesce" the 16 PV entries into one; just let them be. Then, step 2, the first use case, would be in pmap_enter_object(): when m->psind == 1 but the virtual address range isn't compatible with a 2MB mapping, use ATTR_CONTIGUOUS. This will, for example, be exercised by large, lld-generated executables like clang, at the boundary between the read-only segment and the executable segment.
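
For step 2, the eligibility check in pmap_enter_object() might look like the following sketch (L3C_OFFSET and pmap_can_enter_l3c() are assumed names, not code from this review):

#define	L3C_OFFSET	(16 * PAGE_SIZE - 1)	/* 64KB chunk mask */

/*
 * Hypothetical: true if a 64KB ATTR_CONTIGUOUS mapping is usable where
 * a 2MB mapping is not; VA and PA must be congruent modulo 64KB.
 */
static bool
pmap_can_enter_l3c(vm_offset_t va, vm_page_t m)
{
	return (m->psind == 1 &&
	    ((va ^ VM_PAGE_TO_PHYS(m)) & L3C_OFFSET) == 0);
}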

That sounds reasonable to me. I guess step 3 would consist of teaching pmap_enter() to check for an aligned, populated subrange of the reservation for the page, and promote the entries if so?

I think pmap_ts_referenced() will need some scheme for handling ATTR_CONTIGUOUS.

Yes. And, as soon as I sent that last message, it occurred to me that pmap_enter() would need to be modified to demote ATTR_CONTIGUOUS mappings, e.g., for COW faults.

Correct, pmap_ts_referenced() will need some scheme for handling ATTR_CONTIGUOUS, and ultimately pmap_clear_modify() will too.

P.S. Supposedly Zen/Ryzen does something like ATTR_CONTIGUOUS automatically, but the details of exactly what it does don't appear in the public documents, other than that it involves a 32 KB page size.

Hmm, do you know of a name or keyword for this feature?

It's called "PTE Coalescing". I've never found more than a 1-line description in slide decks from AMD, and a mention of being able to count 32 KB page L1 DTLB hits and misses in the open-source documentation for the performance counters. I'm trying to find out more ...