
Implement superpages for PowerPC64 (HPT)
ClosedPublic

Authored by luporl on Jun 11 2020, 6:54 PM.

Details

Summary

This change adds support for transparent superpages for PowerPC64
systems using Hashed Page Tables (HPT). All pmap operations are
supported.

The changes were inspired by RISC-V implementation of superpages,
by @markj (r344106), but heavily adapted to fit PPC64 HPT architecture
and existing MMU OEA64 code.

Until these changes are better tested, superpage support is disabled by default.
To enable it, set vm.pmap.superpages_enabled=1.

In this initial implementation, when superpages are disabled, system performance stays at the same level as without these changes.
When superpages are enabled, buildworld time increases a bit (~2%).
However, for workloads that put heavy pressure on the TLB, the performance boost is much bigger (see HPC Challenge and pgbench below).
Below are the buildworld times of a POWER9 machine (Talos II) with 32GB RAM, with CURRENT kernel (r366072) using GENERIC64 config:

*   Without D25237:
    >>> World built in 7850 seconds, ncpu: 32, make -j32

*   With D25237 and vm.pmap.superpages_enabled=0:
    >>> World built in 7781 seconds, ncpu: 32, make -j32
    ~0.9% faster than HEAD

*   With D25237 and vm.pmap.superpages_enabled=1:
    >>> World built in 7996 seconds, ncpu: 32, make -j32
    ~1.9% slower than HEAD
    ~2.8% slower than vm.pmap.superpages_enabled=0
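The percentages above follow directly from the three build times. A minimal sketch of the arithmetic, assuming the relative change is taken against whichever run is named as the baseline (the function names are illustrative only):

```c
#include <stdio.h>

/*
 * Relative change of time `t` against baseline `base`, in percent.
 * Positive means slower (more seconds), negative means faster.
 */
double
pct_change(double base, double t)
{
	return ((t - base) / base * 100.0);
}

/* Print the three comparisons quoted in the summary. */
void
print_buildworld_deltas(void)
{
	const double head = 7850.0;	/* without D25237 */
	const double sp_off = 7781.0;	/* superpages_enabled=0 */
	const double sp_on = 7996.0;	/* superpages_enabled=1 */

	printf("sp=0 vs HEAD: %+.1f%%\n", pct_change(head, sp_off));
	printf("sp=1 vs HEAD: %+.1f%%\n", pct_change(head, sp_on));
	printf("sp=1 vs sp=0: %+.1f%%\n", pct_change(sp_off, sp_on));
}
```

Calling print_buildworld_deltas() reproduces the rounded figures listed above from the raw seconds.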

Despite the current performance overhead on buildworld when superpages are enabled, some workloads already show a significant performance boost, mainly those that make heavy use of the TLB.
An example is the RandomAccess test from HPC Challenge, which performs several random accesses to a large memory area. With superpages enabled, a 60% boost on a POWER8 machine and a 23% boost on Talos were measured.
Database programs are also said to benefit from superpages. Running pgbench showed a boost of about 5% on POWER8 and 8.4% on Talos, taking the average TPS (transactions per second) from 10 select-only runs of 5 seconds each (pgbench -S -T 5). When running for longer periods or together with updates, disk access time ends up dominating and the gains dissipate. (pgbench was run on a test database with scale factor 150, with a single thread and client, to minimize other sources of inefficiency, but the database was probably not big enough to take full advantage of superpages.)

Test Plan

A test program, along with a test kernel module, can be found at: https://people.freebsd.org/~luporl/sptest
(The Makefile assumes the host is an amd64 machine with a cross powerpc64 toolchain installed, although it should be easy to port it to build natively.)
With the compiled program and module in the current directory, running ./test all runs all test cases, which exercise most superpage operations, such as:

  • promotion
  • demotion
  • enter
  • removal
  • unwire
  • protect
  • etc

The test program should be run in single user mode, to avoid interference from other processes.
Also, most test cases expect a file called "mmap" in the current directory, with a size of exactly 16MB.
To create it, run dd if=/dev/zero of=mmap bs=1m count=16.
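For reference, the same file can be prepared and touched from C. This is only a sketch and is not taken from the sptest program: the function names are hypothetical, and ftruncate() produces a sparse file that reads back as zeros, whereas dd writes the zeros explicitly.

```c
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

#define TEST_FILE_SIZE	(16UL * 1024 * 1024)	/* exactly 16 MB */
#define TOUCH_STRIDE	4096			/* one 4 KB base page */

/* Create (or truncate) `path` as a 16 MB file. Returns 0 on success. */
int
create_test_file(const char *path)
{
	int fd;

	fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return (-1);
	if (ftruncate(fd, (off_t)TEST_FILE_SIZE) != 0) {
		close(fd);
		return (-1);
	}
	return (close(fd));
}

/*
 * Map the file and write one byte to every 4 KB page, so that each base
 * page gets a mapping. Returns the number of pages touched, or -1.
 */
long
touch_all_pages(const char *path)
{
	struct stat st;
	char *p;
	long n = 0;
	int fd;

	fd = open(path, O_RDWR);
	if (fd < 0)
		return (-1);
	if (fstat(fd, &st) != 0) {
		close(fd);
		return (-1);
	}
	p = mmap(NULL, (size_t)st.st_size, PROT_READ | PROT_WRITE,
	    MAP_SHARED, fd, 0);
	close(fd);
	if (p == MAP_FAILED)
		return (-1);
	for (off_t off = 0; off < st.st_size; off += TOUCH_STRIDE) {
		p[off] = 1;
		n++;
	}
	munmap(p, (size_t)st.st_size);
	return (n);
}
```

A 16 MB file spans 4096 base pages, which matches one 16M superpage on this platform; whether a given mapping is actually promoted depends on the kernel, not on this program.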

Besides the test program, stress was used to simulate high memory usage, exercise swap and better test SP REF/CHG bits handling.

The tests above were run both on a QEMU VM, with KVM enabled, and also directly on a POWER8 (previous patch version) and a POWER9 machine.

Finally, buildworld was run with superpages enabled (results above) and no stability issues were noticed.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

sys/powerpc/aim/mmu_oea64.c
287 ↗(On Diff #72993)

Radix (and amd64 at least) already uses vm.pmap.pg_ps_enabled, can you use that instead, for consistency?

1893–1895 ↗(On Diff #72993)

Is a DMAP really a prerequisite for superpages? Or do they just share the same prerequisites?

3203 ↗(On Diff #72993)

I think adding 'void' is only necessary for prototypes, not for definitions.

3479 ↗(On Diff #72993)

Unless there's a statistically significant performance penalty associated with it, this can be unconditional.

sys/powerpc/aim/moea64_native.c
138–145 ↗(On Diff #72993)

This may need to change in the future, to support DRI. We may need to demote to change cache characteristics for an individual DMAP page.

186 ↗(On Diff #72993)

Will this also work on ISA 2.03 and prior? We may want to just do a 'if running on old, use old tlbie, otherwise use new' (or use asm routines to do it all).

186 ↗(On Diff #72991)

The PPC970 supports superpages, as does the POWER4. Now, we probably don't care much about the POWER4, but the PPC970 can benefit from superpages, but as mentioned in the comment right above, it uses a different tlbie instruction format. Can that be worked into this?

sys/powerpc/aim/mmu_oea64.c
287 ↗(On Diff #72993)

Sure.

1893–1895 ↗(On Diff #72993)

I guess they share the same prerequisite: large pages availability.
What if I change !hw_direct_map to moea64_large_page_size == 0?

3203 ↗(On Diff #72993)

Right, I always add 'void' to functions with no arguments, but it's not really necessary in definitions, so I'll remove it.

3479 ↗(On Diff #72993)

Right, buildworld times showed only a 1% difference when sp_enabled=0, which doesn't seem statistically significant.

sys/powerpc/aim/moea64_native.c
138–145 ↗(On Diff #72993)

Ok, but this would not be a trivial change.
AFAIK, there is no easy way to tell if a PTE belongs to kernel or userspace.
This information could be obtained from the pvo (although currently there is none in moea64_insert_to_pteg_native(), which calls TLBIE),
but there are some complications: TLBIE must match the features (e.g. LPTE_BIG, 16M or 4K/16M) of the old page, but in operations such as insert and replace the pvo describes the new page.

186 ↗(On Diff #72993)

By taking a look at PowerISA 2.03 spec, it initially looks like this wouldn't work on it, but in the end it's ok.

On 2.03, AP is a 1-bit flag telling if the page to be invalidated is 4K or 64K.
But this and section 5.7.5 Virtual Address Generation, when describing Mixed Page Size (MPS), suggest that only a 4K/64K page size mix is supported in a given segment, which would prevent the superpage mechanism from being used on these architectures (unless we add support for 64K superpages). This way, with superpages not supported on ISA 2.03 and earlier, the device tree check for 4K/16M MPS would set it as not available, disabling superpages. Then ap would always be 0 and the old tlbie format would continue to work.

  • Address jhibbits' comments
luporl edited the summary of this revision. (Show Details)
luporl edited the summary of this revision. (Show Details)
  • Add more specific sysctl counters about promotion failures.
  • Add an early check to handle a very common promotion failure case, which dramatically reduces the number of promotion failures and improves buildworld performance substantially.
  • Factor out different PPC64 tlbie instruction forms into an ifunc, which also corrects large page invalidation on older cores.

Right now I'm planning to profile the code to try to improve superpages performance, which should at least show some gain over not using them.

sys/powerpc/aim/mmu_oea64.c
1893–1895 ↗(On Diff #72993)

Changed to moea64_large_page_size == 0.

sys/powerpc/aim/moea64_native.c
138–145 ↗(On Diff #72993)

Now using LPTE_KERNEL_VSID_BIT to check if page belongs to kernel or userspace.

186 ↗(On Diff #72993)

Old tlbie instruction format is now on __tlbie_old and new one on __tlbie_new.
Both support invalidating regular and large pages.
(although the code to invalidate a 16M base/16M actual page should be added, if/when needed)

Adding myself to the reviewers list so this stays on my dashboard.

As an intermediate first diff it would be nice to split out the changes to use PVO_PADDR().

Do you have an idea where the slowdown is coming from with pg_ps_enabled=1?

Right, I'll move the PVO_PADDR() changes to another diff.

I wasn't able to profile the code yet, so I don't know where the slowdown is coming from with pg_ps_enabled=1.

Moved PVO_PADDR() changes to another diff (D25654).

sys/powerpc/aim/moea64_native.c
146 ↗(On Diff #74417)

I think 'old' is only needed when 'crop' is needed; otherwise, it's not used. So putting the tlb_old into the 'if (moea64_crop_tlbie)' block, with a goto, and putting the tlbie_new() as the rest of the body, should work, and might knock off a part of the hit, because you'll be removing one level of indirection.
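The control flow suggested here could look roughly like the sketch below. The stub functions and the enum are hypothetical stand-ins that only record which form was picked; the real routines issue the tlbie instruction and the real TLBIE() has more work around it.

```c
#include <stdint.h>

enum tlbie_form { TLBIE_FORM_OLD, TLBIE_FORM_NEW };

/* Hypothetical stand-in for the old instruction form. */
static enum tlbie_form
tlbie_old_stub(uint64_t vpn)
{
	(void)vpn;
	return (TLBIE_FORM_OLD);
}

/* Hypothetical stand-in for the new instruction form. */
static enum tlbie_form
tlbie_new_stub(uint64_t vpn)
{
	(void)vpn;
	return (TLBIE_FORM_NEW);
}

/*
 * The old form is only reached inside the crop block and jumps straight
 * to the common tail; everything else falls through to the new form
 * with a direct call, so no extra level of indirection is involved.
 */
enum tlbie_form
tlbie_dispatch_sketch(uint64_t vpn, int crop_tlbie)
{
	enum tlbie_form used;

	if (crop_tlbie) {
		used = tlbie_old_stub(vpn);
		goto done;
	}
	used = tlbie_new_stub(vpn);
done:
	/* Common tail (synchronization would go here in real code). */
	return (used);
}
```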

sys/powerpc/include/vmparam.h
192–194 ↗(On Diff #74417)

I think this is unnecessary now.

  • Address jhibbits comments
luporl added inline comments.
sys/powerpc/aim/moea64_native.c
146 ↗(On Diff #74417)

Ok. This change worked fine on Talos, but I didn't notice any change in performance.

sys/powerpc/include/vmparam.h
192–194 ↗(On Diff #74417)

Right, I forgot to remove this part, when removing PPC_SUPERPAGES.

Just for the record, measuring buildworld (as well as kernel and libc builds) times with superpages enabled and disabled didn't show any significant variation.
But when comparing the build times with a kernel that has D25237 vs one without it, there is a consistent performance hit of about 3%.

At the moment my plan is to adapt hwpmc_ppc970 to make it work on POWER8/9 machines, and then use pmcstat to try to find out where the hit is coming from.
But suggestions are welcome!

sys/powerpc/aim/mmu_oea64.c
232 ↗(On Diff #74756)

There seems to be a mix of "lp" and "sp" to denote "superpage". I think we should pick one and stick with it.

267 ↗(On Diff #74756)

What about protection bits? Oh I see, those are checked separately. It might be worth adding a sentence explaining that.

Why is PVO_LARGE included?

281 ↗(On Diff #74756)

pg_ps_enabled is kind of a misnomer: "ps" refers to PG_PS, which is the name of the bit in x86 page tables that indicates that a PDE or PDPE maps a superpage. Other arches just call it superpages_enabled.

1678 ↗(On Diff #74756)

Why do we not promote superpages in the kernel map? Most kernel mappings are unmanaged, so the check below will catch them, but exec_map and pipe_map create managed mappings that can in principle be promoted.

3688 ↗(On Diff #74756)

In other implementations, this check is in the caller. Why not here?

I would check moea64_ps_enabled() earlier, so as not to pessimize systems where superpages are disabled, since vm_reserv_level_iffullpop() is more expensive.
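A small sketch of the ordering being asked for, with hypothetical stand-ins for both checks: the cheap flag test goes first, so the expensive query is never reached when superpages are disabled.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the superpages-enabled setting. */
bool superpages_enabled_flag;

/* Counts how often the expensive check below actually runs. */
int expensive_calls;

/*
 * Hypothetical stand-in for an expensive reservation query such as
 * vm_reserv_level_iffullpop(); here it always reports "fully populated".
 */
static bool
reservation_fully_populated(void)
{
	expensive_calls++;
	return (true);
}

/*
 * && short-circuits, so the reservation query is skipped whenever the
 * flag is false.
 */
bool
should_try_promotion(void)
{
	return (superpages_enabled_flag && reservation_fully_populated());
}
```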

3789 ↗(On Diff #74756)

Ouch, won't this be quite expensive? Do you actually need to invalidate the small mappings? This is one area where the various superpage implementations differ. ARM has an annoying requirement that a physical page not be aliased by both 4KB and large mappings, so promotions require a "break-before-make", in which the 4KB mappings are cleared (by clearing the L2 entry), TLBs are invalidated, and then the large mapping is created. See pmap_update_entry() on arm64 for example. amd64, i386 and riscv do not have this requirement and therefore do not perform TLB invalidation during promotion. I am not sure about the requirements on POWER, but it would be better to avoid or batch these invalidations if possible.

3808 ↗(On Diff #74756)

The return value is unused.

sys/powerpc/aim/mmu_oea64.c
3789 ↗(On Diff #74756)

(Note: IBM differentiates between "Effective Address" and "Virtual Address" -- VA is an (up to) 78 bit address that is generally unique (although I suppose if two processes were to share the same VSID at any point, they could theoretically have 256M or 1T chunks of shared address space), EA is the normal 64 bit address that is the address that is stored in pointers and such)

There is no L2 entry, HPT is not a radix table, it's a segment + hash scheme, where the EA is converted into a VA by using the (32-entry) SLB to look up a VSID (or taking a segment fault, or using a process scoped segment table if we were to set one up), and then using that VSID and the page bits of the EA (as per the L,LP bits in the SLBE, for FreeBSD we can assume 4k base pages) to get the VPN, and then using that VPN as a hash key into the pagetable to identify the address of the correct PTEG to search. (As well as possibly doing a secondary search to come up with a second PTEG to scan)

The 8 records of the PTEG (and 8 of the secondary if enabled) are then matched against the properties of the SLBE to look for a match that satisfies the requirements.
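As a rough illustration of the lookup described above, the sketch below computes the primary and secondary hash for a 4K base page in a 256 MB segment and turns a hash into a PTEG byte offset. The constants and function names are assumptions made for the example (39 hashed VSID bits, 128-byte PTEGs, secondary hash as the complement of the primary); the ISA section referenced in this comment is the authority, and 1T segments hash differently.

```c
#include <stdint.h>

#define SK_SEG_OFFSET_MASK	0x0fffffffULL	/* offset in a 256 MB segment */
#define SK_BASE_PAGE_SHIFT	12		/* 4 KB base pages */
#define SK_VSID_HASH_MASK	0x7fffffffffULL	/* low 39 bits of the VSID */
#define SK_PTEG_SIZE		128u		/* 8 PTEs x 16 bytes */

/* Primary hash: hashed VSID bits XOR the page index within the segment. */
uint64_t
hpt_primary_hash(uint64_t vsid, uint64_t ea)
{
	uint64_t page_index;

	page_index = (ea & SK_SEG_OFFSET_MASK) >> SK_BASE_PAGE_SHIFT;
	return ((vsid & SK_VSID_HASH_MASK) ^ page_index);
}

/* Secondary hash: one's complement of the primary hash. */
uint64_t
hpt_secondary_hash(uint64_t vsid, uint64_t ea)
{
	return (~hpt_primary_hash(vsid, ea));
}

/*
 * Byte offset of the PTEG inside the page table. pteg_mask is the
 * number of PTEGs minus one (the table size is a power of two).
 */
uint64_t
hpt_pteg_offset(uint64_t hash, uint64_t pteg_mask)
{
	return ((hash & pteg_mask) * SK_PTEG_SIZE);
}
```

For example, with VSID 1 and EA 0x10002000 the page index is 2, so the primary hash is 3; with a 2048-PTEG table (mask 0x7ff) that selects the PTEG at byte offset 384.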

It is vitally important that there not be overlapping matches in VA space because that can lead to multihits in the TLB if two different page sizes are observed at any point. This causes a machine check, which brings down the whole machine.

So yes, it is necessary to do a bunch of invalidation as far as I know.

See section 5.7 in Book III of the ISA. (Preferably 3.0) -- 5.7.7 is the main description of address translation.

FWIW, I wonder if we could optimize the pvo_entry structure by not tracking the entire address of the PTE slot and instead reconstructing it by running the hash function given the VSID.

sys/powerpc/aim/mmu_oea64.c
3789 ↗(On Diff #74756)

(that is, only tracking the offset into the PTEG and calculating the PTEG base address when we need it)

sys/powerpc/aim/mmu_oea64.c
3789 ↗(On Diff #74756)

Additional note: the main benefit of large pages on HPT is that the *tlb hit rate* goes up. You still need a PTE for every 4k of it if you want to avoid page faults, but a match will get cached in the TLB / ERATs as covering the entire range.

sys/powerpc/aim/mmu_oea64.c
232 ↗(On Diff #74756)

Yeah, I've used "sp" (superpage) during most of the changes. The idea was to call the parts that deal with the superpage feature (like promotion and demotion) "sp", to match what other archs call them, but most places in PPC code use "lp" (large page) when referring to a page that spans several 4k pages, although currently PPC64 uses large pages for DMAP only.

I've used "lp" here because this is related to a PPC-specific hardware feature and closely related to others already named as large page.

For the rest of the code, changing "lp" and "large_page" to "sp" and "superpage" doesn't seem a good idea to me. I would also prefer to leave the parts that deal with promotion, demotion, etc. as superpages, but I guess it would be ok to rename them to "large_pages" too.
What do you think?

267 ↗(On Diff #74756)

Right, protection bits are in another PVO field and are checked separately. I'll add a comment about it, and maybe rename PVO_PROMOTE to PVO_FLAGS_PROMOTE, to make it clear.

PVO_LARGE was added to make sure all pages are of the same size (4K), but I guess it is not necessary to check it during promotion, because if any non-4K page is found during promotion, it means the logic is really broken.

281 ↗(On Diff #74756)

Ah, ok, thanks for the explanation, I'll change it.

1678 ↗(On Diff #74756)

I was afraid that using superpages in kernel map could make things more complex and cause some breaks and also wasn't sure about the benefits.
But I can give it a try now and see how things go.

3688 ↗(On Diff #74756)

No special reason. I'll move the check to the caller.

Nice tip, I'll check moea64_ps_enabled() earlier.

3789 ↗(On Diff #74756)

As bdragon explained, we really need to invalidate all mappings.
We don't have a batched invalidation mechanism on PPC yet, but it could be a nice addition.

I've tried to optimize the PVO tree by using only one PVO to track a superpage, but I started facing some issues and gave up at that moment, to make everything work correctly first, before trying more optimizations.
From what I remember, the main issue was caused by not saving the PVO_HID bit for each 4K PTE. IIUC, we can't guess what it will be for a given 4K range within a superpage, because it is set when we fail to insert the PTE in the primary hash table and use the second one instead. But maybe we can save only this bit for each PTE, in a PVO bitmap, and make the idea of using only one PVO for superpage work.
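To give an idea of the cost of that bitmap: a 16M superpage covers 4096 4K PTEs, so one bit per PTE takes 512 bytes. The sketch below is purely illustrative; the struct and function names are hypothetical and nothing like it exists in this diff.

```c
#include <stdbool.h>
#include <stdint.h>

#define SK_SP_SIZE	(16u * 1024 * 1024)		/* 16M superpage */
#define SK_PAGE_SIZE	4096u				/* 4K base page */
#define SK_SP_PAGES	(SK_SP_SIZE / SK_PAGE_SIZE)	/* 4096 subpages */
#define SK_HID_WORDS	(SK_SP_PAGES / 64)		/* 64 x 64-bit words */

/* One bit per 4K PTE: set if the PTE went to the secondary hash group. */
struct sp_hid_bitmap {
	uint64_t bits[SK_HID_WORDS];
};

/* Record whether subpage `idx` (0..4095) was inserted with the HID bit. */
void
sp_hid_set(struct sp_hid_bitmap *bm, unsigned int idx, bool secondary)
{
	uint64_t mask = 1ULL << (idx % 64);

	if (secondary)
		bm->bits[idx / 64] |= mask;
	else
		bm->bits[idx / 64] &= ~mask;
}

/* Return the saved HID bit for subpage `idx`. */
bool
sp_hid_get(const struct sp_hid_bitmap *bm, unsigned int idx)
{
	return ((bm->bits[idx / 64] >> (idx % 64)) & 1);
}
```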

3808 ↗(On Diff #74756)

Right, I'll remove it.

  • address (part of) markj's comments

The plan now is to (finally) profile the code changes with the now working pmcstat tool and improve performance, to put it (at least) at the same level as HEAD without these changes.

luporl added inline comments.
sys/powerpc/aim/mmu_oea64.c
1678 ↗(On Diff #74756)

As this can have a big impact on performance and stability, I'll try it after improving the performance of the current changes.

Fix ERAT multi-hit issue

pte_replace() can't be used in promote/demote, as other threads
may access the region of memory that is being promoted/demoted
at the same time, causing an ERAT multi-hit HMI to be delivered,
because an access is performed while there is a mix of 4K/16M
pages for the same memory region.
Instead of replacing each subpage in a loop, use
pte_replace_sp(), that first removes all subpages and then
inserts their replacements.

This diff also prevents superpages from being evicted during PTE
insert, which could cause an ERAT multi-hit too.
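The difference between the two approaches can be shown with a toy model. The sketch below only tracks the page size of each valid entry in a region and checks whether 4K and 16M entries ever coexist; it does not model the TLB, the ERAT, or locking, and all names are made up for the example.

```c
#include <stdbool.h>
#include <stddef.h>

enum pte_kind { PTE_INVALID, PTE_4K, PTE_16M };

/* True if valid 4K and valid 16M entries coexist in the region. */
bool
region_is_mixed(const enum pte_kind *slots, size_t n)
{
	bool small = false, large = false;

	for (size_t i = 0; i < n; i++) {
		if (slots[i] == PTE_4K)
			small = true;
		else if (slots[i] == PTE_16M)
			large = true;
	}
	return (small && large);
}

/*
 * Replace each entry in place, one at a time (the per-subpage loop).
 * Returns true only if the region was never observed mixed.
 */
bool
replace_in_place(enum pte_kind *slots, size_t n, enum pte_kind to)
{
	bool ever_mixed = false;

	for (size_t i = 0; i < n; i++) {
		slots[i] = to;
		ever_mixed |= region_is_mixed(slots, n);
	}
	return (!ever_mixed);
}

/*
 * Two-phase replace: first unset every entry, then insert the
 * replacements. Returns true only if the region was never mixed.
 */
bool
replace_two_phase(enum pte_kind *slots, size_t n, enum pte_kind to)
{
	bool ever_mixed = false;

	for (size_t i = 0; i < n; i++) {
		slots[i] = PTE_INVALID;	/* unset + invalidate */
		ever_mixed |= region_is_mixed(slots, n);
	}
	for (size_t i = 0; i < n; i++) {
		slots[i] = to;		/* insert replacement */
		ever_mixed |= region_is_mixed(slots, n);
	}
	return (!ever_mixed);
}
```

Starting from a region of 4K entries, the in-place loop is mixed right after its first iteration, while the two-phase version only ever holds entries of a single size.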

luporl added inline comments.
sys/powerpc/aim/mmu_oea64.c
3789 ↗(On Diff #74756)

The last diff implements batch PTE unset and insert. TLB invalidations are performed during PTE unset.

It would be possible to perform all TLB invalidations as a separate step, but it would require more work and I don't know how much it would improve performance, compared to the current loop that removes the PTE and invalidates its TLB entry, since on POWER9 (and probably on POWER8 too) TLBIE doesn't need locks.

luporl edited the summary of this revision. (Show Details)
luporl edited the test plan for this revision. (Show Details)
  • improve performance
    • force inline in new moea64_pte_* functions to avoid extra overhead, with clang's -O now defaulting to -O1
    • look for superpage PVO in moea64_enter only when superpages are enabled
    • use multiple PV locks for superpage ops, instead of making a single PV lock cover a superpage
  • rename SP_ to HPT_SP_ to avoid conflicts with the amdgpu kernel module

Now, by default (vm.pmap.superpages_enabled=0), system performance is at the same level as without superpages support (updated times in summary).

When superpages are enabled (vm.pmap.superpages_enabled=1), buildworld time increases a bit (~2%).
However, for workloads that put heavy pressure on the TLB, the performance boost is much bigger (e.g. HPC Challenge and pgbench in summary).

Looks good, glad you got the speed back!

sys/powerpc/aim/moea64_native.c
162 ↗(On Diff #78646)

Can you keep this comment in, to note why this silly mess is here?

Restore comment about tlbie instruction forms

luporl added inline comments.
sys/powerpc/aim/moea64_native.c
162 ↗(On Diff #78646)

Sure.

Maybe after it's in, someone can try to figure out why enabling superpages causes a 2% performance penalty in buildworld.

This revision is now accepted and ready to land. Nov 6 2020, 5:15 AM
This revision was automatically updated to reflect the committed changes.
luporl marked an inline comment as done.