This change adds support for transparent superpages for PowerPC64
systems using Hashed Page Tables (HPT). All pmap operations are
supported.
The changes were inspired by the RISC-V superpage implementation by @markj (r344106), but heavily adapted to fit the PPC64 HPT architecture and the existing MMU OEA64 code.
As these changes have not yet been widely tested, superpage support is disabled by default.
To enable it, use `vm.pmap.superpages_enabled=1`.
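For example, assuming the tunable can be set at boot time like other pmap tunables, it could be added to `/boot/loader.conf` (the tunable name comes from this change; persisting it via loader.conf is standard FreeBSD practice):

```
# /boot/loader.conf -- enable transparent superpages on PPC64 HPT
# (takes effect at the next boot)
vm.pmap.superpages_enabled=1
```

The current value can then be checked at runtime with `sysctl vm.pmap.superpages_enabled`.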
In this initial implementation, when superpages are disabled, system performance stays at the same level as an unpatched kernel.
When superpages are enabled, buildworld time increases slightly (~2%).
However, for workloads that put heavy pressure on the TLB, the performance boost is much bigger (see the HPC Challenge and pgbench results below).
Below are buildworld times on a POWER9 machine (Talos II) with 32GB RAM, running a CURRENT kernel (r366072) with the GENERIC64 config:
```
* Without D25237:
>>> World built in 7850 seconds, ncpu: 32, make -j32
* With D25237 and vm.pmap.superpages_enabled=0:
>>> World built in 7781 seconds, ncpu: 32, make -j32
~0.9% faster than HEAD
* With D25237 and vm.pmap.superpages_enabled=1:
>>> World built in 7996 seconds, ncpu: 32, make -j32
~1.9% slower than HEAD
~2.8% slower than vm.pmap.superpages_enabled=0
```
Despite the current performance overhead on buildworld when superpages are enabled, some workloads already show a significant performance boost, mainly those that put heavy pressure on the TLB.
An example is the RandomAccess test from HPC Challenge, which performs many random accesses to a large memory area. With superpages enabled, a 60% boost was measured on a POWER8 machine and 23% on the Talos II.
Database programs are also known to benefit from superpages. Running pgbench showed a boost of about 5% on POWER8 and 8.4% on the Talos II, taking the average TPS (transactions per second) over 10 select-only runs of 5 seconds each (pgbench -S -T 5). When running for longer, or together with updates, disk access time ends up dominating and the gains dissipate. (pgbench was run against a test database with scale factor 150, with a single thread and client, to minimize other sources of inefficiency, but the database was probably not big enough to take full advantage of superpages.)
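For reference, the pgbench runs described above can be reproduced roughly as follows (a sketch, not the exact commands used; the database name `test` and default connection settings are assumptions):

```
# Initialize a pgbench test database with scale factor 150
createdb test
pgbench -i -s 150 test

# 10 select-only runs of 5 seconds each, single client and thread;
# average the TPS reported for each run
for i in $(seq 1 10); do
    pgbench -S -T 5 -c 1 -j 1 test
done
```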