
amd64 pmap: LA57 AKA 5-level paging

Authored by kib on Jun 14 2020, 10:01 PM.



Since LA57 was moved to the main SDM document with revision 072, it seems that we should support it, and silicon is coming.

This patch makes pmap support both LA48 and LA57 hardware. The number of page table levels is selected at startup; the kernel always receives control from the loader with 4-level paging. It is not clear how the UEFI spec will adopt LA57; for instance, it could sometimes hand over control in LA57 mode.

Switching from LA48 to LA57 requires turning off long mode, requesting LA57 in CR4, then re-entering long mode. This is somewhat delicate and is done in pmap_bootstrap_la57(). AP startup in LA57 mode is much easier: we only need to set a bit in CR4 and load the right value into CR3.
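
As a rough illustration of the AP case, the CR4 value could be computed as in the sketch below. The bit positions are the architectural ones (CR4.PAE is bit 5, CR4.LA57 is bit 12); the SK_ macros and the sk_cr4_for_paging() helper are names made up for this example and are not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Architectural CR4 bit positions; the names are local to this sketch. */
#define SK_CR4_PAE	(UINT64_C(1) << 5)
#define SK_CR4_LA57	(UINT64_C(1) << 12)

/*
 * Hypothetical helper: compute the CR4 value to load before paging is
 * enabled.  PAE is always needed for long mode; LA57 is requested only
 * when 5-level paging was selected at startup, and cleared otherwise.
 */
static uint64_t
sk_cr4_for_paging(uint64_t cr4, bool la57)
{
	cr4 |= SK_CR4_PAE;
	if (la57)
		cr4 |= SK_CR4_LA57;
	else
		cr4 &= ~SK_CR4_LA57;
	return (cr4);
}
```

The helper only computes a value. Actually loading CR4 and CR3 has to happen with paging disabled, which is why the BSP path is the delicate one.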

I decided not to change the kernel map for now. A single PML5 entry is created that points to the existing kernel_pml4 (KML4Phys) page, along with a PML5 entry that creates our recursive mapping for vtopte()/vtopde(). This decision is motivated by the fact that we cannot overcommit KVA, so the larger space there is unusable until machines start providing wider physical memory addressing. Another reason is that I do not want to break our fragile autotuning, so the KVA expansion is not included in this first step.
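
To see why one PML5 entry is enough for the existing kernel map, it helps to split an address into its per-level indexes. The sketch below uses the architectural 9-bit index fields (bits 56:48 for PML5 down to bits 20:12 for the page table); struct sk_va_indexes and sk_va_split() are hypothetical names for this example only.

```c
#include <assert.h>
#include <stdint.h>

/* Index of a virtual address at each level of a 5-level page table. */
struct sk_va_indexes {
	unsigned pml5;	/* bits 56:48, meaningful only in LA57 mode */
	unsigned pml4;	/* bits 47:39 */
	unsigned pdp;	/* bits 38:30 */
	unsigned pd;	/* bits 29:21 */
	unsigned pt;	/* bits 20:12 */
};

/* Hypothetical helper: extract the five 9-bit indexes from va. */
static struct sk_va_indexes
sk_va_split(uint64_t va)
{
	struct sk_va_indexes ix;

	ix.pml5 = (va >> 48) & 0x1ff;
	ix.pml4 = (va >> 39) & 0x1ff;
	ix.pdp = (va >> 30) & 0x1ff;
	ix.pd = (va >> 21) & 0x1ff;
	ix.pt = (va >> 12) & 0x1ff;
	return (ix);
}
```

Every address in the top 2^48 bytes of the 64-bit space, for example 0xffffffff80000000, has all of bits 56:48 set and therefore falls under PML5 index 511, so a single entry pointing at the old PML4 page covers it.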

On the other hand, (very) large address space is definitely immediately useful for some userspace applications.

For userspace, numbering of pte entries (or page table pages) is always done for 5-level structures, even if we operate in 4-level mode. The pmap_is_la57() function is added to report the mode of the specified pmap. This is not done to allow simultaneous 4- and 5-level paging (which the hardware does not allow), but to accommodate EPT, which has a separate level control and in principle might not allow 5-level EPT even though x86 paging supports it. Anyway, it does not seem critical to have 5-level EPT support now.
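
One way to picture such a numbering is to give each level of page table pages its own contiguous index range, sized for the 56-bit user half of a 57-bit address space. The sketch below is a simplified illustration under that assumption, not the code from the patch; the level encoding and the sk_ names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SK_PAGE_SHIFT	12
#define SK_LEVEL_BITS	9
#define SK_USER_VA_BITS	56	/* user half of a 57-bit address space */

/*
 * Levels in this sketch: 0 = page table page, 1 = page directory page,
 * 2 = PDP page, 3 = PML4 page.  A page at level N maps a region of
 * 2^(12 + 9 * (N + 1)) bytes.
 */
static unsigned
sk_level_shift(unsigned level)
{
	return (SK_PAGE_SHIFT + SK_LEVEL_BITS * (level + 1));
}

/* Number of pages needed at a level to cover the whole user range. */
static uint64_t
sk_pages_at_level(unsigned level)
{
	return (UINT64_C(1) << (SK_USER_VA_BITS - sk_level_shift(level)));
}

/* Index of the level-N page that maps va; lower levels come first. */
static uint64_t
sk_pindex(uint64_t va, unsigned level)
{
	uint64_t base = 0;
	unsigned l;

	assert(level <= 3);
	assert(va < (UINT64_C(1) << SK_USER_VA_BITS));
	for (l = 0; l < level; l++)
		base += sk_pages_at_level(l);
	return (base + (va >> sk_level_shift(level)));
}
```

Because the ranges are always sized for five levels, an index means the same thing in LA48 and LA57 mode. In 4-level mode the upper part of each range simply stays unused.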

elfcontrol and proccontrol allow requesting or disabling LA57 for a specific binary, for ABI compatibility.

Bhyve, efirt, suspend/resume, and large map are adapted to LA57 but not tested.

PID              START                END PRT  RES PRES REF SHD FLAG TP PATH
 17           0x400000           0x426000 r-x   38   39   1   0 CN-- vn /bin/sh
 17           0x626000           0x629000 rw-    3    3   1   0 C--- df 
 17        0x800626000        0x800648000 r-x   34   36   2   0 CN-- vn /libexec/
 17        0x800648000        0x80066b000 rw-   28   28   1   0 C--- df 
 17        0x80066b000        0x80066c000 r--    1    1   3   0 ---- dv 
 17        0x80066c000        0x800706000 rw-   50   50   1   0 C--- df 
 17        0x800848000        0x80084a000 rw-    2    2   1   0 CN-- df 
 17        0x80084a000        0x80087e000 r-x   52   55   2   0 CN-- vn /lib/
 17        0x80087e000        0x800a7e000 ---    0    0   0   0 CN-- -- 
 17        0x800a7e000        0x800a80000 rw-    2    0   1   0 CN-- vn /lib/
 17        0x800a80000        0x800a84000 rw-    1    1   1   0 CN-- df 
 17        0x800a84000        0x800c4f000 r-x  355  384   4   0 CN-- vn /lib/
 17        0x800c4f000        0x800e4e000 ---    0    0   0   0 CN-- -- 
 17        0x800e4e000        0x800e5d000 rw-   15    0   1   0 CN-- vn /lib/
 17        0x800e5d000        0x801087000 rw-   17   17   1   0 CN-- df 
 17        0x801087000        0x8010e0000 r-x   89   94   2   0 CN-- vn /lib/
 17        0x8010e0000        0x8012df000 ---    0    0   0   0 CN-- -- 
 17        0x8012df000        0x8012e5000 rw-    6    0   1   0 CN-- vn /lib/
 17        0x8012e5000        0x8018e5000 rw-   12   12   1   0 CN-- df 
 17   0xffffffdffff000   0xfffffffffdf000 ---    0    0   0   0 ---- -- 
 17   0xfffffffffdf000   0xfffffffffff000 rw-    6    6   1   0 C--D df 
 17   0xfffffffffff000  0x100000000000000 r-x    1    1   4   0 ---- ph
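
The last mapping above ends at 0x100000000000000, which is 2^56, the lower half of a 57-bit address space. A minimal sketch of that arithmetic follows; sk_user_va_end() is a hypothetical helper for this example, and the real user address limit in the kernel may be defined differently.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: end of the lower (user) canonical half for a
 * given linear-address width, i.e. 2^(va_bits - 1).
 */
static uint64_t
sk_user_va_end(unsigned va_bits)
{
	assert(va_bits == 48 || va_bits == 57);
	return (UINT64_C(1) << (va_bits - 1));
}
```

For 48-bit addressing the same computation gives 0x800000000000, so the user half grows by a factor of 512 under LA57.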

Tested by: pho (LA48 hw)

Diff Detail

rS FreeBSD src repository - subversion
Automatic diff as part of commit; lint not applicable.
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

kib requested review of this revision. Jun 14 2020, 10:01 PM
kib edited the summary of this revision.
kib removed a reviewer: gnn.

Handle wakeup.
Update description of ptepindex.

bhyve: Handle guest's LA57 paging mode.
This is in fact independent of the rest of the patch, since the guest can set the bit in %cr4 at will.

Noted by: grehan

147 ↗(On Diff #74014)

To me, _checker seems a bit unclear. IMO boolean functions should answer a question, e.g. _is_la57, _la57_supported, or _la57_wanted, as appropriate.

799 ↗(On Diff #74014)

I assume that similar changes will come to other archs, e.g. for Arm 48 / 52. If we're going to offer similar control there, is there a more MI name we could use that's applicable everywhere (even if in the Arm case LA48 would still apply)? I don't have a great idea though; things incorporating "smaller" or "legacy" or whatnot are all relative to something else, and a term that stands alone is preferable.

147 ↗(On Diff #74014)

freebsd_brand_info_la57_img_compat ?

799 ↗(On Diff #74014)

Are you referring to ARM 8.2 'large VA'? From what I remember, they do it by increasing the page size to 64k (or doing something equivalent to that). I doubt that we would ever support such a page size on arm64.

I was not able to find an extension, up to 8.6, that would increase the number of page table levels.

Still, if you have a proposal to rename the bit, I will apply it of course. I cannot propose anything better than LA_GEN1.

147 ↗(On Diff #74014)

sounds good

799 ↗(On Diff #74014)

Ah, yes, so Intel only for the time being.

LA_GEN1 is a fine name if we think it will indeed become MI in the future, but probably unnecessary.

kib marked 2 inline comments as done. Jul 2 2020, 2:51 PM
1872 ↗(On Diff #74014)

The same change is needed in usr.sbin/bhyve/gdb.c:guest_paging_info()

kib marked an inline comment as done.

Handle bhyve/gdb.c

bhyve bits look fine.

Fix initialization of sv_sigcode_base/sv_timekeep_base for LA48 sv sysent.

I intend to commit this during the weekend, regardless of the review status.

N.B. If somebody wants to take a limited look at the patch, the most delicate place is perhaps the changes to _pmap_allocpte() and the calculation of pindexes.

I haven't had the time to do a careful, line-by-line review, but at a high level the approach looks okay.

152–153 ↗(On Diff #74014)

"Turn on the PAE bit and optionally the LA57 bit for ... is later enabled."

193 ↗(On Diff #76104)

Drop the comma.

216 ↗(On Diff #76104)

Drop the space after the cast.

231 ↗(On Diff #76104)

Drop the space after the cast.

93 ↗(On Diff #76104)

See earlier comment.

138 ↗(On Diff #76104)

Drop the final "page" from this sentence.

140 ↗(On Diff #76104)

"So, we use yet another ..."

2060 ↗(On Diff #76104)

"temporal" -> "temporary"

2077 ↗(On Diff #76104)

"mapping" -> "mappings"

3984 ↗(On Diff #76104)

"mapping" -> "mappings"

4018 ↗(On Diff #76104)

"mapping" -> "mappings"

No comma at the end

4186 ↗(On Diff #76104)

Couldn't this be up to 3 now?

4214 ↗(On Diff #76104)

"fixed" -> "determined"

4215 ↗(On Diff #76104)

"mode. Moreover, the ..."

199 ↗(On Diff #76104)

"We use the same numbering ..."

75–76 ↗(On Diff #76104)

"to 48 bits of address,"

"of 57-bit addressing"

kib marked 18 inline comments as done.

Handle notes by Alan.

This revision was not accepted when it landed; it landed in state Needs Review. Aug 23 2020, 7:44 PM
This revision was automatically updated to reflect the committed changes.