Page MenuHomeFreeBSD

PTI for amd64.

Authored by kib on Jan 8 2018, 3:44 PM.



The implementation of the Kernel Page Table Isolation (KPTI) for amd64, first version. It provides a workaround for the 'meltdown' vulnerability.

The pmap page table is split into kernel-mode table and user-mode table. Kernel-mode table is identical to the non-PTI table, while usermode table is obtained from kernel table by leaving userspace mappings intact, but only leaving the following parts of the kernel mapped:

  • kernel text (but not modules text)
  • PCPU
  • GDT/IDT/task structures
  • IST stacks for NMI and doublefault handlers.

Kernel switches to user page table before returning to usermode, and restores full kernel page table on the entry. Initial kernel-mode stack for PTI trampoline is allocated in PCPU, it is only 16 qwords. Kernel entry trampoline switches page tables. then the hardware trap frame is copied to the normal kstack, and execution continues.

IST stacks are kept mapped and no trampoline is needed for NMI/doublefault, but of course page table switch is performed.

On return to usermode, the trampoline is used again, iret frame is copied to the trampoline stack, page tables are switched and iretq is executed. The case of iretq faulting due to the invalid usermode context is tricky, since the frame for fault is appended to the trampoline frame. Besides copying the fault frame and original (corrupted) frame to kstack, the fault frame must be patched to make it look as if the fault occured on the kstack, see the comment in doret_iret detection code in trap().

Currently kernel pages which are mapped during trampoline operation are identical for all pmaps. They are registered using pmap_pti_add_kva(). Besides initial registrations done during boot, LDT and non-common TSS segments are registered if user requested their use. In principle, they can be installed into kernel page table per pmap with some work. Similarly, PCPU can be hidden from userspace mapping using trampoline PCPU page, but again I do not see much benefits besides complexity.

PDPE pages are pre-allocated during boot because we need to know pml4 entries which are copied to the top-level paging structure page, in advance on a new pmap creation. I enforce this to avoid iterating over the all existing pmaps if a new PDPE page is needed for PTI kernel mappings. The iteration is a known problematic operation on i386.

The need to flush hidden kernel translations on the switch to user mode make global tables (PG_G) meaningless and even harming, so PG_G use is disabled for PTI case. Our existing use of PCID is incompatible with PTI and is automatically disabled if PTI is enabled. PCID can be forced on only for developer's benefit.

MCE is known to be broken, it requires IST stack to operate completely correctly even for non-PTI case, and absolutely needs dedicated IST stack because MCE delivery while trampoline did not switched from PTI stack is fatal. The fix is pending.

Test Plan

Patch survived pho' stress testing.

Diff Detail

Lint Skipped
Unit Tests Skipped

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

Can we get a high level description of these changes so we're not having to infer details from code?


Are we not losing this alignment in your new version?


I would find this whole section more readable with a single 'gflag' variable and a single test for pti;

ie; gflag = pti ? 0 : X86_PG_G;

pt_p[i] = .... | gflag;

kib marked 2 inline comments as done.

pti ? PG_G : 0 -> pg_g
realign both pti and normal interrupt entries
remove remnants of the failed PCID experiment

Do commit the realignment of the interrupt entries.

I've now reproduced the issues on vanilla FreeBSD:

# make -sj6 buildkernel
make[1]: "/usr/src/Makefile.inc1" line 155: SYSTEM_COMPILER: Determined that CC=cc matches the source tree.  Not bootstrapping a cross-compiler.
>>> Kernel build for GENERIC started on Tue Jan  9 03:02:49 EST 2018

>>> stage 1: configuring the kernel                                                                                                                                               
Kernel build directory is /usr/obj/usr/src/amd64.amd64/sys/GENERIC
Don't forget to do ``make cleandepend && make depend''

>>> stage 2.1: cleaning up the object tree
rm: /usr/obj/usr/src/amd64.amd64/sys/GENERIC/modules/usr/src/sys/modules/drm2/radeonkmsfw/R600_me/machine: No child processes
rm: /usr/obj/usr/src/amd64.amd64/sys/GENERIC/modules/usr/src/sys/modules/drm2/radeonkmsfw/R600_me: Directory not empty


​        .macro        INTR_HANDLER vec_name
	​        .text
	​        PTI_INTRENTRY        \vec_name
	​        INTR_PUSH_FRAME        \vec_name
	​        .endm


.macro        INTR_PUSH_FRAME vec_name

I generated an installer image (memstick.img) with the latest PTI patch to test a fresh installation of vanilla FreeBSD with the PTI patch applied. Extracting the distsets failed.

Attempting to run with vm.pmap.pti=0 set results in this kernel panic in early boot:

Just providing a test case:

On a 12.0-CURRENT r327750 with the latest (37676) patch applied, I can get random failures by running find driving md5 on, for example, /usr/src:

# while true; do find /usr/src/ -type f -print0 | xargs -0 -P 2 md5 > /dev/null; done
xargs: md5: terminated with signal 11; aborting
md5: /usr/src/sys/arm/include/stdarg.h: Bad file descriptor
md5: /usr/src/.svn/pristine/36/36e5ccf4b4cd53a38f248b91193b0d425bbf069e.svn-base: Bad file descriptor
find: /usr/src/contrib/dialog/samples/programbox2: Bad file descriptor
md5: /usr/src/contrib/compiler-rt/lib/builtins/umoddi3.c: Bad file descriptor
md5: /usr/src/contrib/llvm/lib/Target/SystemZ/ Invalid argument
md5: /usr/src/contrib/atf/atf-c++/detail/Kyuafile: Bad file descriptor
find: /usr/src/share/timedef/sv_SE.UTF-8.src: Bad file descriptor
find: /usr/src/lib/libc/stdio/tmpnam.3: Bad file descriptor
md5: /usr/src/sbin/hastd/hast.h: File name too long

If I run with fewer CPUs (above was with 4; host has 2 cores + HT) it hits errors easier with lower numbers of threads. (With 2 or fewer CPUs, it hits errors fairly quickly with -P 1)... Experienced this first on ZFS, and then reproduced on UFS.

Not providing any fix here, but sometimes a simple reproducer can be useful.

On the non-patched kernel, I can run the same VM with 1-4 cores and -P 0 rock solid (as one would hope); with a load ~ 4-13 depending on # of cores.

Thanks for the hard work on this!

My FreeBSD bhyve VM with the patch applied can boot with vm.pmap.pti set to 0. However, even with it set to 0, I get weird runtime errors. Like this when running make -sj6 buildkernel:

LLVM ERROR: IO failure on output stream.
--- ar9300_eeprom.o ---       
*** [ar9300_eeprom.o] Error code 1

On syscall entry, evacuate saved %rax from pcpu before allowing context switches.
Fill tss_rsp0 on context switch when on non-default tss and pti is enabled.
Correct the invalidated range in pmap_pti_remove_kva().

mmokhi added inline comments.

*style* ( 😅 ): Is it accidentally left here by your editor?

Just providing a test case:

On a 12.0-CURRENT r327750 with the latest (37676) patch applied, I can get random failures by running find driving md5 on, for example, /usr/src:

Just following up -- I'm not getting errors anymore with from this simple test on 37740.


This is intended. Otherwise emacs gets too confused about native as macros.

This passed a buildworld + buildkernel smoketest on my AMD Threadripper system. I didn't pay very close attention to the time to complete. It's somewhat slower compared to an unpatched kernel.


Any particular reason for disabling PCID by default if PTI is enabled?


Because they do not work together. I plan to work on this after the first patch is finalized.


Aha, I see.
Thanks for explanation 🙏

Snapshot today' version of the patch containing assorted fixes.

Most important, the doublefault stack is mapped into PTI userspace.

NMI still does not work. Also apparently MCE is broken.

Correct PTI mapping for GDT.

Fix PTI mappings for ist stacks. NMI, and particularly, pmc should work better now.

kib edited the test plan for this revision. (Show Details)
kib edited the summary of this revision. (Show Details)
kib edited the summary of this revision. (Show Details)

Do you plan to backport the fix for 11-STABLE too? If so, do you plan to MFC all of the vmmeter refactor - which moves vmmeter counters into separate counter_t from PCPU - or with extending the PCPU structure?

In D13797#291293, @op wrote:

Do you plan to backport the fix for 11-STABLE too? If so, do you plan to MFC all of the vmmeter refactor - which moves vmmeter counters into separate counter_t from PCPU - or with extending the PCPU structure?

The patch will be merged to 11, I do not plan to merge convertion of vmmeter for backport.

This comment brought to you by my HardenedBSD laptop running with your PTI patch and the retpoline patch from llvm. I'm happy to report that everything is working fine, albeit with a noticeable lag in some cases. I have multiple bhyve VMs running in parallel, including Win10. Thank you for fixing that! I did have to rebuild the nvidia-driver port, but that's to be expected. It's probably safe to assume that a good portion of third-party kernel modules will need to be rebuilt.

Got a panic, potentially related to bhyve. I've posted the core txt here:

It happened shortly after shutting down a bhyve VM. Host is HardenedBSD, the hardened/current/retpoline branch in our playground repo.

Fix problems with EPT pmaps.

Reported by: pho

Tinderbox fixes.
Patch passed pho testing.

markj added inline comments.

create_pagetables() references pg_g before we set it here.


Don't we still need this block? What if PTI is disabled and the PCID feature flag is not set?

kib marked 2 inline comments as done.Jan 16 2018, 6:04 PM
kib added inline comments.

You are right. I do not have non-pcid machines for development.

kib marked an inline comment as done.

Fix Mark' findings.


I can't see why we don't use VM_ALLOC_{ZERO,WIRED} here as we do for the "global" PML4 page.


It seems the latest rev no longer needs the CPU 0 nmi and dblfault stack references. In particular, the symbols can remain local to machdep.c.


I believe you need to unwire the page before freeing here and at the other levels.

kib marked 2 inline comments as done.Jan 16 2018, 7:56 PM
kib added inline comments.

ZERO would be a waste since I copy complete pti_pml4 page of pml4e's into the usermode pml4 page.

For WIRED, I tihink it is only important to account for the page use in v_wire_count. There it seems that the decrements were missed at all for all pti pagetable pages.

Try to fix unwire on pti page table page free.
Re-staticise nmi0 and doublefault stacks for BSP.


This line should be deleted.


It seems this change is not needed.

This revision is now accepted and ready to land.Jan 17 2018, 6:32 PM