PTI for amd64.
ClosedPublic

Authored by kib on Jan 8 2018, 3:44 PM.

Details

Reviewers
kib
Summary

The implementation of the Kernel Page Table Isolation (KPTI) for amd64, first version. It provides a workaround for the 'meltdown' vulnerability.

The pmap page table is split into kernel-mode table and user-mode table. Kernel-mode table is identical to the non-PTI table, while usermode table is obtained from kernel table by leaving userspace mappings intact, but only leaving the following parts of the kernel mapped:

  • kernel text (but not modules text)
  • PCPU
  • GDT/IDT/task structures
  • IST stacks for NMI and doublefault handlers.

Kernel switches to user page table before returning to usermode, and restores full kernel page table on the entry. Initial kernel-mode stack for PTI trampoline is allocated in PCPU, it is only 16 qwords. Kernel entry trampoline switches page tables. then the hardware trap frame is copied to the normal kstack, and execution continues.

IST stacks are kept mapped and no trampoline is needed for NMI/doublefault, but of course page table switch is performed.

On return to usermode, the trampoline is used again, iret frame is copied to the trampoline stack, page tables are switched and iretq is executed. The case of iretq faulting due to the invalid usermode context is tricky, since the frame for fault is appended to the trampoline frame. Besides copying the fault frame and original (corrupted) frame to kstack, the fault frame must be patched to make it look as if the fault occured on the kstack, see the comment in doret_iret detection code in trap().

Currently kernel pages which are mapped during trampoline operation are identical for all pmaps. They are registered using pmap_pti_add_kva(). Besides initial registrations done during boot, LDT and non-common TSS segments are registered if user requested their use. In principle, they can be installed into kernel page table per pmap with some work. Similarly, PCPU can be hidden from userspace mapping using trampoline PCPU page, but again I do not see much benefits besides complexity.

PDPE pages are pre-allocated during boot because we need to know pml4 entries which are copied to the top-level paging structure page, in advance on a new pmap creation. I enforce this to avoid iterating over the all existing pmaps if a new PDPE page is needed for PTI kernel mappings. The iteration is a known problematic operation on i386.

The need to flush hidden kernel translations on the switch to user mode make global tables (PG_G) meaningless and even harming, so PG_G use is disabled for PTI case. Our existing use of PCID is incompatible with PTI and is automatically disabled if PTI is enabled. PCID can be forced on only for developer's benefit.

MCE is known to be broken, it requires IST stack to operate completely correctly even for non-PTI case, and absolutely needs dedicated IST stack because MCE delivery while trampoline did not switched from PTI stack is fatal. The fix is pending.

Test Plan

Patch survived pho' stress testing.

Diff Detail

Lint
Lint Skipped
Unit
Unit Tests Skipped
There are a very large number of changes, so older changes are hidden. Show Older Changes
jeff added a subscriber: jeff.Jan 9 2018, 12:41 AM

Can we get a high level description of these changes so we're not having to infer details from code?

jeff added inline comments.Jan 9 2018, 12:54 AM
sys/amd64/amd64/apic_vector.S
128

Are we not losing this alignment in your new version?

sys/amd64/amd64/pmap.c
945

I would find this whole section more readable with a single 'gflag' variable and a single test for pti;

ie; gflag = pti ? 0 : X86_PG_G;

pt_p[i] = .... | gflag;

kib updated this revision to Diff 37675.Jan 9 2018, 10:02 AM
kib marked 2 inline comments as done.

pti ? PG_G : 0 -> pg_g
realign both pti and normal interrupt entries
remove remnants of the failed PCID experiment

kib updated this revision to Diff 37676.Jan 9 2018, 10:39 AM

Do commit the realignment of the interrupt entries.

emaste added a subscriber: emaste.Jan 9 2018, 12:44 PM

I've now reproduced the issues on vanilla FreeBSD:

# make -sj6 buildkernel
make[1]: "/usr/src/Makefile.inc1" line 155: SYSTEM_COMPILER: Determined that CC=cc matches the source tree.  Not bootstrapping a cross-compiler.
                   
--------------------------------------------------------------
>>> Kernel build for GENERIC started on Tue Jan  9 03:02:49 EST 2018
--------------------------------------------------------------
===> GENERIC

--------------------------------------------------------------
>>> stage 1: configuring the kernel                                                                                                                                               
--------------------------------------------------------------
Kernel build directory is /usr/obj/usr/src/amd64.amd64/sys/GENERIC
Don't forget to do ``make cleandepend && make depend''

--------------------------------------------------------------
>>> stage 2.1: cleaning up the object tree
--------------------------------------------------------------
rm: /usr/obj/usr/src/amd64.amd64/sys/GENERIC/modules/usr/src/sys/modules/drm2/radeonkmsfw/R600_me/machine: No child processes
rm: /usr/obj/usr/src/amd64.amd64/sys/GENERIC/modules/usr/src/sys/modules/drm2/radeonkmsfw/R600_me: Directory not empty
emaste added inline comments.Jan 9 2018, 2:25 PM
sys/amd64/amd64/apic_vector.S
128

INTR_HANDLER is

​        .macro        INTR_HANDLER vec_name
	​        .text
	​        PTI_INTRENTRY        \vec_name
	​        INTR_PUSH_FRAME        \vec_name
	​        .endm

and INTR_PUSH_FRAME is:

.macro        INTR_PUSH_FRAME vec_name
SUPERALIGN_TEXT
...

I generated an installer image (memstick.img) with the latest PTI patch to test a fresh installation of vanilla FreeBSD with the PTI patch applied. Extracting the distsets failed.

Attempting to run with vm.pmap.pti=0 set results in this kernel panic in early boot: https://photos.app.goo.gl/bm29U8ChZDAmnF0B3

seanc added a subscriber: seanc.Jan 10 2018, 12:25 AM

Just providing a test case:

On a 12.0-CURRENT r327750 with the latest (37676) patch applied, I can get random failures by running find driving md5 on, for example, /usr/src:

# while true; do find /usr/src/ -type f -print0 | xargs -0 -P 2 md5 > /dev/null; done
xargs: md5: terminated with signal 11; aborting
md5: /usr/src/sys/arm/include/stdarg.h: Bad file descriptor
md5: /usr/src/.svn/pristine/36/36e5ccf4b4cd53a38f248b91193b0d425bbf069e.svn-base: Bad file descriptor
find: /usr/src/contrib/dialog/samples/programbox2: Bad file descriptor
md5: /usr/src/contrib/compiler-rt/lib/builtins/umoddi3.c: Bad file descriptor
md5: /usr/src/contrib/llvm/lib/Target/SystemZ/SystemZInstrVector.td: Invalid argument
md5: /usr/src/contrib/atf/atf-c++/detail/Kyuafile: Bad file descriptor
find: /usr/src/share/timedef/sv_SE.UTF-8.src: Bad file descriptor
find: /usr/src/lib/libc/stdio/tmpnam.3: Bad file descriptor
md5: /usr/src/sbin/hastd/hast.h: File name too long

If I run with fewer CPUs (above was with 4; host has 2 cores + HT) it hits errors easier with lower numbers of threads. (With 2 or fewer CPUs, it hits errors fairly quickly with -P 1)... Experienced this first on ZFS, and then reproduced on UFS.

Not providing any fix here, but sometimes a simple reproducer can be useful.

On the non-patched kernel, I can run the same VM with 1-4 cores and -P 0 rock solid (as one would hope); with a load ~ 4-13 depending on # of cores.

Thanks for the hard work on this!

My FreeBSD bhyve VM with the patch applied can boot with vm.pmap.pti set to 0. However, even with it set to 0, I get weird runtime errors. Like this when running make -sj6 buildkernel:

LLVM ERROR: IO failure on output stream.
--- ar9300_eeprom.o ---       
*** [ar9300_eeprom.o] Error code 1
kib updated this revision to Diff 37740.Jan 10 2018, 7:28 PM

On syscall entry, evacuate saved %rax from pcpu before allowing context switches.
Fill tss_rsp0 on context switch when on non-default tss and pti is enabled.
Correct the invalidated range in pmap_pti_remove_kva().

mmokhi added a subscriber: mmokhi.Jan 10 2018, 10:14 PM
mmokhi added inline comments.
sys/amd64/include/asmacros.h
2

*style* ( 😅 ): Is it accidentally left here by your editor?

Just providing a test case:

On a 12.0-CURRENT r327750 with the latest (37676) patch applied, I can get random failures by running find driving md5 on, for example, /usr/src:

Just following up -- I'm not getting errors anymore with from this simple test on 37740.

kib added inline comments.Jan 10 2018, 11:06 PM
sys/amd64/include/asmacros.h
2

This is intended. Otherwise emacs gets too confused about native as macros.

This passed a buildworld + buildkernel smoketest on my AMD Threadripper system. I didn't pay very close attention to the time to complete. It's somewhat slower compared to an unpatched kernel.

danilo added a subscriber: danilo.Jan 11 2018, 3:26 AM
sys/amd64/amd64/pmap.c
1099

Any particular reason for disabling PCID by default if PTI is enabled?

kib updated this revision to Diff 37794.Jan 11 2018, 1:25 PM

Re-merge the branch.

kib added inline comments.Jan 11 2018, 1:29 PM
sys/amd64/amd64/pmap.c
1099

Because they do not work together. I plan to work on this after the first patch is finalized.

mmokhi added inline comments.Jan 12 2018, 2:52 PM
sys/amd64/include/asmacros.h
2

Aha, I see.
Thanks for explanation 🙏

stevek added a subscriber: stevek.Jan 12 2018, 5:38 PM
kib updated this revision to Diff 37885.Jan 12 2018, 8:09 PM

Snapshot today' version of the patch containing assorted fixes.

Most important, the doublefault stack is mapped into PTI userspace.

NMI still does not work. Also apparently MCE is broken.

kib updated this revision to Diff 37890.Jan 12 2018, 10:17 PM

Correct PTI mapping for GDT.

kib updated this revision to Diff 37912.Jan 13 2018, 12:27 PM

Fix PTI mappings for ist stacks. NMI, and particularly, pmc should work better now.

kib updated this revision to Diff 37913.Jan 13 2018, 12:58 PM

Add FF copyright.

kib edited the summary of this revision. (Show Details)Jan 13 2018, 3:30 PM
kib edited the test plan for this revision. (Show Details)
kib edited the summary of this revision. (Show Details)Jan 13 2018, 3:33 PM
kib edited the summary of this revision. (Show Details)
kib edited the summary of this revision. (Show Details)Jan 13 2018, 3:36 PM
kib edited the summary of this revision. (Show Details)
kib updated this revision to Diff 37914.Jan 13 2018, 3:53 PM

Simplify init.

op added a comment.Jan 13 2018, 4:06 PM

Do you plan to backport the fix for 11-STABLE too? If so, do you plan to MFC all of the vmmeter refactor - which moves vmmeter counters into separate counter_t from PCPU - or with extending the PCPU structure?

kib added a comment.Jan 13 2018, 4:12 PM
In D13797#291293, @op wrote:

Do you plan to backport the fix for 11-STABLE too? If so, do you plan to MFC all of the vmmeter refactor - which moves vmmeter counters into separate counter_t from PCPU - or with extending the PCPU structure?

The patch will be merged to 11, I do not plan to merge convertion of vmmeter for backport.

This comment brought to you by my HardenedBSD laptop running with your PTI patch and the retpoline patch from llvm. I'm happy to report that everything is working fine, albeit with a noticeable lag in some cases. I have multiple bhyve VMs running in parallel, including Win10. Thank you for fixing that! I did have to rebuild the nvidia-driver port, but that's to be expected. It's probably safe to assume that a good portion of third-party kernel modules will need to be rebuilt.

Got a panic, potentially related to bhyve. I've posted the core txt here: https://gist.github.com/f9933cd2397217d6acb83fb1ec1f41e7

It happened shortly after shutting down a bhyve VM. Host is HardenedBSD, the hardened/current/retpoline branch in our playground repo.

kib updated this revision to Diff 37935.Jan 14 2018, 1:28 PM

Fix problems with EPT pmaps.

Reported by: pho

loos added a subscriber: loos.Jan 15 2018, 10:33 PM
kib updated this revision to Diff 38004.Jan 15 2018, 10:35 PM

Tinderbox fixes.
Patch passed pho testing.

markj added a subscriber: markj.Jan 16 2018, 5:13 PM
markj added inline comments.
sys/amd64/amd64/pmap.c
1055

create_pagetables() references pg_g before we set it here.

1092

Don't we still need this block? What if PTI is disabled and the PCID feature flag is not set?

kib marked 2 inline comments as done.Jan 16 2018, 6:04 PM
kib added inline comments.
sys/amd64/amd64/pmap.c
1092

You are right. I do not have non-pcid machines for development.

kib updated this revision to Diff 38044.Jan 16 2018, 6:05 PM
kib marked an inline comment as done.

Fix Mark' findings.

markj added inline comments.Jan 16 2018, 7:01 PM
sys/amd64/amd64/pmap.c
2491

I can't see why we don't use VM_ALLOC_{ZERO,WIRED} here as we do for the "global" PML4 page.

7550

It seems the latest rev no longer needs the CPU 0 nmi and dblfault stack references. In particular, the symbols can remain local to machdep.c.

7684

I believe you need to unwire the page before freeing here and at the other levels.

kib marked 2 inline comments as done.Jan 16 2018, 7:56 PM
kib added inline comments.
sys/amd64/amd64/pmap.c
2491

ZERO would be a waste since I copy complete pti_pml4 page of pml4e's into the usermode pml4 page.

For WIRED, I tihink it is only important to account for the page use in v_wire_count. There it seems that the decrements were missed at all for all pti pagetable pages.

kib updated this revision to Diff 38054.Jan 16 2018, 7:57 PM

Try to fix unwire on pti page table page free.
Re-staticise nmi0 and doublefault stacks for BSP.

markj added inline comments.Jan 16 2018, 10:55 PM
sys/amd64/include/frame.h
30–31

This line should be deleted.

sys/amd64/include/pmap.h
280

It seems this change is not needed.

kib accepted this revision.Jan 17 2018, 6:32 PM
This revision is now accepted and ready to land.Jan 17 2018, 6:32 PM
kib closed this revision.Jan 17 2018, 6:33 PM