Details

Reviewers

Summary

The implementation of the Kernel Page Table Isolation (KPTI) for amd64, first version. It provides a workaround for the 'meltdown' vulnerability.

The pmap page table is split into kernel-mode table and user-mode table. Kernel-mode table is identical to the non-PTI table, while usermode table is obtained from kernel table by leaving userspace mappings intact, but only leaving the following parts of the kernel mapped:

kernel text (but not modules text)
PCPU
GDT/IDT/task structures
IST stacks for NMI and doublefault handlers.

Kernel switches to user page table before returning to usermode, and restores full kernel page table on the entry. Initial kernel-mode stack for PTI trampoline is allocated in PCPU, it is only 16 qwords. Kernel entry trampoline switches page tables. then the hardware trap frame is copied to the normal kstack, and execution continues.

IST stacks are kept mapped and no trampoline is needed for NMI/doublefault, but of course page table switch is performed.

On return to usermode, the trampoline is used again, iret frame is copied to the trampoline stack, page tables are switched and iretq is executed. The case of iretq faulting due to the invalid usermode context is tricky, since the frame for fault is appended to the trampoline frame. Besides copying the fault frame and original (corrupted) frame to kstack, the fault frame must be patched to make it look as if the fault occured on the kstack, see the comment in doret_iret detection code in trap().

Currently kernel pages which are mapped during trampoline operation are identical for all pmaps. They are registered using pmap_pti_add_kva(). Besides initial registrations done during boot, LDT and non-common TSS segments are registered if user requested their use. In principle, they can be installed into kernel page table per pmap with some work. Similarly, PCPU can be hidden from userspace mapping using trampoline PCPU page, but again I do not see much benefits besides complexity.

PDPE pages are pre-allocated during boot because we need to know pml4 entries which are copied to the top-level paging structure page, in advance on a new pmap creation. I enforce this to avoid iterating over the all existing pmaps if a new PDPE page is needed for PTI kernel mappings. The iteration is a known problematic operation on i386.

The need to flush hidden kernel translations on the switch to user mode make global tables (PG_G) meaningless and even harming, so PG_G use is disabled for PTI case. Our existing use of PCID is incompatible with PTI and is automatically disabled if PTI is enabled. PCID can be forced on only for developer's benefit.

MCE is known to be broken, it requires IST stack to operate completely correctly even for non-PTI case, and absolutely needs dedicated IST stack because MCE delivery while trampoline did not switched from PTI stack is fatal. The fix is pending.

Test Plan

Patch survived pho' stress testing.

Diff Detail

Lint

Lint Skipped

Unit

Tests Skipped

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes

ltning-freebsd_anduin.net added a subscriber: ltning-freebsd_anduin.net.Jan 9 2018, 12:30 AM

Can we get a high level description of these changes so we're not having to infer details from code?

jeff added inline comments.Jan 9 2018, 12:54 AM

sys/amd64/amd64/apic_vector.S
128	Are we not losing this alignment in your new version?
sys/amd64/amd64/pmap.c
945	I would find this whole section more readable with a single 'gflag' variable and a single test for pti; ie; gflag = pti ? 0 : X86_PG_G; pt_p[i] = .... \| gflag;

kaktus added a subscriber: kaktus.Jan 9 2018, 8:49 AM

Joerg.Pulz_frm2.tum.de added a subscriber: Joerg.Pulz_frm2.tum.de.Jan 9 2018, 8:54 AM

egypcio added a subscriber: egypcio.Jan 9 2018, 9:38 AM

pti ? PG_G : 0 -> pg_g
realign both pti and normal interrupt entries
remove remnants of the failed PCID experiment

Do commit the realignment of the interrupt entries.

emaste added a subscriber: emaste.Jan 9 2018, 12:44 PM

I've now reproduced the issues on vanilla FreeBSD:

# make -sj6 buildkernel
make[1]: "/usr/src/Makefile.inc1" line 155: SYSTEM_COMPILER: Determined that CC=cc matches the source tree.  Not bootstrapping a cross-compiler.
                   
--------------------------------------------------------------
>>> Kernel build for GENERIC started on Tue Jan  9 03:02:49 EST 2018
--------------------------------------------------------------
===> GENERIC

--------------------------------------------------------------
>>> stage 1: configuring the kernel                                                                                                                                               
--------------------------------------------------------------
Kernel build directory is /usr/obj/usr/src/amd64.amd64/sys/GENERIC
Don't forget to do ``make cleandepend && make depend''

--------------------------------------------------------------
>>> stage 2.1: cleaning up the object tree
--------------------------------------------------------------
rm: /usr/obj/usr/src/amd64.amd64/sys/GENERIC/modules/usr/src/sys/modules/drm2/radeonkmsfw/R600_me/machine: No child processes
rm: /usr/obj/usr/src/amd64.amd64/sys/GENERIC/modules/usr/src/sys/modules/drm2/radeonkmsfw/R600_me: Directory not empty

emaste added inline comments.Jan 9 2018, 2:25 PM

sys/amd64/amd64/apic_vector.S
128	INTR_HANDLER is .macro INTR_HANDLER vec_name .text PTI_INTRENTRY \vec_name INTR_PUSH_FRAME \vec_name .endm and INTR_PUSH_FRAME is: .macro INTR_PUSH_FRAME vec_name SUPERALIGN_TEXT ...

marc_dutchsecurity.net added a subscriber: marc_dutchsecurity.net.Jan 9 2018, 2:46 PM

eborisch_gmail.com added a subscriber: eborisch_gmail.com.Jan 9 2018, 6:10 PM

I generated an installer image (memstick.img) with the latest PTI patch to test a fresh installation of vanilla FreeBSD with the PTI patch applied. Extracting the distsets failed.

Attempting to run with vm.pmap.pti=0 set results in this kernel panic in early boot: https://photos.app.goo.gl/bm29U8ChZDAmnF0B3

jan.kokemueller_gmail.com added a subscriber: jan.kokemueller_gmail.com.Jan 9 2018, 7:30 PM

kalle.carlbark_freebsd_kcbark.net added a subscriber: kalle.carlbark_freebsd_kcbark.net.Jan 9 2018, 7:54 PM

djskpv_yahoo.com added a subscriber: djskpv_yahoo.com.Jan 9 2018, 11:45 PM

gstim85_gmail.com added a subscriber: gstim85_gmail.com.Jan 10 2018, 12:02 AM

seanc added a subscriber: seanc.Jan 10 2018, 12:25 AM

Just providing a test case:

On a 12.0-CURRENT r327750 with the latest (37676) patch applied, I can get random failures by running find driving md5 on, for example, /usr/src:

# while true; do find /usr/src/ -type f -print0 | xargs -0 -P 2 md5 > /dev/null; done
xargs: md5: terminated with signal 11; aborting
md5: /usr/src/sys/arm/include/stdarg.h: Bad file descriptor
md5: /usr/src/.svn/pristine/36/36e5ccf4b4cd53a38f248b91193b0d425bbf069e.svn-base: Bad file descriptor
find: /usr/src/contrib/dialog/samples/programbox2: Bad file descriptor
md5: /usr/src/contrib/compiler-rt/lib/builtins/umoddi3.c: Bad file descriptor
md5: /usr/src/contrib/llvm/lib/Target/SystemZ/SystemZInstrVector.td: Invalid argument
md5: /usr/src/contrib/atf/atf-c++/detail/Kyuafile: Bad file descriptor
find: /usr/src/share/timedef/sv_SE.UTF-8.src: Bad file descriptor
find: /usr/src/lib/libc/stdio/tmpnam.3: Bad file descriptor
md5: /usr/src/sbin/hastd/hast.h: File name too long

If I run with fewer CPUs (above was with 4; host has 2 cores + HT) it hits errors easier with lower numbers of threads. (With 2 or fewer CPUs, it hits errors fairly quickly with -P 1)... Experienced this first on ZFS, and then reproduced on UFS.

Not providing any fix here, but sometimes a simple reproducer can be useful.

On the non-patched kernel, I can run the same VM with 1-4 cores and -P 0 rock solid (as one would hope); with a load ~ 4-13 depending on # of cores.

Thanks for the hard work on this!

mail_doschoofs.de added a subscriber: mail_doschoofs.de.Jan 10 2018, 4:15 PM

My FreeBSD bhyve VM with the patch applied can boot with vm.pmap.pti set to 0. However, even with it set to 0, I get weird runtime errors. Like this when running make -sj6 buildkernel:

LLVM ERROR: IO failure on output stream.
--- ar9300_eeprom.o ---       
*** [ar9300_eeprom.o] Error code 1

On syscall entry, evacuate saved %rax from pcpu before allowing context switches.
Fill tss_rsp0 on context switch when on non-default tss and pti is enabled.
Correct the invalidated range in pmap_pti_remove_kva().

mmokhi added a subscriber: mmokhi.Jan 10 2018, 10:14 PM

mmokhi added inline comments.

sys/amd64/include/asmacros.h
2	style ( 😅 ): Is it accidentally left here by your editor?

In D13797#289713, @eborisch_gmail.com wrote:

Just providing a test case:

On a 12.0-CURRENT r327750 with the latest (37676) patch applied, I can get random failures by running find driving md5 on, for example, /usr/src:

Just following up -- I'm not getting errors anymore with from this simple test on 37740.

kib added inline comments.Jan 10 2018, 11:06 PM

sys/amd64/include/asmacros.h
2	This is intended. Otherwise emacs gets too confused about native as macros.

This passed a buildworld + buildkernel smoketest on my AMD Threadripper system. I didn't pay very close attention to the time to complete. It's somewhat slower compared to an unpatched kernel.

danilo added a subscriber: danilo.Jan 11 2018, 3:26 AM

xistence_0x58.com added inline comments.Jan 11 2018, 4:27 AM

sys/amd64/amd64/pmap.c
1099	Any particular reason for disabling PCID by default if PTI is enabled?

Re-merge the branch.

kib added inline comments.Jan 11 2018, 1:29 PM

sys/amd64/amd64/pmap.c
1099	Because they do not work together. I plan to work on this after the first patch is finalized.

mmokhi added inline comments.Jan 12 2018, 2:52 PM

sys/amd64/include/asmacros.h
2	Aha, I see. Thanks for explanation 🙏

stevek added a subscriber: stevek.Jan 12 2018, 5:38 PM

apevnev_me.com added a subscriber: apevnev_me.com.Jan 12 2018, 6:09 PM

fsbruva_yahoo.com added a subscriber: fsbruva_yahoo.com.Jan 12 2018, 8:08 PM

Snapshot today' version of the patch containing assorted fixes.

Most important, the doublefault stack is mapped into PTI userspace.

NMI still does not work. Also apparently MCE is broken.

nullius_nym.zone added a subscriber: nullius_nym.zone.Jan 12 2018, 8:55 PM

Correct PTI mapping for GDT.

Fix PTI mappings for ist stacks. NMI, and particularly, pmc should work better now.

Add FF copyright.

kib edited the summary of this revision. (Show Details)Jan 13 2018, 3:30 PM

kib edited the test plan for this revision. (Show Details)

kib edited the summary of this revision. (Show Details)Jan 13 2018, 3:33 PM

kib edited the summary of this revision. (Show Details)

kib edited the summary of this revision. (Show Details)Jan 13 2018, 3:36 PM

kib edited the summary of this revision. (Show Details)

Simplify init.

Do you plan to backport the fix for 11-STABLE too? If so, do you plan to MFC all of the vmmeter refactor - which moves vmmeter counters into separate counter_t from PCPU - or with extending the PCPU structure?

In D13797#291293, @op wrote:

Do you plan to backport the fix for 11-STABLE too? If so, do you plan to MFC all of the vmmeter refactor - which moves vmmeter counters into separate counter_t from PCPU - or with extending the PCPU structure?

The patch will be merged to 11, I do not plan to merge convertion of vmmeter for backport.

lohithbsd_gmail.com added a subscriber: lohithbsd_gmail.com.Jan 13 2018, 8:15 PM

This comment brought to you by my HardenedBSD laptop running with your PTI patch and the retpoline patch from llvm. I'm happy to report that everything is working fine, albeit with a noticeable lag in some cases. I have multiple bhyve VMs running in parallel, including Win10. Thank you for fixing that! I did have to rebuild the nvidia-driver port, but that's to be expected. It's probably safe to assume that a good portion of third-party kernel modules will need to be rebuilt.

Got a panic, potentially related to bhyve. I've posted the core txt here: https://gist.github.com/f9933cd2397217d6acb83fb1ec1f41e7

It happened shortly after shutting down a bhyve VM. Host is HardenedBSD, the hardened/current/retpoline branch in our playground repo.

Fix problems with EPT pmaps.

Reported by: pho

marius.h_lden.org added a subscriber: marius.h_lden.org.Jan 15 2018, 6:40 PM

loos added a subscriber: loos.Jan 15 2018, 10:33 PM

Tinderbox fixes.
Patch passed pho testing.

wollman added a subscriber: wollman.Jan 16 2018, 5:38 AM

markj added a subscriber: markj.Jan 16 2018, 5:13 PM

markj added inline comments.

sys/amd64/amd64/pmap.c
1055	create_pagetables() references pg_g before we set it here.
1092	Don't we still need this block? What if PTI is disabled and the PCID feature flag is not set?

kib marked 2 inline comments as done.Jan 16 2018, 6:04 PM

kib added inline comments.

sys/amd64/amd64/pmap.c
1092	You are right. I do not have non-pcid machines for development.

Fix Mark' findings.

markj added inline comments.Jan 16 2018, 7:01 PM

sys/amd64/amd64/pmap.c
2491	I can't see why we don't use VM_ALLOC_{ZERO,WIRED} here as we do for the "global" PML4 page.
7550	It seems the latest rev no longer needs the CPU 0 nmi and dblfault stack references. In particular, the symbols can remain local to machdep.c.
7684	I believe you need to unwire the page before freeing here and at the other levels.

kib marked 2 inline comments as done.Jan 16 2018, 7:56 PM

kib added inline comments.

sys/amd64/amd64/pmap.c
2491	ZERO would be a waste since I copy complete pti_pml4 page of pml4e's into the usermode pml4 page. For WIRED, I tihink it is only important to account for the page use in v_wire_count. There it seems that the decrements were missed at all for all pti pagetable pages.

Try to fix unwire on pti page table page free.
Re-staticise nmi0 and doublefault stacks for BSP.

markj added inline comments.Jan 16 2018, 10:55 PM

sys/amd64/include/frame.h
30–31	This line should be deleted.
sys/amd64/include/pmap.h
280	It seems this change is not needed.

nomoo_nomoo.ru added a subscriber: nomoo_nomoo.ru.Jan 17 2018, 12:31 PM

Committed in rS328083

kib accepted this revision.Jan 17 2018, 6:32 PM

This revision is now accepted and ready to land.Jan 17 2018, 6:32 PM

kib closed this revision.Jan 17 2018, 6:33 PM

nannangar_zscaler.com added a subscriber: nannangar_zscaler.com.Jan 22 2018, 1:50 PM

ray_truedays.org added a subscriber: ray_truedays.org.Jan 22 2018, 11:12 PM

Follow-up commits in rS328096 rS328116 rS328119 rS328120 rS328128 rS328153 rS328135 rS328157 rS328177 rS328199 rS328202 rS328205 rS328468

PTI for amd64.
ClosedPublic
Actions

Details

Diff Detail

Event Timeline

Revision Contents
Changeset List

Diff 38054

sys/amd64/amd64/apic_vector.S

sys/amd64/amd64/atpic_vector.S

sys/amd64/amd64/cpu_switch.S

sys/amd64/amd64/exception.S

sys/amd64/amd64/genassym.c

sys/amd64/amd64/machdep.c

sys/amd64/amd64/mp_machdep.c

sys/amd64/amd64/pmap.c

sys/amd64/amd64/sys_machdep.c

sys/amd64/amd64/trap.c

sys/amd64/amd64/vm_machdep.c

sys/amd64/ia32/ia32_exception.S

sys/amd64/ia32/ia32_syscall.c

sys/amd64/include/asmacros.h

sys/amd64/include/frame.h

sys/amd64/include/pcpu.h

sys/amd64/include/pmap.h

sys/amd64/include/proc.h

sys/amd64/include/smp.h

sys/amd64/vmm/intel/vmx.c

sys/amd64/vmm/vmm.c

sys/dev/hyperv/vmbus/amd64/vmbus_vector.S

sys/i386/i386/apic_vector.s

sys/i386/i386/atpic_vector.s

sys/i386/i386/exception.s

sys/i386/i386/pmap.c

sys/x86/include/apicvar.h

sys/x86/include/x86_var.h

sys/x86/isa/atpic.c

sys/x86/x86/local_apic.c

PTI for amd64.ClosedPublicActions

Details

Diff Detail

Event Timeline

Revision ContentsChangeset List

Diff 38054

sys/amd64/amd64/apic_vector.S

sys/amd64/amd64/atpic_vector.S

sys/amd64/amd64/cpu_switch.S

sys/amd64/amd64/exception.S

sys/amd64/amd64/genassym.c

sys/amd64/amd64/machdep.c

sys/amd64/amd64/mp_machdep.c

sys/amd64/amd64/pmap.c

sys/amd64/amd64/sys_machdep.c

sys/amd64/amd64/trap.c

sys/amd64/amd64/vm_machdep.c

sys/amd64/ia32/ia32_exception.S

sys/amd64/ia32/ia32_syscall.c

sys/amd64/include/asmacros.h

sys/amd64/include/frame.h

sys/amd64/include/pcpu.h

sys/amd64/include/pmap.h

sys/amd64/include/proc.h

sys/amd64/include/smp.h

sys/amd64/vmm/intel/vmx.c

sys/amd64/vmm/vmm.c

sys/dev/hyperv/vmbus/amd64/vmbus_vector.S

sys/i386/i386/apic_vector.s

sys/i386/i386/atpic_vector.s

sys/i386/i386/exception.s

sys/i386/i386/pmap.c

sys/x86/include/apicvar.h

sys/x86/include/x86_var.h

sys/x86/isa/atpic.c

sys/x86/x86/local_apic.c

PTI for amd64.
ClosedPublic
Actions

Revision Contents
Changeset List