Use PCID to avoid a complete TLB shootdown when switching between user and kernel mode with PTI enabled.
The model is close to what I read about KAISER: each user-mode PCID corresponds 1:1 to a kernel-mode PCID and is derived from it by setting bit 11 of the PCID. A full kernel-mode TLB shootdown is performed on context switches, since KVA TLB invalidation only works in the current pmap. On kernel/user switches, on the other hand, the CR3_PCID_SAVE bit is set and we do not clear the TLB.
I can imagine an alternative use of PCID, where only one PCID is allocated for the kernel pmap. Then there is no need to shoot down kernel TLB entries on context switch. But copyout(9) would need either to use a method similar to proc_rwmem() to access the userspace data or, in reverse, to provide a temporary mapping of the kernel buffer in the user-mode PCID and use a trampoline for the copy.
Below is a comparison of the PTI overhead on a Haswell machine, which has the INVPCID instruction. The patch is not yet tested: I have only booted multiuser on Haswell (INVPCID) and Sandy Bridge (PCID but no INVPCID).
./syscall_timing_64 getppid
PCID pti
Clock resolution: 0.000000001
test loop time iterations periteration
getppid 0 1.013932533 2119160 0.000000478
getppid 1 1.018957555 2129562 0.000000478
getppid 2 1.000971159 2091990 0.000000478
getppid 3 1.000953178 2091982 0.000000478
getppid 4 1.020247113 2131976 0.000000478
getppid 5 1.037923614 2169751 0.000000478
getppid 6 1.047713414 2189130 0.000000478
getppid 7 1.038963785 2171784 0.000000478
getppid 8 1.039963943 2173825 0.000000478
getppid 9 1.035961526 2165531 0.000000478
PCID !pti
Clock resolution: 0.000000001
test loop time iterations periteration
getppid 0 1.012187691 2985156 0.000000339
getppid 1 1.023738555 3018773 0.000000339
getppid 2 1.042912187 3075443 0.000000339
getppid 3 1.043958246 3078485 0.000000339
getppid 4 1.043985730 3078480 0.000000339
getppid 5 1.042927727 3075419 0.000000339
getppid 6 1.043948467 3078553 0.000000339
getppid 7 1.035955084 3054875 0.000000339
getppid 8 1.000223259 2949554 0.000000339
getppid 9 1.007950158 2972239 0.000000339
!PCID pti
Clock resolution: 0.000000001
test loop time iterations periteration
getppid 0 1.038952102 1719328 0.000000604
getppid 1 1.024812746 1695689 0.000000604
getppid 2 1.008090182 1667812 0.000000604
getppid 3 1.042990431 1725911 0.000000604
getppid 4 1.043929676 1726714 0.000000604
getppid 5 1.042965631 1725831 0.000000604
getppid 6 1.043961847 1727683 0.000000604
getppid 7 1.042965348 1726560 0.000000604
getppid 8 1.043959939 1727769 0.000000604
getppid 9 1.043970836 1727738 0.000000604