Index: sys/amd64/amd64/pmap.c
===================================================================
--- sys/amd64/amd64/pmap.c
+++ sys/amd64/amd64/pmap.c
@@ -2695,23 +2695,138 @@
 
 #ifdef SMP
 /*
- * For SMP, these functions have to use the IPI mechanism for coherence.
+ * The amd64 pmap uses different approaches to TLB invalidation
+ * depending on the kernel configuration, available hardware features,
+ * and known hardware errata. For SMP, immediate invalidations have
+ * to use the IPI mechanism for TLB coherence.
+ *
+ * The configuration option with the highest operational impact is
+ * PTI, which is enabled automatically on affected Intel CPUs. The
+ * most important hardware features are PCID and, after that, the
+ * presence of the INVPCID instruction. PCID usage is quite different
+ * for PTI vs. non-PTI.
 *
- * N.B.: Before calling any of the following TLB invalidation functions,
- * the calling processor must ensure that all stores updating a non-
- * kernel page table are globally performed. Otherwise, another
- * processor could cache an old, pre-update entry without being
- * invalidated. This can happen one of two ways: (1) The pmap becomes
- * active on another processor after its pm_active field is checked by
- * one of the following functions but before a store updating the page
- * table is globally performed. (2) The pmap becomes active on another
- * processor before its pm_active field is checked but due to
- * speculative loads one of the following functions stills reads the
- * pmap as inactive on the other processor.
- *
- * The kernel page table is exempt because its pm_active field is
- * immutable. The kernel page table is always active on every
- * processor.
+ * * Kernel Page Table Isolation (PTI or KPTI) is used to mitigate the
+ * Meltdown bug in some Intel CPUs. Under PTI, each user address
+ * space is served by two page tables, user and kernel. The user
+ * page table only maps user space and a kernel trampoline. The
+ * kernel trampoline includes the entirety of the kernel text but
+ * only the kernel data that is needed to switch from user to kernel
+ * mode. The kernel page table maps the user and kernel address
+ * spaces in their entirety. It is identical to the per-process
+ * page table allocated in non-PTI mode.
+ *
+ * Note that the user space part of the kernel page table is used for
+ * copyout(9) and needs to maintain TLB coherence. User page tables
+ * are only used while the CPU is in user mode, so some invalidations
+ * can be postponed until the switch from kernel mode to user mode.
+ *
+ * The presence of a usermode page table for a given pmap is indicated
+ * by a pm_ucr3 value different from PMAP_NO_CR3, in which case
+ * pm_ucr3 contains the %cr3 register value for the user mode page
+ * table root.
+ *
+ * * The pm_active bitmask indicates which CPUs currently have the
+ * pmap active: the bit is set on context switch to the pmap and
+ * cleared when the CPU switches away from it. For the kernel page
+ * table, pm_active is immutable and contains all CPUs. The kernel
+ * page table is always logically active on every processor, but not
+ * necessarily in use by the hardware, e.g. in PTI mode.
+ *
+ * When invalidation of virtual addresses is requested with the
+ * pmap_invalidate_XXX() functions, the pmap sends shootdown IPIs to
+ * all CPUs recorded in pm_active. Updates of pm_active are not
+ * synchronized, so reading it is necessarily racy; the shootdown
+ * handlers are prepared to handle the race.
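+ *
+ * As a simplified illustration only, not the actual
+ * pmap_invalidate_page() implementation (declarations are omitted
+ * and the shootdown helper ipi_invlpg() is a hypothetical name for
+ * the IPI path), the local/remote decision for a single-page
+ * invalidation without PCID is roughly:
+ *
+ *     if (pmap == kernel_pmap ||
+ *         CPU_ISSET(PCPU_GET(cpuid), &pmap->pm_active))
+ *             invlpg(va);
+ *     other_cpus = pmap->pm_active;
+ *     CPU_CLR(PCPU_GET(cpuid), &other_cpus);
+ *     if (!CPU_EMPTY(&other_cpus))
+ *             ipi_invlpg(other_cpus, va);
+ *
+ * The snapshot of pm_active taken here is the racy read mentioned
+ * above; CPUs that became active after the snapshot re-check state
+ * in the shootdown handlers.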
+ *
+ * * PCID is an optional feature of the long mode x86 MMU where TLB
+ * entries are tagged with the 'Process ID' of the address space
+ * they belong to. PCID provides a limited namespace for process
+ * identifiers, 12 bits, i.e. 4096 simultaneous IDs total.
+ *
+ * Allocation of a PCID to a pmap is done by an algorithm described
+ * in Vahalia's book "Unix Internals", section 15.12 "Other TLB
+ * Consistency Algorithms". A PCID cannot be allocated for the whole
+ * lifetime of a pmap in pmap_pinit() due to the limited namespace.
+ * Instead, a per-CPU, per-pmap PCID is assigned when a CPU is about
+ * to start caching TLB entries for the pmap, i.e. on the context
+ * switch that activates the pmap on the CPU.
+ *
+ * The PCID allocator maintains a per-CPU, per-pmap generation count,
+ * pm_gen, which is incremented each time a new PCID is allocated.
+ * On invalidation, the generation count for the pmap is zeroed,
+ * which signals the context switch code that the already allocated
+ * PCID is no longer valid. The effect is a TLB shootdown for the
+ * given CPU/address space, achieved through the allocation of a new
+ * PCID. The zeroing can be performed remotely; the resulting
+ * generation check at context switch is sketched below, after the
+ * first mode summary.
+ *
+ * * PTI + PCID. The available PCIDs are divided into two sets: PCIDs
+ * for complete (kernel) page tables, and PCIDs for usermode page
+ * tables. The user PCID value is obtained from the kernel PCID value
+ * by setting the highest bit, bit 11, to 1 (0x800 == PMAP_PCID_USER_PT).
+ *
+ * User page tables are activated on return to usermode, by loading
+ * pm_ucr3 into %cr3. If PCPU(ucr3_load_mask) requests clearing of
+ * bit 63 of the loaded ucr3, the load effectively causes a total
+ * invalidation of the usermode TLB. In that case, since the flush is
+ * already pending, local invalidations of individual pages in the
+ * user page table are skipped.
+ *
+ * * Local invalidation, all modes. When invalidation of a specific
+ * address or total invalidation is requested for a pmap that is
+ * currently active, the pmap flushes the TLB explicitly: INVLPG is
+ * used for an individual kernel page, and INVPCID(INVPCID_CTXGLOB)/
+ * invltlb_glob() for a total invalidation of the kernel page table.
+ *
+ * If the INVPCID instruction is available, it is used to flush
+ * entries from the kernel page table.
+ *
+ * * Mode: PTI disabled, PCID present. The kernel reserves PCID 0 for
+ * its address space; all other 4095 PCIDs are used for usermode
+ * address spaces as described above. The context switch allocates a
+ * new PCID if the recorded PCID is zero or the recorded generation
+ * does not match the CPU's generation, effectively flushing the TLB
+ * for this address space.
+ * Total remote invalidation is performed by zeroing pm_gen for all CPUs.
+ * local user page: INVLPG
+ * local kernel page: INVLPG
+ * local user total: INVPCID(CTX)
+ * local kernel total: INVPCID(CTXGLOB) or invltlb_glob()
+ * remote user page inactive: zero pm_gen
+ * remote user page active: zero pm_gen + IPI:INVLPG
+ * remote kernel page: IPI:INVLPG
+ * remote user total inactive: zero pm_gen
+ * remote user total active: zero pm_gen + IPI:(INVPCID(CTX) or
+ * reload %cr3)
+ * remote kernel total: IPI:(INVPCID(CTXGLOB) or invltlb_glob())
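+ *
+ * As a simplified illustration only, not the actual context switch
+ * code, the pm_gen check described above looks roughly as follows.
+ * The per-CPU, per-pmap PCID state is abbreviated here as
+ * pm_pcids[cpuid] with pm_pcid/pm_gen members, alloc_pcid() stands
+ * in for the real per-CPU PCID allocator, and declarations are
+ * omitted:
+ *
+ *     cpuid = PCPU_GET(cpuid);
+ *     gen = PCPU_GET(pcid_gen);
+ *     if (pmap->pm_pcids[cpuid].pm_gen == gen) {
+ *             cr3 = pmap->pm_cr3 | pmap->pm_pcids[cpuid].pm_pcid |
+ *                 CR3_PCID_SAVE;
+ *     } else {
+ *             pmap->pm_pcids[cpuid].pm_pcid = alloc_pcid(cpuid);
+ *             pmap->pm_pcids[cpuid].pm_gen = gen;
+ *             cr3 = pmap->pm_cr3 | pmap->pm_pcids[cpuid].pm_pcid;
+ *     }
+ *     load_cr3(cr3);
+ *
+ * Omitting CR3_PCID_SAVE in the else branch makes the %cr3 load flush
+ * the TLB entries tagged with the newly allocated PCID, which is why
+ * remotely zeroing pm_gen is sufficient to defer the invalidation to
+ * the next activation of the pmap on that CPU.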
+ *
+ * * Mode: PTI enabled, PCID present.
+ * local user page: INVLPG for kpt, INVPCID(ADDR) or (INVLPG for ucr3)
+ * for upt
+ * local kernel page: INVLPG
+ * local user total: INVPCID(CTX) or reload %cr3 for kpt, clear PCID_SAVE
+ * on loading UCR3 into %cr3 for upt
+ * local kernel total: INVPCID(CTXGLOB) or invltlb_glob()
+ * remote user page inactive: zero pm_gen
+ * remote user page active: zero pm_gen + IPI:(INVLPG for kpt,
+ * INVPCID(ADDR) for upt)
+ * remote kernel page: IPI:INVLPG
+ * remote user total inactive: zero pm_gen
+ * remote user total active: zero pm_gen + IPI:(INVPCID(CTX) for kpt,
+ * clear PCID_SAVE on loading UCR3 into %cr3 for upt)
+ * remote kernel total: IPI:(INVPCID(CTXGLOB) or invltlb_glob())
+ *
+ * * Mode: No PCID.
+ * local user page: INVLPG
+ * local kernel page: INVLPG
+ * local user total: reload %cr3
+ * local kernel total: invltlb_glob()
+ * remote user page inactive: -
+ * remote user page active: IPI:INVLPG
+ * remote kernel page: IPI:INVLPG
+ * remote user total inactive: -
+ * remote user total active: IPI:(reload %cr3)
+ * remote kernel total: IPI:invltlb_glob()
+ * Since the reload of %cr3 with ucr3 on return to usermode causes a
+ * total TLB invalidation, no specific action is required for the upt.
+ *
+ * * Mode: EPT. EPT pmaps do not map KVA; all mappings are userspace.
+ * XXX TODO
  */
 
 /*