D25815.id79332.diff

Index: sys/amd64/amd64/pmap.c
===================================================================
--- sys/amd64/amd64/pmap.c
+++ sys/amd64/amd64/pmap.c
@@ -2695,23 +2695,138 @@
#ifdef SMP
/*
- * For SMP, these functions have to use the IPI mechanism for coherence.
+ * The amd64 pmap uses different approaches to TLB invalidation
+ * depending on the kernel configuration, available hardware features,
+ * and known hardware errata. For SMP, immediate invalidations have
+ * to use the IPI mechanism for TLB coherence.
+ *
+ * The configuration option with the highest operational impact is
+ * PTI, which is enabled automatically on affected Intel CPUs. The
+ * relevant hardware features are mainly PCID and, secondarily, the
+ * presence of the INVPCID instruction. PCID usage is quite different
+ * for PTI vs. non-PTI.
*
- * N.B.: Before calling any of the following TLB invalidation functions,
- * the calling processor must ensure that all stores updating a non-
- * kernel page table are globally performed. Otherwise, another
- * processor could cache an old, pre-update entry without being
- * invalidated. This can happen one of two ways: (1) The pmap becomes
- * active on another processor after its pm_active field is checked by
- * one of the following functions but before a store updating the page
- * table is globally performed. (2) The pmap becomes active on another
- * processor before its pm_active field is checked but due to
- * speculative loads one of the following functions stills reads the
- * pmap as inactive on the other processor.
- *
- * The kernel page table is exempt because its pm_active field is
- * immutable. The kernel page table is always active on every
- * processor.
+ * * Kernel Page Table Isolation (PTI or KPTI) is used to mitigate
+ * the Meltdown bug in some Intel CPUs. Under PTI, each user address
+ * space is served by two page tables, user and kernel. The user
+ * page table only maps user space and a kernel trampoline. The
+ * kernel trampoline includes the entirety of the kernel text but
+ * only the kernel data that is needed to switch from user to kernel
+ * mode. The kernel page table maps the user and kernel address
+ * spaces in their entirety. It is identical to the per-process
+ * page table allocated in non-PTI mode.
+ *
+ * Note that the user space part of the kernel page table is used for
+ * copyout(9) and needs to maintain TLB coherency. User page tables
+ * are only used when the CPU is in user mode, and some invalidations
+ * can be postponed until the switch from kernel mode to user mode.
+ *
+ * The presence of a usermode page table for a given pmap is indicated
+ * by a pm_ucr3 value different from PMAP_NO_CR3, in which case it
+ * contains the %cr3 register value for the root of the user mode
+ * page table.
+ *
+ * * The pm_active bitmask indicates which CPUs currently have the
+ * pmap active. A CPU's bit is set on context switch to the pmap, and
+ * cleared on switching off this CPU. For the kernel page table, the
+ * pm_active field is immutable and contains all CPUs. The kernel
+ * page table is always logically active on every processor, but not
+ * necessarily present in hardware, e.g. in PTI mode.
+ *
+ * When requesting invalidation of virtual addresses with the
+ * pmap_invalidate_XXX() functions, the pmap code sends shootdown
+ * IPIs to all CPUs recorded in pm_active. Updates to pm_active are
+ * not synchronized, so reading it is necessarily racy. Shootdown
+ * handlers are prepared to handle the race.
+ *
+ * * PCID is an optional feature of the long mode x86 MMU where TLB
+ * entries are tagged with the 'Process ID' of the address space
+ * they belong to. PCID provides a limited namespace for process
+ * identifiers: 12 bits, 4096 simultaneous IDs total.
+ *
+ * Allocation of a PCID to a pmap is done by an algorithm described
+ * in section 15.12, "Other TLB Consistency Algorithms", of
+ * Vahalia's book "Unix Internals". A PCID cannot be allocated for
+ * the whole pmap lifetime in pmap_pinit() due to the limited
+ * namespace. Instead, a per-CPU, per-pmap PCID is assigned when a
+ * CPU is about to start caching TLB entries from a pmap, i.e. on
+ * the context switch which activates the pmap on the CPU.
+ *
+ * The PCID allocator maintains a per-CPU, per-pmap generation count
+ * pm_gen which is incremented each time a new PCID is allocated.
+ * On invalidation, the generation counters for the pmap are zeroed,
+ * which signals the context switch code that the already allocated
+ * PCID is no longer valid. The effect is a TLB shootdown for the
+ * given CPU/address space, due to the allocation of a new PCID.
+ * Zeroing can be performed remotely.
+ *
+ * * PTI + PCID. The available PCIDs are divided into two sets: PCIDs for
+ * complete (kernel) page tables, and PCIDs for usermode page tables.
+ * The user PCID value is obtained from the kernel PCID value by
+ * setting the highest bit, bit 11, to 1 (0x800 == PMAP_PCID_USER_PT).
+ *
+ * Userspace page tables are activated on return to usermode, by loading
+ * pm_ucr3 into %cr3. If PCPU(ucr3_load_mask) requests clearing bit 63
+ * of the loaded ucr3, this effectively causes total invalidation of
+ * the usermode TLB. If ucr3_load_mask is set, then local invalidations
+ * of individual pages in the user page table are skipped.
+ *
+ * * Local invalidation, all modes. If the requested invalidation is
+ * for a specific address, or is a total invalidation of a pmap that
+ * is currently active, the pmap code explicitly flushes the TLB,
+ * using INVLPG for a single page of the kernel page table, and
+ * INVPCID(INVPCID_CTXGLOB)/invltlb_glob() for all of it.
+ *
+ * If the INVPCID instruction is available, it is used to flush
+ * entries from the kernel page table.
+ *
+ * * mode: PTI disabled, PCID present. The kernel reserves PCID 0 for
+ * its address space; all other 4095 PCIDs are used for usermode
+ * spaces as described above. A context switch allocates a new PCID
+ * if the recorded PCID is zero or the recorded generation does not
+ * match the CPU's generation, effectively flushing the TLB for this
+ * address space.
+ * Total remote invalidation is performed by zeroing pm_gen for all CPUs.
+ * local user page: INVLPG
+ * local kernel page: INVLPG
+ * local user total: INVPCID(CTX)
+ * local kernel total: INVPCID(CTXGLOB) or invltlb_glob()
+ * remote user page inactive: zero pm_gen
+ * remote user page active: zero pm_gen + IPI:INVLPG
+ * remote kernel page: IPI:INVLPG
+ * remote user total inactive: zero pm_gen
+ * remote user total active: zero pm_gen + IPI:(INVPCID(CTX) or
+ * reload %cr3)
+ * remote kernel total: IPI:(INVPCID(CTXGLOB) or invltlb_glob())
+ *
+ * PTI enabled, PCID present.
+ * local user page: INVLPG for kpt, INVPCID(ADDR) or (INVLPG for ucr3)
+ * for upt
+ * local kernel page: INVLPG
+ * local user total: INVPCID(CTX) or reload %cr3 for kpt, clear PCID_SAVE
+ * on loading UCR3 into %cr3 for upt
+ * local kernel total: INVPCID(CTXGLOB) or invltlb_glob()
+ * remote user page inactive: zero pm_gen
+ * remote user page active: zero pm_gen + IPI:(INVLPG for kpt,
+ * INVPCID(ADDR) for upt)
+ * remote kernel page: IPI:INVLPG
+ * remote user total inactive: zero pm_gen
+ * remote user total active: zero pm_gen + IPI:(INVPCID(CTX) for kpt,
+ * clear PCID_SAVE on loading UCR3 into %cr3 for upt)
+ * remote kernel total: IPI:(INVPCID(CTXGLOB) or invltlb_glob())
+ *
+ * No PCID.
+ * local user page: INVLPG
+ * local kernel page: INVLPG
+ * local user total: reload %cr3
+ * local kernel total: invltlb_glob()
+ * remote user page inactive: -
+ * remote user page active: IPI:INVLPG
+ * remote kernel page: IPI:INVLPG
+ * remote user total inactive: -
+ * remote user total active: IPI:(reload %cr3)
+ * remote kernel total: IPI:invltlb_glob()
+ * Since the reload of %cr3 with ucr3 on return to usermode causes
+ * TLB invalidation, no specific action is required for upt.
+ *
+ * EPT. EPT pmaps do not map KVA, all mappings are userspace.
+ * XXX TODO
*/
/*
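
The generation-based PCID scheme described in the comment above can be illustrated with a small user-space model. This is only a sketch, not the code from pmap.c: the names cpu_state, pmap_pcid, pmap_sketch, pcid_activate and pcid_invalidate are hypothetical, NCPU is arbitrary, and the wraparound handling (bumping a per-CPU generation when the 12-bit namespace is exhausted) is an assumption made to keep the example complete.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NCPU       2     /* hypothetical CPU count for the model */
#define PCID_MIN   1     /* PCID 0 is reserved for the kernel (non-PTI mode) */
#define PCID_MAX   4095  /* largest value of a 12-bit PCID */

/* Per-CPU allocator state (hypothetical layout). */
struct cpu_state {
	uint32_t pcid_gen;   /* current generation, never 0 */
	uint32_t pcid_next;  /* next PCID to hand out on this CPU */
};

/* Per-CPU, per-pmap record: the PCID and the generation it belongs to. */
struct pmap_pcid {
	uint32_t pcid;
	uint32_t gen;        /* 0 means "invalidated, allocate a new PCID" */
};

struct pmap_sketch {
	struct pmap_pcid pcids[NCPU];
};

/*
 * Called on the context switch that activates the pmap on this CPU.
 * Returns true if the recorded PCID is still valid, so cached TLB
 * entries may be reused; returns false if a new PCID was allocated,
 * which acts as a flush for this address space on this CPU.
 */
bool
pcid_activate(struct cpu_state *cpu, struct pmap_pcid *pp)
{
	if (pp->gen != 0 && pp->gen == cpu->pcid_gen)
		return (true);
	if (cpu->pcid_next > PCID_MAX) {
		/* Assumption: on exhaustion, start a new generation. */
		if (++cpu->pcid_gen == 0)
			cpu->pcid_gen = 1;
		cpu->pcid_next = PCID_MIN;
	}
	pp->pcid = cpu->pcid_next++;
	pp->gen = cpu->pcid_gen;
	return (false);
}

/* Invalidation for one CPU: zero the generation; may be done remotely. */
void
pcid_invalidate(struct pmap_sketch *pm, int cpu)
{
	assert(cpu >= 0 && cpu < NCPU);
	pm->pcids[cpu].gen = 0;
}
```

In this model, zeroing the generation costs a single store and needs no IPI for a CPU on which the pmap is inactive, which corresponds to the "remote user page inactive: zero pm_gen" rows in the tables above.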

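The PTI + PCID paragraph in the comment describes how the user PCID and the user %cr3 value are derived. The following is a minimal sketch of that arithmetic, assuming that bit 63 of the value loaded into %cr3 means "preserve TLB entries for this PCID" and that the load mask is either all ones or all ones with bit 63 cleared. PMAP_PCID_USER_PT and its value come from the comment; CR3_PCID_SAVE as spelled here, user_pcid and make_ucr3 are illustrative names, not taken from the diff.

```c
#include <assert.h>
#include <stdint.h>

#define PMAP_PCID_USER_PT  UINT64_C(0x800)      /* bit 11, per the comment */
#define CR3_PCID_SAVE      (UINT64_C(1) << 63)  /* assumed: keep TLB entries */

/* Derive the user PCID from the kernel PCID by setting bit 11. */
uint64_t
user_pcid(uint64_t kern_pcid)
{
	/* Kernel PCIDs must come from the lower half of the namespace. */
	assert(kern_pcid < PMAP_PCID_USER_PT);
	return (kern_pcid | PMAP_PCID_USER_PT);
}

/*
 * Compose the value loaded into %cr3 on return to user mode.
 * pt_root_pa is the page-aligned physical address of the user page
 * table root. load_mask models ucr3_load_mask: all ones preserves
 * the user TLB entries, ~CR3_PCID_SAVE clears bit 63 and thereby
 * requests a total invalidation for the user PCID.
 */
uint64_t
make_ucr3(uint64_t pt_root_pa, uint64_t kern_pcid, uint64_t load_mask)
{
	uint64_t cr3;

	cr3 = pt_root_pa | user_pcid(kern_pcid) | CR3_PCID_SAVE;
	return (cr3 & load_mask);
}
```

Deferring the flush to the mask applied at load time is what lets local invalidations of individual user pages be skipped while a full flush is already pending.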