Page MenuHomeFreeBSD

Use PCID to optimize PTI.

Authored by kib on Jan 19 2018, 3:04 PM.



Use PCID to avoid complete TLB shootdown when switching between user and kernel mode with PTI enabled.

I use the model close to what I read about KAISER, user-mode PCID has 1:1 correspondence to the kernel-mode PCID, by setting bit 11 in PCID. Full kernel-mode TLB shootdown is performed on context switches, since KVA TLB invalidation only works in the current pmap. On the other hand, on kernel/user switches, CR3_PCID_SAVE bit is set and we do not clear TLB.

I can imagine alternative use of PCID, where there is only one PCID allocated for the kernel pmap. Then, there is no need to shootdown kernel TLB entries on context switch. But copyout(3) would need to either use method similar to proc_rwmem() to access the userspace data, or (in reverse) provide a temporal mapping for the kernel buffer into user mode PCID and use trampoline for copy.

Below is the comparison of the PTI overhead on a haswell machine, which has INVPCID instruction. Patch is not yet tested, I only booted multiuser on haswell (INVPCID) and sandy (PCID but no INVPCID).

./syscall_timing_64 getppid

PCID pti
Clock resolution: 0.000000001
test loop time iterations periteration
getppid 0 1.013932533 2119160 0.000000478
getppid 1 1.018957555 2129562 0.000000478
getppid 2 1.000971159 2091990 0.000000478
getppid 3 1.000953178 2091982 0.000000478
getppid 4 1.020247113 2131976 0.000000478
getppid 5 1.037923614 2169751 0.000000478
getppid 6 1.047713414 2189130 0.000000478
getppid 7 1.038963785 2171784 0.000000478
getppid 8 1.039963943 2173825 0.000000478
getppid 9 1.035961526 2165531 0.000000478

PCID !pti
Clock resolution: 0.000000001
test loop time iterations periteration
getppid 0 1.012187691 2985156 0.000000339
getppid 1 1.023738555 3018773 0.000000339
getppid 2 1.042912187 3075443 0.000000339
getppid 3 1.043958246 3078485 0.000000339
getppid 4 1.043985730 3078480 0.000000339
getppid 5 1.042927727 3075419 0.000000339
getppid 6 1.043948467 3078553 0.000000339
getppid 7 1.035955084 3054875 0.000000339
getppid 8 1.000223259 2949554 0.000000339
getppid 9 1.007950158 2972239 0.000000339

!PCID pti
Clock resolution: 0.000000001
test loop time iterations periteration
getppid 0 1.038952102 1719328 0.000000604
getppid 1 1.024812746 1695689 0.000000604
getppid 2 1.008090182 1667812 0.000000604
getppid 3 1.042990431 1725911 0.000000604
getppid 4 1.043929676 1726714 0.000000604
getppid 5 1.042965631 1725831 0.000000604
getppid 6 1.043961847 1727683 0.000000604
getppid 7 1.042965348 1726560 0.000000604
getppid 8 1.043959939 1727769 0.000000604
getppid 9 1.043970836 1727738 0.000000604

Test Plan

Peter, could you, please, test the patch ?

I did several simple smoke-testing actions, in particular, compiled the openjdk port on debug and production kernel. It is interesting to test this both on machine which reports INVPCID CPU feature, and on machine with PCID but without INVPCID, amd64 only.

Thank you in advance.

Diff Detail

rS FreeBSD src repository - subversion
Automatic diff as part of commit; lint not applicable.
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

This passes smoke test on my Sandy Bridge laptop (X220).

kib added a subscriber: pho.
503 ↗(On Diff #38207)

Perhaps cache the cpuid here and in the other handlers, rather than fetching from pcpu area multiple times.

468 ↗(On Diff #38251)

IMHO it would be clearer to write

if (smp_tlb_pmap == kernel_pmap) {
    d.pcid = 0;
    invpcid(&d, INVPCID_CTXGLOB);
} else {
    d.pcid = smp_tlb_pmap->pm_pcids[PCPU_GET(cpuid)].pm_pcid;
    invpcid(&d, INVPCID_CTX);
    d.pcid |= PMAP_PCID_USER_PT;
    invpcid(&d, INVPCID_CTX);
560 ↗(On Diff #38251)

Seems we are missing the user PCID bit here?

597 ↗(On Diff #38251)

Extra newline.

1598 ↗(On Diff #38251)

Why does this need to be a critical section?

807 ↗(On Diff #38251)

Can't these routines be implemented in C?

kib marked 3 inline comments as done.Jan 20 2018, 8:14 PM
kib added inline comments.
1598 ↗(On Diff #38251)

pm_pcid are recalculated on the context switch. If we are removed from CPU and then rescheduled (pin still allows that) we would end up using incorrect pcid for the operation.

807 ↗(On Diff #38251)

This is quite risky. We are temporary swapping address spaces there, in particular, we enter the AS where the C stack is not mapped. I absolutely need to ensure that the routines do not access anything outside the explicitly allowed addresses while on user page table.

kib marked an inline comment as done.Jan 20 2018, 8:16 PM

Shoot down user page table TLB on context switches. It is done automatically only for the kernel twin.
FIx bug pointed out by Mark where wrong PCID was used for user table shoot down on non-INVPCID machine.
Minor editings.

When reloading CR3 for shoot down, do not set the save bit.

509 ↗(On Diff #38261)

Need the user bit here too.

807 ↗(On Diff #38251)

Could you please add prototypes like in the functions above and in cpu_switch.S?

kib marked 2 inline comments as done.

Add comments to support.S pmap_pti functions.
One more missed USER_PT bit.

Do not call pmap_pti_pcid_invalidate() for pmaps which do not have usermode page tables.

807 ↗(On Diff #38251)

Hm, so isn't this unsafe wrt NMIs?

632 ↗(On Diff #38265)

Missing user bit. :)

635 ↗(On Diff #38265)

Extra newline.

807 ↗(On Diff #38251)

Never mind, sorry. I forgot that the NMI handler reloads from KCR3.

kib marked 2 inline comments as done.Jan 21 2018, 10:32 AM
kib added inline comments.
807 ↗(On Diff #38251)

Yes, there are many more places where kernel is executing with UCR3, and NMI/MCE handlers already handle that.

Mark another finding of missed USER_PT.
Update pcpu kcr3 and ucr3 only after shootdown.
Add some comments.

1692 ↗(On Diff #38273)

Stray blank line.

1758 ↗(On Diff #38273)

Stray blank line.

For PCID+non-INVPCID, only reload kcr3 from pmap in invalidation IPI handler when the pmap is current. Otherwise we end up with invalid translation loaded for usermode part.
Check for valid ucr3 before committing to do the userspace part invalidation.

Diff 38279 survived full make -j 32 tinderbox run on ivy bridge machine (PCID but no INVPCID).

I can't see any problems in this version.

188 ↗(On Diff #38279)

Is there a reason for not having a separate PTI entry point, as you do for invlpg and invlrng?

472 ↗(On Diff #38279)

"achieve it" or "effect it" sound more natural than "reach it" to me.

634 ↗(On Diff #38279)

Logically it feels like pmap_pti_pcid_invl{pg,rng}() should be setting CR3_PCID_SAVE rather than having it be set in the caller. It would also make the non-INVPCID handlers more consistent.

7351 ↗(On Diff #38279)

There's a stray space here.

This revision is now accepted and ready to land.Jan 22 2018, 4:14 PM
472 ↗(On Diff #38279)

"... to clear kernel mappings from the TLB in the ..."

Also, I'm confused. The "if" expression is "!=". Shouldn't it be "==".

kib marked 3 inline comments as done.Jan 22 2018, 5:32 PM
kib added inline comments.
634 ↗(On Diff #38279)

I did not the bit set in assembly to keep the assembler source as small as possible. If you consider this important, I will change the asm.

Note that for the support.S' reader, having CR3_PCID_SAVE set in the caller is more natural (IMHO).

kib marked an inline comment as done.Jan 22 2018, 5:34 PM

Regularize IPI handlers names.
Fix if() condition.
Update comments.

This revision now requires review to proceed.Jan 22 2018, 5:42 PM
142 ↗(On Diff #38319)

I don't think that you need a continuation line here.

475 ↗(On Diff #38319)

I fear that no one (besides us) will understand this comment.

Isn't the issue here that there are now multiple kernel page tables/address spaces, and we want the kernel mapping(s) flushed from all of them?

kib marked an inline comment as done.Jan 22 2018, 6:27 PM
kib added inline comments.
475 ↗(On Diff #38319)

No, the kernel mappings as well as user mappings are always flushed on context switch, even with PCID enabled. I have to do this because kernel mode PCID are per-pmap. So only the currently active TLB partition is to be flushed.

What I am trying to say there is that I need to know PCID of the pmap where to flush the kernel mappings, this is most likely the PCID for the current pmap. But since the function was asked to do this for the kernel pmap, to not reach out and find the corresponding PCID (can it be non-current after all ?), I use the global flush.

634 ↗(On Diff #38279)

I don't think it's important.

This revision was not accepted when it landed; it landed in state Needs Review.Jan 27 2018, 11:50 AM
Closed by commit rS328470: Use PCID to optimize PTI. (authored by kib). · Explain Why
This revision was automatically updated to reflect the committed changes.