Page MenuHomeFreeBSD

amd64 pmap: microoptimize local shootdowns for PCID PTI configurations
ClosedPublic

Authored by kib on Jun 27 2020, 2:25 AM.

Details

Summary

When pmap operates in PTI mode, we must reload %cr3 on return to userspace. In non-PCID mode the reload always flushes all non-global TLB entries and we take advantage of it by only invalidating the kpt TLB entries (there is no upt entries at all).

In PCID mode, we flush both kpt and upt TLB explicitly, but we can take advantage of the fact that PCID mode command to reload %cr3 includes a flag to flush/not flush target TLB. In particular, we can avoid the flush for upt, instead record that load of pc_ucr3 into %cr3 on return to usermode should be flushing. This is done by providing either all-1s or ~CR3_PCID_MASK in pc_ucr3_load_mask. The mask is automatically reset to all-1s on return to usermode.

Similarly, we can avoid flushing upt TLB on context switch, replacing it by setting pc_ucr3_load_mask. This unifies INVPCID and non-INVPCID PTI ifunc, leaving only 4 cases instead of 6. This trick is also applicable both to the TLB shootdown IPI handlers, since handlers interrupt the target thread.

But then we need to check pc_curpmap in handlers, and this would reopen the same race for INVPCID machines as was fixed in r306350 for non-INVPCID. To not introduce the same bug, unconditionally do spinlock_enter() in pmap_activate().

[This is motivated by the same VMWare TLB shootdown article, but not exactly]

Test Plan

I did very light checks, but I did run jvm on the patched kernel (before, jvms were very sensitive to TLB invalidation bugs).

Diff Detail

Repository
rS FreeBSD src repository
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

kib requested review of this revision.Jun 27 2020, 2:25 AM

Just to make sure that I understand: the idea is to reduce the amount of work done in invalidation handlers, by deferring the flush of upt TLB entries until the interrupted thread is about to switch back to user mode?

sys/amd64/amd64/pmap.c
8937 ↗(On Diff #73734)

Assert that interrupts are disabled in pmap_activate_sw()?

8938 ↗(On Diff #73734)

I do not really understand why spinlock_*() are used instead of MD intr_disable/enable().

Just to make sure that I understand: the idea is to reduce the amount of work done in invalidation handlers, by deferring the flush of upt TLB entries until the interrupted thread is about to switch back to user mode?

I would formulate it differently, that the patch removes unneeded work from both the pmap_invalidateXXX() functions and from the shootdown IPI handlers. The observation is that for PCID+PTI configs, we must reload %cr3 when returning to usermode anyway. This reload can handle the invalidation as well, so there is no need to either reload %cr3 (when INVPCID instruction is not available) or do INVPCID in functions and handlers.

sys/amd64/amd64/pmap.c
8938 ↗(On Diff #73734)

spinlock_enter == interrupt disable + critical enter. Parts of the PCID recalculations on the context switch need just the disable of the context switch, so that we are not preempted in the middle of it. So the code asserts critical sections.

On the other hand, for the reason explained in the comment above, non-context switch use of pmap_activate() must use this trick to prevent handling IPIs during non-consistent state of the PCPU/hw.

In principle just disabling interrupts would serve, but most important case of context switch has both interrupts disabled and critical section around, so I do not want to lower assertions in the code.

I might add interrupt disable asserts.

Add asserts about interrupt enable state.
Collape pmap_activate_sw_pcid_pti1() into the caller.

In D25483#563260, @kib wrote:

Just to make sure that I understand: the idea is to reduce the amount of work done in invalidation handlers, by deferring the flush of upt TLB entries until the interrupted thread is about to switch back to user mode?

I would formulate it differently, that the patch removes unneeded work from both the pmap_invalidateXXX() functions and from the shootdown IPI handlers. The observation is that for PCID+PTI configs, we must reload %cr3 when returning to usermode anyway. This reload can handle the invalidation as well, so there is no need to either reload %cr3 (when INVPCID instruction is not available) or do INVPCID in functions and handlers.

Thanks, I see.

As I understand, for correctness, the return-from-exception code must restore the default ucr3 mask (PMAP_UCR3_NOMASK). If it fails to do so, a subsequent exception+return to usermode will use a stale mask and thus will fail to flush upt TLB entries. Is that right?

As I understand, for correctness, the return-from-exception code must restore the default ucr3 mask (PMAP_UCR3_NOMASK). If it fails to do so, a subsequent exception+return to usermode will use a stale mask and thus will fail to flush upt TLB entries. Is that right?

PMAP_UCR3_NOMASK means that no invalidation is done on reloading UCR3. For correctness, we must never return to usermode without invalidation if mask is not NOMASK, be it exception, interrupt or syscall return. But if we return with invalidation, it does not make sense to not reset the mask, since we invalidated and doing more invalidation without a reason would only spend CPU cycles.

In D25483#563300, @kib wrote:

For correctness, we must never return to usermode without invalidation if mask is not NOMASK, be it exception, interrupt or syscall return.

Ok, this is the point I was trying to clarify. Can we assert this somehow, perhaps upon entry to kernel mode while handling an interrupt?

sys/amd64/amd64/mp_machdep.c
654 ↗(On Diff #73818)

I think we can have ucr3_load_mask != NOMASK only if the thread was interrupted while in kernel mode. Is that correct?

sys/amd64/amd64/pmap.c
8832 ↗(On Diff #73818)

Perhaps write, /* See the comment in pmap_invalidate_page_pcid(). */

This revision is now accepted and ready to land.Jul 6 2020, 5:44 PM
sys/amd64/amd64/pmap.c
2531 ↗(On Diff #73818)

Can you please replace "trivial" by a phrase that describes the intended meaning of "trivial"?

8952 ↗(On Diff #73818)

"most typical" -> "most common"

8953 ↗(On Diff #73818)

-> "a context switch"

kib marked 5 inline comments as done.Jul 6 2020, 7:55 PM
In D25483#563300, @kib wrote:

For correctness, we must never return to usermode without invalidation if mask is not NOMASK, be it exception, interrupt or syscall return.

Ok, this is the point I was trying to clarify. Can we assert this somehow, perhaps upon entry to kernel mode while handling an interrupt?

There is no easy place for such assertion. We have many assembler trampolines (one for each block of 32 interrupts, two for each exception, specific handlers for vectored, timer, and shootdown IPIs) which call into individual C handlers.

The only place where it is relatively straightforward to add such assertion is at syscall entry, where we can assert that mask == NOMASK, since we either came from usermode or were context-switched.

I will think about your proposal more, but right now I do not have a good answer to it.

sys/amd64/amd64/mp_machdep.c
654 ↗(On Diff #73818)

Yes, this should be true.

kib marked an inline comment as done.

Edited comments.

This revision now requires review to proceed.Jul 6 2020, 7:56 PM
kib retitled this revision from am64 pmap: microoptimize local shootdowns for PCID PTI configurations to amd64 pmap: microoptimize local shootdowns for PCID PTI configurations.Jul 6 2020, 7:57 PM
sys/amd64/amd64/pmap.c
2770–2771 ↗(On Diff #74134)

Can't this "if" statement be moved below, outside the enclosing "if" statement? That would eliminate a duplicate "if" statement in the below "else".

8955 ↗(On Diff #74134)

"so the IPI is delayed until after ..."

kib marked 2 inline comments as done.

Simplify pmap_invalidate_all_pcid().
Grammar in the comment.

This revision is now accepted and ready to land.Jul 9 2020, 4:38 PM

I would encourage you to create a followup patch that provides an overview of the various TLB management implementations. I've got time now to help with refining the text if you can generate a first draft.

sys/amd64/amd64/mp_machdep.c
625 ↗(On Diff #74167)

Everywhere else you write this as smp_tlb_pmap == PCPU_GET(curpmap).

628 ↗(On Diff #74167)

The variable "ucr3" no longer serves to simplify this code, as it is only used once. (The below function, invlpg_invpcid_handler(), similarly does without a "ucr3" variable.)

kib marked 2 inline comments as done.Jul 10 2020, 9:31 AM
kib added inline comments.
sys/amd64/amd64/mp_machdep.c
628 ↗(On Diff #74167)

kcr3 is also not too much needed. I kept pcid since otherwith load_cr3() line becomes too unreadable.

kib marked an inline comment as done.

Simplify invltlb_pcid_handler().

This revision now requires review to proceed.Jul 10 2020, 9:34 AM
In D25483#566493, @alc wrote:

I would encourage you to create a followup patch that provides an overview of the various TLB management implementations. I've got time now to help with refining the text if you can generate a first draft.

I will do this.

Currently Peter finished testing D25510, and Mark already reviewed that patch. I will commit that first, then rebase D25483 (this one, the patches extensively conflict), and work on the overview in parallel.

In D25483#566623, @kib wrote:
In D25483#566493, @alc wrote:

I would encourage you to create a followup patch that provides an overview of the various TLB management implementations. I've got time now to help with refining the text if you can generate a first draft.

I will do this.

Currently Peter finished testing D25510, and Mark already reviewed that patch. I will commit that first, then rebase D25483 (this one, the patches extensively conflict), and work on the overview in parallel.

Okay, then I will look at D25510 in the next 24 hours.

kib added a subscriber: pho.

Rebase after IPI commit.

This revision was not accepted when it landed; it landed in state Needs Review.Jul 18 2020, 6:20 PM
This revision was automatically updated to reflect the committed changes.
head/sys/amd64/amd64/exception.S
529

This is unrelated to the original purpose of the change, but I was reading the code, so ...

Why do this store unconditionally? Why not save only after the below "je" and restore immediately after the move to %cr3? Then, you wouldn't need the SCRATCH_RAX store and load after IDTVEC(fast_syscall).

539

Reiterating the above, this can be deleted.

550

And, this can be deleted.

kib marked 3 inline comments as done.Jul 19 2020, 9:55 AM
kib added inline comments.
head/sys/amd64/amd64/exception.S
529