
Make pmap_invalidate_ept() wait synchronously for guest exits
ClosedPublic

Authored by markj on Oct 22 2020, 7:40 PM.

Details

Summary

Currently, EPT TLB invalidation is done by incrementing a generation
counter and issuing an IPI to all CPUs currently running vCPU threads.
The VMM inner loop caches the most recently observed generation on each
host CPU and invalidates TLB entries before entering the guest if the
cached generation number is stale. pmap_invalidate_ept() issues IPIs to
force each vCPU to stop executing guest instructions and reload the
generation number. However, it does not actually wait for vCPUs to
exit, leaving a window in which guests may continue to use stale TLB
entries. This change attempts to close that window using SMR.
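
As a rough sketch of this scheme (the function and variable names here
are approximations of the pmap.c/vmx.c code, not lifted from it, and
the hypervisor's IPI vector is simply passed in):

/* Kernel context: sys/smp.h, machine/pmap.h, machine/smp.h. */

/* vCPU side: check the cached generation before each VM entry. */
static void
vcpu_check_eptgen(pmap_t pmap, long *cached_eptgen)
{
        long eptgen;

        eptgen = pmap->pm_eptgen;               /* plain load today */
        if (*cached_eptgen != eptgen) {
                /* Flush guest-physical TLB entries for this pmap. */
                ept_invalidate(pmap);           /* hypothetical helper */
                *cached_eptgen = eptgen;
        }
        /* ... VM entry follows ... */
}

/* Invalidation side: bump the generation and kick active vCPUs. */
static void
pmap_invalidate_ept_today(pmap_t pmap, u_int ipinum)
{
        atomic_add_long(&pmap->pm_eptgen, 1);
        /*
         * IPI the host CPUs currently running vCPU threads.  Nothing
         * waits for them to actually leave guest mode, which is the
         * window described above.
         */
        ipi_selected(pmap->pm_active, ipinum);
}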

The idea is to bracket guest execution with an SMR read section, which
is entered before loading the generation counter. pmap_invalidate_ept()
then increments the current write sequence before loading pm_active and
sending IPIs, and polls readers to ensure that all vCPUs potentially
operating with stale TLB entries have exited before it returns.
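
A sketch of that ordering, assuming the SMR state hangs off the pmap
(the pm_eptsmr field name and other details are illustrative rather
than quoted from the diff):

/* Kernel context: sys/smr.h, sys/smp.h, machine/pmap.h. */

/* vCPU side: guest execution is bracketed by an SMR read section. */
static void
vcpu_run_guest(pmap_t pmap, long *cached_eptgen)
{
        long eptgen;

        smr_enter(pmap->pm_eptsmr);
        eptgen = atomic_load_long(&pmap->pm_eptgen);
        if (*cached_eptgen != eptgen) {
                ept_invalidate(pmap);           /* hypothetical helper */
                *cached_eptgen = eptgen;
        }
        /* ... VM entry, guest execution, VM exit ... */
        smr_exit(pmap->pm_eptsmr);
}

/* Invalidation side: advance the write sequence, IPI, wait for readers. */
static void
pmap_invalidate_ept_sketch(pmap_t pmap, u_int ipinum)
{
        smr_seq_t goal;

        atomic_add_long(&pmap->pm_eptgen, 1);
        goal = smr_advance(pmap->pm_eptsmr);
        ipi_selected(pmap->pm_active, ipinum);
        /* Block until every reader that entered before goal has exited. */
        smr_wait(pmap->pm_eptsmr, goal);
}

The IPI only forces a prompt exit; the synchronization comes from
smr_wait() observing that every read section which could have seen the
old generation has ended.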

I left the erratum383 workaround as-is; it also increments pm_eptgen
but uses an SMP rendezvous to ensure that guest CPUs exit before
returning. The workaround is relevant only for fairly old hardware, and
I have no means to test changes to it.

I also annotated some loads of pm_eptgen with atomic_load_long();
otherwise I'm not sure the code is safe in the face of compiler
reordering.
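
For example, the shape of such an annotation (illustrative, not an
exact hunk from the diff):

        /* Before: an ordinary load the compiler may reorder or reload. */
        eptgen = pmap->pm_eptgen;

        /* After: a single volatile load at a well-defined point. */
        eptgen = atomic_load_long(&pmap->pm_eptgen);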

Another, simpler solution would be to replace the ipi_selected() call
in pmap_invalidate_ept() with an SMP rendezvous. However,
smp_rendezvous_cpus() doesn't scale particularly well today: it is
serialized by a spin mutex (so interrupts are disabled while waiting
for acknowledgements), and all CPUs must increment a global variable in
the IPI handler. That could be fixed, but using SMR also lets the
initiator return earlier, since it only needs to wait until the vCPUs
have exited. Finally, having a general mechanism to wait for vCPUs to
exit (with or without an IPI) may be useful in other contexts.
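
For comparison, that simpler alternative would look roughly like the
following (a sketch only, not proposed code):

static void
pmap_invalidate_ept_rendezvous(pmap_t pmap)
{
        atomic_add_long(&pmap->pm_eptgen, 1);
        /*
         * Interrupt every CPU in pm_active and wait until each has run
         * the (empty) action handler, i.e. has left guest mode.
         */
        smp_rendezvous_cpus(pmap->pm_active, NULL, NULL, NULL, NULL);
}

Here the initiator cannot return until every targeted CPU has passed
through the handler, whereas the SMR variant only has to wait for CPUs
that were actually inside a read section.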


Event Timeline

markj requested review of this revision. Oct 22 2020, 7:40 PM
sys/amd64/amd64/pmap.c
2754–2756

I'm not sure why _acq is used here. The increment is atomic, so other instructions cannot be reordered with it.

As a side note, I think some of our atomic_*_acq_* atomics are incorrect on at least arm64. For example, atomic_add_acq_long() without LSE results in an exclusive load-acquire and exclusive store pair, and nothing disallows reordering of subsequent loads with the store, contrary to the expectations of some consumers. The atomic-ordered-before property described in the ARM ARM only appears to prevent reordering of a subsequent load-acquire with a preceding exclusive store.
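
For reference, a sketch of the codegen being described (approximate;
register choices are illustrative and worth checking against the real
atomic(9) expansion):

/*
 * Without LSE, atomic_add_acq_long(p, v) expands to an exclusive
 * load/store loop of roughly this shape:
 *
 *  1:  ldaxr   x8, [x0]        // load-acquire exclusive of *p
 *      add     x8, x8, x1
 *      stxr    w9, x8, [x0]    // plain store-exclusive, no release
 *      cbnz    w9, 1b
 *
 * The acquire applies only to the ldaxr, so a later load need not be
 * ordered after the stxr.
 */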

sys/amd64/vmm/amd/svm.c
2087

Why is this using _ACQ? I can't see a reason. VMX uses plain CPU_SET_ATOMIC, though up until recently it was implemented in assembly.

This revision is now accepted and ready to land. Oct 24 2020, 9:32 AM

Would it be possible to cache the smr_seq_t instead of pmap->pm_eptgen? One less atomic to increment.

sys/amd64/amd64/pmap.c
2754–2756

I think it was just paranoia.

sys/amd64/vmm/amd/svm.c
2087

Same paranoia.

Would it be possible to cache the smr_seq_t instead of pmap->pm_eptgen? One less atomic to increment.

I thought about it a bit but wanted to keep the initial change simple since there are other places where we increment the eptgen. One problem with using the sequence number alone is that it's 32 bits wide. smr_advance() will block if it gets close to wraparound, but that only checks active readers.

sys/amd64/amd64/pmap.c
2754–2756

Regarding the side note, can you point to an example or two through email? I suspect that the real issue is that these examples are using the wrong fencing. I've seen people express a desire for a store acquire-like operation, but that is an utterly bogus notion under the release consistency model.

Drop uses of _acq in updates of pm_eptgen and pm_active.

This revision now requires review to proceed. Oct 30 2020, 5:55 PM
This revision is now accepted and ready to land. Nov 2 2020, 7:57 PM