Page MenuHomeFreeBSD

vmm/svm: post LAPIC interrupts using event injection rather than virtual interrupts

Authored by avg on Jan 6 2018, 12:32 AM.



The virtual interrupt method uses V_IRQ, V_INTR_PRIO, and V_INTR_VECTOR fields
of VMCB to inject a virtual interrupt into a guest VM. This method has many
advantages over the direct event injection as it offloads all decisions of
whether and when the interrupt can be delivered to the guest. But with a
purely software emulated vAPIC the advantage is also a problem. The problem is
that the hypervisor does not have any precise control over when the interrupt
is actually delivered to the guest (or a notification about that).
Because of that the hypervisor cannot update the interrupt vector in IRR and
ISR in the same way as real hardware would. The hypervisor becomes aware that
the interrupt is being serviced only upon the first VMEXIT after the interrupt
is delivered. This creates a window between the actual interrupt delivery and
the update of IRR and ISR.
That means that IRR and ISR might not be correctly set up to the point of the
end-of-interrupt signal.

The described deviation has been observed to cause an interrupt loss in
the following scenario.
vCPU0 posts an inter-processor interrupt to vCPU1. The interrupt is injected
as a virtual interrupt by the hypervisor. The interrupt is delivered to
a guest and an interrupt handler is invoked. The handler performs a requested
action and acknowledges the request by modifying a global variable. So far,
there is no VMEXIT and the hypervisor is unaware of the events.
Then, vCPU0 notices the acknowledgment and sends another IPI with the same
vector. The IPI gets collapsed into the previous IPI in the IRR of vCPU1.
Only after that a VMEXIT of vCPU1 occurs. At that time the vector is cleared
in the IRR and is set in the ISR. vCPU1 has vAPIC state as if the second IPI
has never been sent.
I believe that we see this scenario in bug 215972
The scenario is impossible on the real hardware because IRR and ISR are
updated just before the interrupt handler gets started.

I saw two possibilities of fixing the problem. One is to intercept the
virtual interrupt delivery to update IRR and ISR at the right moment.
The other is to deliver the LAPIC interrupts using the event injection,
same as the legacy interrupts.
I opted to use the latter approach for several reasons. It's equivalent
to what VMM/Intel does (in !VMX case). It appears to be what VirtualBox
and KVM do. The code is already there (to support legacy interrupts).

Please see sections 15.20 and 15.21.4 of "AMD64 Architecture Programmer's
Manual Volume 2: System Programming" (publication 24593, revision 3.29)
for comparison between event injection and virtual interrupt injection.

Test Plan

I have a test-case that could easily provoke a deadlock described in
on Phenom II X6 1090T.
After this change the deadlock is no longer reproducible.
I haven't seen any regressions so far.

Diff Detail

rS FreeBSD src repository
Automatic diff as part of commit; lint not applicable.
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

avg created this revision.Jan 6 2018, 12:32 AM
avg edited the summary of this revision. (Show Details)Jan 6 2018, 12:39 AM
avg added a reviewer: tychon.
anish added a comment.Jan 7 2018, 2:34 AM

Nice, could install Windows 10 with multiple vcpus.

anish added inline comments.Jan 7 2018, 4:25 AM
1668–1670 ↗(On Diff #37575)

Enabling AVIC disables V_IRQ in VMCB, section

"Enabling AVIC implicitly disables the V_IRQ, V_INTR_PRIO, V_IGN_TPR, and V_INTR_VECTOR fields in the VMCB Control Word."

In that case lines 1741-1755 are also not required.

grehan added a comment.Jan 8 2018, 7:34 AM

I didn't have success with this on my Ryzen 1700, but it's also possibly buggy :( I'll try with an Opteron tomorrow.

Windows 10 Enterprise still hung during install with 2 or 4 vCPUs, and the short test case, FreeBSD 10.3/amd64 install, still hung.

Here's the FreeBSD test case:

 bhyveload -d ./FreeBSD-10.3-RELEASE-amd64-disc1.iso -m 4G crash
bhyve \
  -c 6 \
  -s 0,hostbridge \
  -s 3,ahci-cd,FreeBSD-10.3-RELEASE-amd64-disc1.iso \
  -s 31,lpc \
  -l com1,stdio \
  -H -A -w -m 4G \

Select the shell prompt, and then issue:

sha256 -q /usr/freebsd-dist/*txz
grehan added a comment.Jan 8 2018, 7:37 AM

Also, I think CR8 write exits are needed to force re-evaluation of the TPR for the software APIC model.

avg added inline comments.Jan 8 2018, 9:52 PM
1668–1670 ↗(On Diff #37575)

Thank you for pointing this out!
I will update the patch to remove the code conditional on avic and pending_apic_vector.

avg added a comment.Jan 8 2018, 9:56 PM

I didn't have success with this on my Ryzen 1700, but it's also possibly buggy :(


However, I should add that on my system with Phenom II X6 1090T your test case reliably triggers the problem without the patch:

# sha256 -q /usr/freebsd-dist/*txz
spin lock 0xffffffff816a6200 (smp rendezvous) held by 0xfffff80004b95000 (tid 100062) too long
panic: spin lock held too long
cpuid = 4
KDB: stack backtrace:
#0 0xffffffff8098e390 at kdb_backtrace+0x60
#1 0xffffffff80951066 at vpanic+0x126
#2 0xffffffff80950f33 at panic+0x43
#3 0xffffffff80937097 at _mtx_lock_spin_cookie+0x287
#4 0xffffffff80d44cd1 at smp_tlb_shootdown+0x91
#5 0xffffffff80d46b7c at pmap_invalidate_range+0x2fc
#6 0xffffffff809df84f at vfs_vmio_release+0x2f
#7 0xffffffff809e02d2 at getnewbuf+0x482
#8 0xffffffff809dd6b1 at getblk+0x571
#9 0xffffffff80babf00 at ffs_sbupdate+0x70
#10 0xffffffff80baf052 at ffs_sync+0x662
#11 0xffffffff809fb426 at sync_fsync+0x136
#12 0xffffffff80e81ac7 at VOP_FSYNC_APV+0xa7
#13 0xffffffff809fbe1b at sched_sync+0x3ab
#14 0xffffffff8091a4ea at fork_exit+0x9a
#15 0xffffffff80d3be0e at fork_trampoline+0xe
Uptime: 1m19s

While with the patch I haven't been reproduce it in a dozen runs.

Good to know that it fixes the issue on the Phenom. My Ryzen 1700 is pre-July17 so can't necessarily be trusted.

Anish - did you have success with this on your Ryzen ??

avg added a comment.Jan 8 2018, 11:13 PM

Also, I think CR8 write exits are needed to force re-evaluation of the TPR for the software APIC model.

Not sure...
Guest updates of cr8 are reflected in V_TPR. Upon #VMEXIT the value from V_TPR is propagated to the software LAPIC (svm_update_virqinfo -> vlapic_set_cr8).
Also, writes to (software) LAPIC TPR are always emulated and there is code in svm_inj_interrupts to update V_TPR based on LAPIC TPR.

But I could be wrong as my knowledge of the relevant code and specification is very fresh and shallow.

anish added a comment.Jan 9 2018, 2:59 AM

Yes, I tested Win10 on my Ryzen box with 12 vcpus. With the patch I could install in < 5 mins.

bhyve -c 12 -m 8G -H -w \
  -s 0,hostbridge \
  -s 30,xhci,tablet \
  -s 3,ahci-cd,/usr/home/anish/Win10_1703_English_x64.iso \
  -s 4,ahci-hd,./os.img \
  -s 5,virtio-net,tap0 \
  -s 29,fbuf,tcp=,wait \
  -s 31,lpc \
  -l com1,stdio \
  -l bootrom,/usr/local/share/uefi-firmware/BHYVE_UEFI.fd \
avg updated this revision to Diff 37678.Jan 9 2018, 1:49 PM
  • remove references to AVIC as it disables virtual interrupt injection, so they cannot be used together, actually
  • remove more code that was useful only for virtual interrupt injection
avg marked 2 inline comments as done.Jan 9 2018, 1:50 PM
avg edited the summary of this revision. (Show Details)Jan 9 2018, 1:56 PM
avg edited the test plan for this revision. (Show Details)
emaste added a subscriber: emaste.Jan 9 2018, 2:26 PM

CR8 write exits are required or the same situation will occur as with VIRQ - the update written to the V_TPR needs to be acted on immediately by the APIC emulation code since an interrupt that was blocked may need to be injected.

FreeBSD and Linux don't use CR8, but Windows uses it heavily. Here's a sample from a dual vCPU Win10 guest running on an Intel Atom (note that is also why apicv is such a perf gain with VT-x - not available on that model Atom).

# dtrace -n 'fbt::vlapic_*_cr8:entry { @a[probefunc] = count(); } tick-10sec { exit(0); }'
dtrace: description 'fbt::vlapic_*_cr8:entry ' matched 3 probes
CPU     ID                    FUNCTION:NAME
  0  68190                      :tick-10sec 

  vlapic_get_cr8                                               698820
  vlapic_set_cr8                                               849133

Ideally the CR8 read would be able to use V_TPR, but I don't think it's possible to do that with SVM. I'll try and get a patch together to force CR8 exits.

avg added a comment.Jan 10 2018, 9:53 AM

Oh, I didn't think about that.
I think that that can cause an extra interrupt latency but not an interrupt loss, but not sure.

avg added a comment.Jan 10 2018, 6:09 PM

Please see D13828 as well.

anish added inline comments.Jan 13 2018, 8:06 PM
928–930 ↗(On Diff #37678)

This doesn't look correct, Virtual interrupt is still used for interrupt window exiting, see enable_intr_window_exiting().

What has changed is now the lAPIC interrupts are being injected using event injection rather than virtual interrupt.

avg added inline comments.Jan 14 2018, 11:06 AM
928–930 ↗(On Diff #37678)

The interrupt window code sets only v_irq and leaves v_intr_vector as zero.

Some more info on how timing-sensitive this is: the 10.3 install Ryzen insta-repro doesn't happen with a 4 vCPU guest on a 1.3Gzh Sempron 3850 APU, nor with 4/6/8 vCPU guests on an Opteron 6320 :( Also, Win10 Pro 2 vCPU guests install fine on both systems, where the Ryzen shows a lockup very quickly.

Dropping kern.hz to 50 on the Sempron seemed to show the problem a bit more: the 2nd phase of a Win10 install wasn't able to complete. With the diff, the install was able to complete. Subjectively things seemed quicker, particularly the disk-writing part of the install.

About to put in CR8 exiting to see how that goes.

anish accepted this revision.Jan 15 2018, 7:20 PM
This revision is now accepted and ready to land.Jan 15 2018, 7:20 PM
avg added a comment.Jan 16 2018, 10:50 AM

@anish Did you have a chance to look at D13828? Would you like to do that before we go with this solution?

@grehan Would you like to test D13828 as well?
I tried Windows 10 and 7 installations and a couple of boots after that and it all worked on my hardware[*].

I needed a hack for Win7 installation, because it used rep outsw (66 f3 6f) to write to a VGA I/O port (0x3c4)
and bhyve/vmm didn't handle it on my system because it lacks decode assist feature
and without that it seems to be impossible to figure the segment (see svm_handle_io).

grehan requested changes to this revision.Jan 17 2018, 6:45 AM

@avg I will look at and test D13828.

However, I'd prefer this go in first, though after the CR8 write intercept has been added (I'll try and get that done post-haste).

Reason being, it doesn't change the shared APIC code shared between SVM and VT-x, and the model is exactly the same as in VT-x which seems solid. In addition, as you mentioned previously, KVM doesn't use the VIRQ method.

However, D13828 may get around the perf issues of TPR exits and interrupt shadow, so it's worth pursuing. However, I'd like to see it get some more testing, and seeing what the perf gains are.

This revision now requires changes to proceed.Jan 17 2018, 6:45 AM
grehan accepted this revision.Jan 23 2018, 4:47 PM

I don't think I'll be able to get the CR8 stuff done as quick as I'd hoped, so I think it's fine to go ahead with this as-is. In addition, both Andriy and Anish have reported that Windows guests come up fine so the issue may not be as big a deal as I'd previously thought.

I'll do the CR8 exit handling as a follow-on review.

This revision is now accepted and ready to land.Jan 23 2018, 4:47 PM
avg added a comment.Jan 24 2018, 10:59 AM

@anish, @grehan thank you for approving.
Somehow I feel irrationally, strangely attracted to the alternative solution (even though it's more intrusive).

This revision was automatically updated to reflect the committed changes.