Page MenuHomeFreeBSD

vmm: permit some IPIs to be handled by userspace
ClosedPublic

Authored by corvink on Jun 28 2022, 8:49 AM.
Tags
Referenced Files
Unknown Object (File)
Thu, Dec 19, 1:04 PM
Unknown Object (File)
Wed, Dec 11, 12:28 AM
Unknown Object (File)
Tue, Dec 10, 12:33 AM
Unknown Object (File)
Mon, Dec 9, 1:43 AM
Unknown Object (File)
Sun, Dec 8, 1:04 PM
Unknown Object (File)
Sat, Dec 7, 11:46 AM
Unknown Object (File)
Fri, Dec 6, 8:57 AM
Unknown Object (File)
Wed, Dec 4, 9:22 PM

Details

Summary

Add VM_EXITCODE_IPI to permit returning unhandled IPIs to userland.
INIT and Startup IPIs are now returned to userland. Due to backward
compatibility reasons, a new capability is added for enabling
VM_EXITCODE_IPI.

This patch requires following patches to work properly:
D35621
D35622

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

The commit log here really needs some clarity. I have an attempt below that I think describes what you are trying to do:

vmm: Permit some IPIs to be handled by userspace.

Add VM_EXITCODE_IPI to permit returning unhandled IPIs to userland.
INIT and Startup IPIs are now returned to userland.

One question is does this mean that an old bhyve will now break? If so, perhaps we should add a new
capability for userland to request userland exits vs ignoring the IPIs?

sys/amd64/vmm/io/vlapic.c
1002

Probably this should use __assert_unreachable instead.

1005

I think this warrants a comment perhaps, something like:

/* ipimask is a set of vCPUs needing userland handling of the current IPI. */
In D35623#808987, @jhb wrote:

One question is does this mean that an old bhyve will now break? If so, perhaps we should add a new
capability for userland to request userland exits vs ignoring the IPIs?

Yes, it breaks old bhyve. Adding a new capability is a good idea. I'll change it.

corvink retitled this revision from bhyve: handle IPIs properly to vmm: permit some IPIs to be handled by userspace.Jul 4 2022, 6:17 AM
corvink edited the summary of this revision. (Show Details)

Probably want to adjust the commit log a bit more now for the addition of the new capability. But in general I think this is fine.

sys/amd64/vmm/intel/vmx.c
3614

Do we also want to support this new capability for vm_get_capability for completeness?

  • report VM_CAP_IPI_EXIT by getcap
corvink added a reviewer: manu.
  • add context to diff file

@jhb seems all comments have been resolved, will commit later today unless you have objections.

This revision was not accepted when it landed; it landed in state Needs Review.Sep 7 2022, 7:10 AM
This revision was automatically updated to reflect the committed changes.
  • fix race condition in cpu suspend code

So what was the race you had to fix from the previous version? (I have a huge branch I had already rebased on top of this and was hoping this would go back in without having to rebase on the reverted version, so would love to see this back in if the reported issue is fixed? On my branch on top of the original commit I was able to boot via bhyveload fine FWIW)

In D35623#836244, @jhb wrote:

So what was the race you had to fix from the previous version? (I have a huge branch I had already rebased on top of this and was hoping this would go back in without having to rebase on the reverted version, so would love to see this back in if the reported issue is fixed? On my branch on top of the original commit I was able to boot via bhyveload fine FWIW)

It was only happening with Xeon CPU iirc, Corvin is on holiday ATM, not sure when he's back.

On Xeon CPUs and when using bhyveload, vlapic_reset(vlapic2); crashes because the vcpu belonging to vlapic2 is running and you can't set (apic-)register values of running vcpus.

When a vcpu receives an INIT IPI, this patch suspends the vcpu by calling vm_suspend. vm_suspend is asynchronous. So, the vcpu can be still running when the SIPI is received. Additionally, a suspended vcpu still calls vm_run in a loop as long as it's not debugged. So, it enters the run state for a while and exits immediately. My updated patch ensures that the vcpu is in run state when it receives a SIPI. Nevertheless, it seems to work for Xeon CPUs and 14.0-CURRENT but it deadlocks on a Core CPU and 13.1. I'm still searching for a solution.

  • properly fix race condition in INIT-SIPI-SIPI emulation
sys/amd64/vmm/io/vlapic.c
64

I would maybe do a standalone commit for this change first before the main commit to say this is to prevent Linux from using broadcast INIT IPIs. Probably the other way to fix the Linux problem btw would be to make the broadcast INIT actually be an "all but self" INIT, but avoiding broadcast IPIs is probably better anyway.

1034–1035

The gross hack for Linux here perhaps would be to always remove the sending vCPU from dmask. It never makes sense for a CPU to send an INIT IPI to itself. Something like:

CPU_CLR(vlapic->vcpuid, &dmask);

If that fixes Linux without requiring the LAPIC version bump, I think I'd like to include the fix as it is more future proof (and probably more accurate to how INIT IPIs are treated). I'd probably still want to bump the APIC version as a separate commit just because we don't need to try to emulate a real P5 or PPro, but this fix above might also prevent malicious guests from causing weird behavior by sending an INIT IPI to the executing vCPU.

pmooney_pfmooney.com added inline comments.
sys/amd64/vmm/io/vlapic.c
1034–1035

When we (illumos) overhauled our LAPIC emulation to support OVMF (which was using more of INIT/SIPI than previously supported), we chose to ignore the problematic shorthand emitted by the Linux code in question. I know the overall approach here is different than what we did, but perhaps at least this tidbit is worth consideration. We've been running with it for a couple years.

sys/amd64/vmm/io/vlapic.c
1034–1035

I'd prefer to validate all IPIs. The Intel SDM contains a table of valid combinations for the local interrupt command register. So, we could check for valid combinations and ignore invalid ones (or throw an exception).

If we like to fix this without a version bump, I'd prefer to emulate it correctly. The SDM states that the correct behaviour on reception of the INIT deassert would be to write the apic id into the arb id register.

Haven't checked the AMD manuals yet.

sys/amd64/vmm/io/vlapic.c
1034–1035

I've noticed that vm_smp_rendezvous crashes on Intel Xeon CPUs. It looks like vm_smp_rendezvous calls mtx_sleep which clears the CR0_TS bit. This causes a panic in save_guest_fpustate.

Can we avoid this?

  • do not call vm_smp_rendezvous inside vmx_run
corvink added inline comments.
sys/amd64/vmm/io/vlapic.c
64
1034–1035

Sadly, neither Intel nor AMD documents what happens when someone tries to send an invalid IPI (like INIT broadcast to all including self). However, they document which ICR combinations are invalid. I've created a patch for it. Please note, that it behaves like an APIC version 0x14 or higher.

https://reviews.freebsd.org/D36946

corvink marked an inline comment as done.
  • set boot state in icrlo_write handler too. It's required for userland applications which do not support the ipi exit.
This revision is now accepted and ready to land.Oct 13 2022, 4:31 PM
This revision was automatically updated to reflect the committed changes.