
sys/amd64/vmm: fix Windows 11 deadlock
Abandoned · Public

Authored by corvink on Sep 29 2025, 9:41 AM.

Details

Reviewers
markj
jhb
kib
Group Reviewers
bhyve
Summary
There's a race condition which, so far, is only triggered by Windows 11
guests using PCI passthrough. It usually causes the Windows 11 guest to
deadlock on boot. When a vCPU triggers a rendezvous, it waits for all other
vCPUs to complete the rendezvous. At the same moment, a different vCPU could
try to lock all other vCPUs by calling vcpu_lock_all. In that case, both
vCPUs wait for each other to complete their request (one vCPU hangs in
vm_handle_rendezvous while the other hangs in vcpu_set_state_locked). To
resolve this, the locking vCPU has to take part in the rendezvous.
Unfortunately, vcpu_set_state_locked cannot determine which vCPU it is
currently running on. For that reason, we iterate over all vCPUs and help
them complete the rendezvous.
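
A minimal sketch of that approach inside vcpu_set_state_locked(), assuming
the vm_get_maxcpus()/vm_vcpu() accessors and that vm_handle_rendezvous()
takes a struct vcpu * (the actual diff may differ):

if (vcpu->vm->rendezvous_func != NULL) {
	struct vcpu *v;

	vcpu_unlock(vcpu);
	/*
	 * We cannot tell which vCPU this thread belongs to, so help
	 * every vCPU complete the pending rendezvous.
	 */
	for (int i = 0; i < vm_get_maxcpus(vcpu->vm); i++) {
		v = vm_vcpu(vcpu->vm, i);
		if (v != NULL)
			vm_handle_rendezvous(v);
	}
	vcpu_lock(vcpu);
}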

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped
Build Status
Buildable 67575
Build 64458: arc lint + arc unit

Event Timeline

sys/amd64/vmm/vmm.c
1025

You could do something like this:

if (vcpu->vm->rendezvous_func != NULL) {
	vcpu_unlock(vcpu);
	if (vcpu->vm->rendezvous_func != NULL) {
		.....
	}
}

in other words, check the condition without unlocking first. I am not even sure that the re-check after the unlock is needed, since your patch checks only after unlocking.

I have tested this patch on stable/14 (specifically, d0e8c94631267f07c90465faf5b47dcce0e232be) and I was finally able to pass through a PCIe device to my Windows 11 guest without it locking up during boot.

sys/amd64/vmm/vmm.c
1025

Yeah, it's better to unlock and relock only when the condition holds. Rechecking after the unlock shouldn't be required because vm_handle_rendezvous will recheck it anyway.

crowston_protonmail.com added inline comments.
sys/amd64/vmm/vmm.c
1026

In other places we only load the vm->rendezvous_func variable under the mtx_lock(&vm->rendezvous_mtx). Are we ok to look at it here without holding that lock? Is the vcpu lock sufficient?

This is a much simpler change than I had proposed. I like it.

corvink marked an inline comment as not done. Oct 1 2025, 7:21 PM
corvink added inline comments.
sys/amd64/vmm/vmm.c
1026

We're only checking it here; we don't make use of the value. If it's a false positive, vm_handle_rendezvous will recheck it with rendezvous_mtx locked and exit early if it's NULL. So it wouldn't lead to wrong behavior.
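
Roughly, the recheck being referred to looks like this (a simplified sketch
of the entry of vm_handle_rendezvous, not the verbatim code):

mtx_lock(&vm->rendezvous_mtx);
if (vm->rendezvous_func == NULL) {
	/* The unlocked check was a false positive; nothing to do. */
	mtx_unlock(&vm->rendezvous_mtx);
	return (0);
}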

sys/amd64/vmm/vmm.c
1025

Btw., I've noticed that it doesn't seem necessary to call vcpu_unlock at all. However, I'm not sure whether it's a good idea to keep the vcpu locked while calling vm_handle_rendezvous.

It may be possible that the issues I've been experiencing on Windows 10 Pro (22H2) are related to this as well, although my VM has been running fine for the past 5 days (no reboots). I'm going to apply this patch on my server (running FreeBSD 14.3-RELEASE) and test its stability on Windows 10 Pro (my main gaming VM), which has been experiencing similar issues. I'll also test it on Windows 11 Pro (25H2) running in bhyve (all requirements bypassed).

I applied the patch on top of FreeBSD 14.3-RELEASE and tested the Windows 11 25H2 VM. It ran for about 1h34m before I got my first "WHEA_UNCORRECTABLE_ERROR" crash. On the next boot it lasted about 15m before it crashed with the same error. It may have improved things, given that it previously crashed more frequently, but there might either be a different issue or another edge case. I'm going to test my Windows 11 24H2 VM, since I already know how it normally behaves (and how often it "should" be crashing). I'll report back when I get more data.

I also forgot to mention that both my Windows 10 and 11 guests use the bhyve nvme emulation, not virtio, and I do notice that the NVMe is running at 100% in the Windows Task Manager (since I'm installing games on it and they take a while to install). I remember reading elsewhere that some people were experiencing crashes with the nvme emulation, but that was many years ago and it has since been fixed.

before I got my first "WHEA_UNCORRECTABLE_ERROR" crash.

I'm not a vmm dev but that sounds like an unrelated/different issue to me.
I have been running both Windows 10 and Windows 11 guests for years and never encountered an issue other than the PCIe passthru boot deadlock issue that this patch fixes.

You might be best off raising a separate PR.

I think there may still be a relationship, jbo, since I've been having a different experience from yours. However, for now I'll try some other bisecting before coming back to this. I'm going to try switching the "nvme" emulation for my drive to "ahci-hd" and see whether that has any effect on stability. I know people have also experienced issues with the nvme emulation before.

At that same moment, a different vCPU could try to lock all other vCPUs.

Is this referring to vcpu_lock_all()? Does the deadlock only happen early during guest boot?

sys/amd64/vmm/vmm.c
997

Please keep function declarations grouped together at the beginning of the file, e.g., together with vcpu_notify_event_locked().

1025

We cannot keep the vcpu locked while calling vm_handle_rendezvous(): that function acquires a MTX_DEF mutex, and the vcpu mutex is a spin mutex (which holds interrupts disabled).

I agree that we should avoid unlocking the vcpu unnecessarily here. vcpu_set_state_locked() is called many thousands of times per second and we should avoid unnecessary work. So something like this:

if (__predict_false(atomic_load_ptr(&vcpu->vm->rendezvous_func) != NULL)) {
    vcpu_unlock(vcpu);
    ...
    vcpu_lock(vcpu);
}

More generally though, is it really safe to drop the vcpu lock here? I guess so: vcpu_set_state_locked() is currently always called from vcpu_set_state(), which locks the vcpu directly, or vcpu_require_state_locked(), which is always called immediately before or after msleep_spin(&vcpu->mtx). This deserves a comment though.

1035

Shouldn't we recheck vcpu->state after reacquiring the lock? It might have transitioned to VCPU_IDLE, in which case we should not sleep.
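
In other words, something along these lines (a structural sketch only,
assuming the existing wait loop in vcpu_set_state_locked()):

while (vcpu->state != VCPU_IDLE) {
	if (vcpu->vm->rendezvous_func != NULL) {
		vcpu_unlock(vcpu);
		/* ...help complete the rendezvous as above... */
		vcpu_lock(vcpu);
		/* Recheck vcpu->state; it may have become VCPU_IDLE. */
		continue;
	}
	/* ...existing msleep_spin() wait on the vcpu lock... */
}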

  • only unlock vcpu when necessary
  • avoid unnecessary sleeps
  • fix some style issues

I think there may still be a relationship, jbo, since I've been having a different experience from yours. However, for now I'll try some other bisecting before coming back to this. I'm going to try switching the "nvme" emulation for my drive to "ahci-hd" and see whether that has any effect on stability. I know people have also experienced issues with the nvme emulation before.

I don't think that it's related. This patch fixes a deadlock on guest boot. It's not related to BSODs.

At that same moment, a different vCPU could try to lock all other vCPUs.

Is this referring to vcpu_lock_all()? Does the deadlock only happen early during guest boot?

So far, I've only seen it deadlocking on boot.

Thanks for that context, Corvin. I was interpreting this code as a general vCPU synchronization fix that happens to target the deadlock at boot time, since I didn't see any conditional logic in it specific to boot and/or Windows 11. Sorry for the noise!

sys/amd64/vmm/vmm.c
1034

We should assert that vm_handle_rendezvous() returns 0.

1035

It would be better to introduce a new helper, vm_handle_rendezvous_all(), which effectively implements this loop, but without acquiring and dropping the rendezvous mutex for each vcpu.

That is,

  1. add vm_handle_rendezvous_locked(), which is called with the rendezvous lock already acquired,
  2. add vm_handle_rendezvous_all(), which acquires the rendezvous lock and calls vm_handle_rendezvous_locked() for each vCPU.

This way, we avoid a potential race. Suppose there are three vCPUs, 0, 1, 2, and vCPU thread 2 requests a rendezvous. vCPU thread 1 completes it and blocks. vCPU thread 0 tries to lock all vCPUs, then executes the callback for vCPU 0. Then, before it continues to vCPU 1, vCPU 2 issues another rendezvous. vCPU 0 will execute the callback for vCPU 1, then skip 2, and continue. But vCPU 2 will still be sleeping waiting for vCPU 0 to rendezvous.

I think this race is avoided if vCPU 0 does not drop the rendezvous lock while executing the loop.
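
A rough sketch of that shape, assuming a vm_handle_rendezvous_locked()
helper as in step 1 and the vm_get_maxcpus()/vm_vcpu() accessors
(illustrative only, not the actual change):

static int
vm_handle_rendezvous_all(struct vm *vm)
{
	struct vcpu *vcpu;
	int error, i;

	error = 0;
	mtx_lock(&vm->rendezvous_mtx);
	for (i = 0; i < vm_get_maxcpus(vm); i++) {
		vcpu = vm_vcpu(vm, i);
		if (vcpu == NULL)
			continue;
		/* Run the pending callback on behalf of this vCPU. */
		error = vm_handle_rendezvous_locked(vcpu);
		if (error != 0)
			break;
	}
	mtx_unlock(&vm->rendezvous_mtx);
	return (error);
}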

Another problem though, are we sure it's safe to "steal" the rendezvous callback this way? Look at the rendezvous with vlapic_handle_init(), for example. vlapic_handle_init() does not acquire any locks, so there is nothing synchronizing this thread with the vCPU thread for that vCPU. What if that thread is also accessing the vlapic? The vlapic code assumes that all accesses are vCPU-local, I think.

sys/amd64/vmm/vmm.c
1035

Another problem though, are we sure it's safe to "steal" the rendezvous callback this way? Look at the rendezvous with vlapic_handle_init(), for example. vlapic_handle_init() does not acquire any locks, so there is nothing synchronizing this thread with the vCPU thread for that vCPU. What if that thread is also accessing the vlapic? The vlapic code assumes that all accesses are vCPU-local, I think.

We only have to call the rendezvous for the vCPU thread we're currently running on; e.g., if vCPU 0 requests a rendezvous and vCPU 1 calls vcpu_lock_all (and then vcpu_set_state_locked), we only have to call vm_handle_rendezvous for vCPU 1 to resolve the deadlock. This would avoid "stealing" the rendezvous callback. So, ideally we could somehow get the vcpu struct of the current thread. Unfortunately, I don't know whether that's possible.

@corvink Let me know once you'd like me to test the patch again on my machine where I can reliably reproduce the issue.

I have observed this lock-up post-boot.

sys/amd64/vmm/vmm.c
1026

Do we need to worry about a false negative — that is, rendezvous_func is null at the time it is checked, and a rendezvous begins instantly after we are satisfied it is null?

corvink added inline comments.
sys/amd64/vmm/vmm.c
1026

It will cause quite a big delay, but msleep_spin has a timeout. So, after the timeout expires, it should run the loop again and then see the rendezvous.

sys/amd64/vmm/vmm.c
1035

Unfortunately, I don't know if that's possible.

I don't think it is. vcpu_lock_all() is called from the ioctl path, and some ioctls that use it are not invoked from a vCPU thread in general.

So let's take a step back and make some observations:

  • it is safe to invoke the rendezvous callback for a vCPU if the vCPU state is IDLE,
  • the state of a vCPU thread sleeping in vm_smp_rendezvous() will be FROZEN,
  • if two threads call vcpu_lock_all() at the same time, they will be serialized since they lock the vCPUs in a consistent order,
  • the deadlock arises when we try to set the state of a thread sleeping in vm_smp_rendezvous(), and that thread will stop sleeping when all /other/ vCPUs have executed their callback

So I propose the following to avoid the deadlock:

  1. use an sx lock to serialize vcpu_lock_all() (i.e., make the existing vm_slock_vcpus() an exclusive lock)
  2. in vcpu_lock_all(), lock all IDLE vCPUs first (i.e., set their state to FROZEN) and invoke the rendezvous callback for them
  3. then, lock all the remaining vCPUs

I believe this will fix the deadlock as well. Moreover, it is safe to invoke the rendezvous callback on an IDLE vCPU.
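
For illustration, the two passes could be shaped roughly like this (a sketch
of the idea only, not the actual implementation; vm_xlock_vcpus(),
vcpu_lock_one() and the locked[] bookkeeping are placeholders):

/* 1. Serialize vcpu_lock_all() callers (exclusive instead of shared). */
vm_xlock_vcpus(vm);

/* 2. Freeze all currently IDLE vCPUs and run the rendezvous callback for them. */
for (i = 0; i < vm_get_maxcpus(vm); i++) {
	vcpu = vm_vcpu(vm, i);
	if (vcpu != NULL && vcpu_get_state(vcpu, NULL) == VCPU_IDLE) {
		vcpu_lock_one(vcpu);		/* IDLE -> FROZEN */
		vm_handle_rendezvous(vcpu);
		locked[i] = true;
	}
}

/* 3. Lock the remaining vCPUs; their pending rendezvous can now complete. */
for (i = 0; i < vm_get_maxcpus(vm); i++) {
	vcpu = vm_vcpu(vm, i);
	if (vcpu != NULL && !locked[i])
		vcpu_lock_one(vcpu);
}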

I implemented my idea in https://reviews.freebsd.org/D52968 but I'm not set up to reproduce the deadlock. Could you test it?

corvink added inline comments.
sys/amd64/vmm/vmm.c
1035

I implemented my idea in https://reviews.freebsd.org/D52968 but I'm not set up to reproduce the deadlock. Could you test it?

Will test it. Unfortunately, my test system is unavailable at the moment. It will be back in a few days.