There's a race condition which, so far, is only triggered by Windows 11 when using PCI passthrough. It usually causes the Windows 11 guest to deadlock on boot.

When a vCPU triggers a rendezvous, it waits for all other vCPUs to complete the rendezvous. At that same moment, a different vCPU could try to lock all other vCPUs by calling vcpu_lock_all. In that case, both vCPUs are waiting for each other to complete their request: one vCPU hangs in vm_handle_rendezvous while the other hangs in vcpu_set_state_locked.

In order to solve this, the locking vCPU has to take part in the rendezvous. Unfortunately, we can't check in vcpu_set_state_locked which vCPU we're currently running on. For that reason, we iterate over all vCPUs and help them complete the rendezvous.
Diff Detail
- Repository
- rG FreeBSD src repository
Event Timeline
| sys/amd64/vmm/vmm.c | ||
|---|---|---|
| 1025 | You could do something like this: | |

```c
if (vcpu->vm->rendezvous_func != NULL) {
	vcpu_unlock(vcpu);
	if (vcpu->vm->rendezvous_func != NULL) {
		...
	}
}
```

In other words, check the condition before doing the unlock, not just after it. I am not even sure that the re-check after the unlock is needed, since your patch checks only after the unlock.
I have tested this patch on stable/14 (specifically, d0e8c94631267f07c90465faf5b47dcce0e232be) and I was finally able to pass through a PCIe device to my Windows 11 guest without it locking up during boot.
| sys/amd64/vmm/vmm.c | ||
|---|---|---|
| 1025 | Yeah, it's better to lock and unlock in the if condition. Rechecking after the unlock shouldn't be required because vm_handle_rendezvous will recheck it anyway. | |
| sys/amd64/vmm/vmm.c | ||
|---|---|---|
| 1026 | In other places we only load the vm->rendezvous_func variable under the mtx_lock(&vm->rendezvous_mtx). Are we ok to look at it here without holding that lock? Is the vcpu lock sufficient? | |
| sys/amd64/vmm/vmm.c | ||
|---|---|---|
| 1026 | We're just checking it here; we don't make use of it. If it's a false positive, vm_handle_rendezvous will recheck it with rendezvous_mtx locked and exit early if it's NULL. So it wouldn't lead to wrong behavior. | |
| sys/amd64/vmm/vmm.c | ||
|---|---|---|
| 1025 | Btw, I've noticed that it doesn't seem necessary to call vcpu_unlock at all. However, I'm not sure whether it's a good idea to keep the vcpu locked while calling vm_handle_rendezvous. | |
The issues I've been experiencing on Windows 10 Pro (22H2) may be related to this as well, although my VM has been running fine for the past 5 days (no reboots). I'm going to apply this patch on my server (running FreeBSD 14.3-RELEASE) and test its stability on Windows 10 Pro (my main gaming VM), which has been experiencing similar issues. I'll also test it on Windows 11 Pro (25H2) running in bhyve (all requirements bypassed).
I applied the patch on top of FreeBSD 14.3-RELEASE and tested the Windows 11 25H2 VM. It ran for about 1h34m before I got my first "WHEA_UNCORRECTABLE_ERROR" crash. On the next boot it lasted about 15m before crashing with the same error. It may have improved things, given that it previously crashed more frequently, but there might either be a different issue or another edge case. I'm going to test my Windows 11 24H2 VM, since I already know how it normally behaves (or how often it "should" be crashing). I'll report back when I have more data.
I also forgot to mention that both my Windows 10 and 11 VMs use the bhyve nvme emulation, not virtio. I also notice that the NVMe is running at 100% in the Windows task manager (since I'm installing games on it, which takes a while). I remember reading elsewhere that some people were experiencing crashes with the nvme driver, but that was many years ago and it has since been fixed.
I'm not a vmm dev but that sounds like an unrelated/different issue to me.
I have been running both Windows 10 and Windows 11 guests for years and never encountered an issue other than the PCIe passthru boot deadlock issue that this patch fixes.
You might be best off raising a separate PR.
I think there may still be a relationship, jbo, since I've been having a different experience than you. However for now I'll try out some other bisecting before coming back to this. I'm going to test out switching out the "nvme" emulation for my drive to "ahci-hd" and see if that has any stability effect. I know people have also experienced issues with the nvme emulation before.
At that same moment, a different vCPU could try to lock all other vCPUs.
Is this referring to vcpu_lock_all()? Does the deadlock only happen early during guest boot?
| sys/amd64/vmm/vmm.c | ||
|---|---|---|
| 997 | Please keep function declarations grouped together at the beginning of the file, e.g., together with vcpu_notify_event_locked(). | |
| 1025 | We cannot keep the vcpu locked while calling vm_handle_rendezvous(): that function acquires a MTX_DEF mutex, and the vcpu mutex is a spin mutex (which holds interrupts disabled). I agree that we should avoid unlocking the vcpu unnecessarily here. vcpu_set_state_locked() is called many thousands of times per second and we should avoid unnecessary work. So, something like this: | |

```c
if (__predict_false(atomic_load_ptr(&vcpu->vm->rendezvous_func) != NULL)) {
	vcpu_unlock(vcpu);
	...
	vcpu_lock(vcpu);
}
```

More generally though, is it really safe to drop the vcpu lock here? I guess so: vcpu_set_state_locked() is currently always called from vcpu_set_state(), which locks the vcpu directly, or from vcpu_require_state_locked(), which is always called immediately before or after msleep_spin(&vcpu->mtx). This deserves a comment though.

| sys/amd64/vmm/vmm.c | ||
|---|---|---|
| 1035 | Shouldn't we recheck vcpu->state after reacquiring the lock? It might have transitioned to VCPU_IDLE, in which case we should not sleep. | |
I don't think that it's related. This patch fixes a deadlock on guest boot. It's not related to BSODs.
So far, I've only seen it deadlocking on boot.
Thanks for that context Corvin. I was interpreting this code as a general vcpu synchronization fix that happens to target the deadlock at boot time, since I didn't see any conditional logic of it specifically written for boot and/or Windows 11. Sorry for the noise!
| sys/amd64/vmm/vmm.c | ||
|---|---|---|
| 1034 | We should assert that vm_handle_rendezvous() returns 0. | |
| 1035 | It would be better to introduce a new helper, vm_handle_rendezvous_all(), which effectively implements this loop, but without acquiring and dropping the rendezvous mutex for each vcpu. That is, | |

This way, we avoid a potential race. Suppose there are three vCPUs, 0, 1 and 2, and vCPU thread 2 requests a rendezvous. vCPU thread 1 completes it and blocks. vCPU thread 0 tries to lock all vCPUs, then executes the callback for vCPU 0. Then, before it continues to vCPU 1, vCPU 2 issues another rendezvous. vCPU 0 will execute the callback for vCPU 1, then skip 2, and continue. But vCPU 2 will still be sleeping, waiting for vCPU 0 to rendezvous. I think this race is avoided if vCPU 0 does not drop the rendezvous lock while executing the loop.

Another problem, though: are we sure it's safe to "steal" the rendezvous callback this way? Look at the rendezvous with vlapic_handle_init(), for example. vlapic_handle_init() does not acquire any locks, so there is nothing synchronizing this thread with the vCPU thread for that vCPU. What if that thread is also accessing the vlapic? The vlapic code assumes that all accesses are vCPU-local, I think.
| sys/amd64/vmm/vmm.c | ||
|---|---|---|
| 1035 | We only have to call the rendezvous callback for the vCPU thread we're currently running on. E.g., if vCPU 0 requests a rendezvous and vCPU 1 calls vcpu_lock_all (and then vcpu_set_state_locked), we only have to call vm_handle_rendezvous for vCPU 1 to resolve the deadlock. This would avoid "stealing" the rendezvous callback. So, ideally, we could somehow get the vcpu struct of the current thread. Unfortunately, I don't know if that's possible. | |
@corvink Let me know once you'd like me to test the patch again on my machine where I can reliably reproduce the issue.
| sys/amd64/vmm/vmm.c | ||
|---|---|---|
| 1026 | Do we need to worry about a false negative — that is, rendezvous_func is null at the time it is checked, and a rendezvous begins instantly after we are satisfied it is null? | |
| sys/amd64/vmm/vmm.c | ||
|---|---|---|
| 1026 | It will cause quite a big delay, but msleep_spin has a timeout. So, after the timeout expires, it should run the loop again and then see the rendezvous. | |
| sys/amd64/vmm/vmm.c | ||
|---|---|---|
| 1035 | I don't think it is. vcpu_lock_all() is called from the ioctl path, and some ioctls that use it are not invoked from a vCPU thread in general. So let's take a step back and make some observations: | |
So I propose the following to avoid the deadlock:
I believe this will fix the deadlock as well. Moreover, it is safe to invoke the rendezvous callback on an IDLE vCPU. I implemented my idea in https://reviews.freebsd.org/D52968 but I'm not set up to reproduce the deadlock. Could you test it? | |
| sys/amd64/vmm/vmm.c | ||
|---|---|---|
| 1035 | Will test it. Unfortunately, my test system is unavailable at the moment. It will be back in a few days. | |