Details

Reviewers

markj
jhb
bz

Group Reviewers

bhyve

Commits

rG7148611e4f56: vmm: avoid spurious rendezvous
rG892feec2211d: vmm: avoid spurious rendezvous

Summary

A vcpu only checks if a rendezvous is ongoing or not to decide if it should handle a rendezvous. This could lead to spurios rendezvous where a vcpu tries a handle a rendezvous it isn't part of. This situation is properly handled by vm_handle_rendezvous but it could potentially degrade the performance. Avoid that by an early check if the vcpu is part of the rendezvous or not.

At the moment, rendezvous are only used to spin up application processors and to send ioapic interrupts. Spinning up application processor is done in the guest boot phase by sending INIT SIPI sequences to single vcpus. This is known to cause spurious rendezvous but only occurs in the boot phase. Sending ioapic interrupts is rare because modern guest will use msi and the rendezvous is always send to all vcpus.

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

corvink created this revision.Nov 15 2022, 11:02 AM

Herald added subscribers: bcran, rgrimes, imp. · View Herald TranscriptNov 15 2022, 11:02 AM

corvink requested review of this revision.Nov 15 2022, 11:02 AM

Harbormaster completed remote builds in B48357: Diff 113135.Nov 15 2022, 11:02 AM

initialize evinfo.rmtx

Harbormaster completed remote builds in B48358: Diff 113136.Nov 15 2022, 11:04 AM

gusev.vitaliy_gmail.com added a subscriber: gusev.vitaliy_gmail.com.Nov 15 2022, 12:19 PM

gusev.vitaliy_gmail.com added inline comments.

sys/amd64/include/vmm.h
163	I think you don't need mutex here (&vm->rendezvous_mtx), because taking the mutex every time during check can be too expensive for VM with many VCPUs. So this code can introduce bottleneck. Instead I would suggest to move that engine (rendezvous_req_cpus ) to atomic operations (CPU_SET_ATOMIC, etc) and barriers, it should be fast on reading (check) and safe.

panic: acquiring blockable sleep lock with spinlock or critical section help (sleep mutex) ... I hope the crashdump worked and I'll follow-up ..

panic: acquiring blockable sleep lock with spinlock or critical section held (sleep mutex) vm rendezvous lock @ ./machine/vmm.h:325
cpuid = 3
time = 1668515758
KDB: stack backtrace:
db_trace_self_wrapper() at db_trace_self_wrapper+0x2b/frame 0xfffffe0114f3c580
vpanic() at vpanic+0x151/frame 0xfffffe0114f3c5d0
panic() at panic+0x43/frame 0xfffffe0114f3c630
witness_checkorder() at witness_checkorder+0xd3e/frame 0xfffffe0114f3c7f0
__mtx_lock_flags() at __mtx_lock_flags+0x94/frame 0xfffffe0114f3c840
vmx_run() at vmx_run+0x91f/frame 0xfffffe0114f3c9a0
vm_run() at vm_run+0x223/frame 0xfffffe0114f3caa0
vmmdev_ioctl() at vmmdev_ioctl+0x507/frame 0xfffffe0114f3cb40
devfs_ioctl() at devfs_ioctl+0xcd/frame 0xfffffe0114f3cb90
vn_ioctl() at vn_ioctl+0x131/frame 0xfffffe0114f3cca0
devfs_ioctl_f() at devfs_ioctl_f+0x1e/frame 0xfffffe0114f3ccc0
kern_ioctl() at kern_ioctl+0x202/frame 0xfffffe0114f3cd30
sys_ioctl() at sys_ioctl+0x12a/frame 0xfffffe0114f3ce00
amd64_syscall() at amd64_syscall+0x12e/frame 0xfffffe0114f3cf30
fast_syscall_common() at fast_syscall_common+0xf8/frame 0xfffffe0114f3cf30
--- syscall (54, FreeBSD ELF64, ioctl), rip = 0x260d25b0833a, rsp = 0x260db3befe58, rbp = 0x260db3beff10 ---
KDB: enter: panic

Sadly we don't dump ddb output as part of the msgbuf; so show alllocks isn't there; sorry no serial console on that one to capture..

This revision now requires changes to proceed.Nov 15 2022, 12:41 PM

implement vcpu_rendezvous_pending without a mtx

Harbormaster completed remote builds in B48363: Diff 113145.Nov 15 2022, 1:10 PM

corvink marked an inline comment as done.Nov 15 2022, 1:11 PM

gusev.vitaliy_gmail.com added inline comments.Nov 15 2022, 2:09 PM

sys/amd64/vmm/vmm.c
1447	This is a "point" of rendezvous. All vCPU threads are sitting here until the last is done. vcpu_rendezvous_pending() will never have side effect on an old `rendezvous_req_cpus` value and therefore added OR/AND logic looks excessive.
2650	In general it should work. But in terms of atomicity, you can see a little window between setting `rendezvous_req_cpus` and then setting `vm->rendezvous_func = func`. Setting order is wrong. Reader can check `rendezvous_req_cpus` and see the bit set and then see rendezvous_func is NULL. When all code is protected by mutex or spinlock, all changes in that section are seen as atomic (consistent) . But if we change data under protected mutex, but read (access) data without mutex protection, it can be inconsistent. As solution you can move ATOMIC operation after 'vm->rendezvous_func = func' assignment. That would protect ordering for data writing. Another alternative for atomic is use memory barriers: a) between changing rendezvous_func and cpuset (setting `rendezvous_func` must be first) b) after CPU_ISSET() in vcpu_rendezvous_pending Also CPU_OR_ATOMIC can be replaced by CPU_SET_ATOMIC - adding AND/OR is excessive because it is rendezvous.

use memory barriers instead of atomic operations

Harbormaster completed remote builds in B48375: Diff 113176.Nov 16 2022, 7:07 AM

corvink marked an inline comment as done.Nov 16 2022, 7:47 AM

corvink added inline comments.

sys/amd64/vmm/vmm.c
1447	There one minor mistake in your comment which causes the bug of bz. It's only guaranteed that a subset of all vcpu thread (those specified by `rendezvous_req_cpus`) are sitting here. Yeah, I thought about it and you're right. Adding atomic isn't required here.
2650	You're right but I don't understand why it should be required to use a memory barrier after CPU_ISSET() in vcpu_rendezvous_pendig. Btw: CPU_SET_ATOMIC wouldn't work properly for vms with many vcpu. It only sets a single long value. So, you can only set 64 vcpus. Look at it's implementation: #define __BIT_SET_ATOMIC(_s, n, p) \ atomic_set_long(&(p)->__bits[__bitset_word(_s, n)], \ __bitset_mask((_s), n))

corvink marked an inline comment as done.Nov 16 2022, 7:48 AM

So the problem is:

vcpu1 initiates a rendezvous with some set of vcpus
vcpu2 does not belong to the set, but observes vcpu_rendezvous_pending() (an unlocked read of vm->rendezvous_func) is true
vcpu2 calls vm_exit_rendezvous() but now vm->rendezvous_func is NULL because the rendezvous is already over, and this triggers an assertion failure

Why not simply remove the incorrect in assertion in vm_exit_rendezvous()? vm_handle_rendezvous() will do nothing if no rendezvous is pending. That is, it is harmless for vcpu2 to try to handle VM_EXITCODE_RENDEZVOUS. I think this situation is rare enough that we do not care about the cost of calling vm_exit_rendezvous() unnecessarily.

In D37390#850476, @markj wrote:

So the problem is:

vcpu1 initiates a rendezvous with some set of vcpus

vcpu2 does not belong to the set, but observes vcpu_rendezvous_pending() (an unlocked read of vm->rendezvous_func) is true

vcpu2 calls vm_exit_rendezvous() but now vm->rendezvous_func is NULL because the rendezvous is already over, and this triggers an assertion failure

?

Why not simply remove the incorrect in assertion in vm_exit_rendezvous()? vm_handle_rendezvous() will do nothing if no rendezvous is pending. That is, it is harmless for vcpu2 to try to handle VM_EXITCODE_RENDEZVOUS. I think this situation is rare enough that we do not care about the cost of calling vm_exit_rendezvous() unnecessarily.

Yes, that's the issue. It's caused by vm_handle_ipi. If the OS sends INIT to a single cpu (what most OS do), vm_handle_ipi will initiate a rendezvous for a single vcpu. Don't know if the costs for unneccessary vm_exit_rendezvous calls are noticable for vms with high vcpu counts.

In D37390#850493, @corvink wrote:

In D37390#850476, @markj wrote:

So the problem is:

vcpu1 initiates a rendezvous with some set of vcpus

vcpu2 does not belong to the set, but observes vcpu_rendezvous_pending() (an unlocked read of vm->rendezvous_func) is true

vcpu2 calls vm_exit_rendezvous() but now vm->rendezvous_func is NULL because the rendezvous is already over, and this triggers an assertion failure

?

Why not simply remove the incorrect in assertion in vm_exit_rendezvous()? vm_handle_rendezvous() will do nothing if no rendezvous is pending. That is, it is harmless for vcpu2 to try to handle VM_EXITCODE_RENDEZVOUS. I think this situation is rare enough that we do not care about the cost of calling vm_exit_rendezvous() unnecessarily.

Yes, that's the issue. It's caused by vm_handle_ipi. If the OS sends INIT to a single cpu (what most OS do), vm_handle_ipi will initiate a rendezvous for a single vcpu. Don't know if the costs for unneccessary vm_exit_rendezvous calls are noticable for vms with high vcpu counts.

Then I think the right solution is to just remove the assertion. A vCPU will only call vm_exit_rendezvous unnecessarily if it happens to be exiting anyway, so I'm quite skeptical that there is any noticeable cost. If so, then it can be addressed as a separate follow-up commit.

In D37390#850494, @markj wrote:

In D37390#850493, @corvink wrote:

In D37390#850476, @markj wrote:

So the problem is:

vcpu1 initiates a rendezvous with some set of vcpus

vcpu2 does not belong to the set, but observes vcpu_rendezvous_pending() (an unlocked read of vm->rendezvous_func) is true

vcpu2 calls vm_exit_rendezvous() but now vm->rendezvous_func is NULL because the rendezvous is already over, and this triggers an assertion failure

?

Why not simply remove the incorrect in assertion in vm_exit_rendezvous()? vm_handle_rendezvous() will do nothing if no rendezvous is pending. That is, it is harmless for vcpu2 to try to handle VM_EXITCODE_RENDEZVOUS. I think this situation is rare enough that we do not care about the cost of calling vm_exit_rendezvous() unnecessarily.

Yes, that's the issue. It's caused by vm_handle_ipi. If the OS sends INIT to a single cpu (what most OS do), vm_handle_ipi will initiate a rendezvous for a single vcpu. Don't know if the costs for unneccessary vm_exit_rendezvous calls are noticable for vms with high vcpu counts.

Then I think the right solution is to just remove the assertion. A vCPU will only call vm_exit_rendezvous unnecessarily if it happens to be exiting anyway, so I'm quite skeptical that there is any noticeable cost. If so, then it can be addressed as a separate follow-up commit.

I agree with Mark. I think the right solution is to relax the assertion and explicitly handle spurious calls to vm_exit_rendezvous().

sys/amd64/include/vmm.h
336–343	OTOH, this probably needs to be something like `CPU_EMPTY()` instead of just checking the first word, or even the check you have here of checking the cpuid might be sensible. That would be a separate followup though. Until we have more vCPUs than bits in a pointer though (which we can't until my other series lands) this isn't yet a problem. I probably should fix this in my other series actually.

corvink added inline comments.Nov 17 2022, 6:51 AM

sys/amd64/include/vmm.h
336–343	No, this shouldn't be `CPU_EMPTY`. This function should check if the vcpu identified by vcpuid has a pending rendevzous or not. So, this should only test if the bit at position vcpuid is set or not. That's exactly what `CPU_ISSET` does and it will work even with thousands of vcpus.

I've created a new revision for relaxing the rendezvous assertion: https://reviews.freebsd.org/D37417

This patch avoids spurios rendezvous. I've updated the description. Please comment if you like it or not. If you don't like it, I'll drop it.

jhb added inline comments.Nov 17 2022, 6:22 PM

sys/amd64/include/vmm.h
336–343	I had thought the current code was setting `rptr` to the cpuset, and `CPU_EMPTY()` would be a more correct version of checking for an empty mask in that case (to preserve existing semantics). I do think if we want a comment anywhere we want it here as this is the one read outside of the rendezvous_mtx. The comment could just say that the check is racy, but that's ok as if we see a stale value we will get an IPI to force an exit, and if we see a stale value with it set, the handler will ignore the spurious rendezvous.
sys/amd64/vmm/vmm.c
1443	I don't know that we need quite this long a comment. We didn't have one before for clearing rendezvous_func above.
1861	Oh, this is why the previous code worked (I had thought it was always pointing to the cpuset)
2660	We generally use `atomic_*` now instead of these. Also, what you want here instead is probably `atomic_thread_fence_rel` or you want to use `atomic_store_ptr_rel` for setting `vm->rendezvous_func`. OTOH, I don't know that you need this barrier at all since you've removed the assertion. The fact that the rendezvous handler grabs the rendezvous_mtx before checking any of these members means that we don't actually depend on any real ordering here. It probably is better to set `rendezvous_req_cpus` earlier as the existing code does now so that any vCPUs doing a VM enter concurrent with this have chance to see it and get ready for rendezvous without needing an IPI to force an exit.

corvink added inline comments.Nov 17 2022, 8:24 PM

sys/amd64/include/vmm.h
336–343	I can add a comment.
sys/amd64/vmm/vmm.c
1443	The first intention of this patch was to avoid spurious rendezvous completely (and those fixing the issue of bz). This comment explains some details how it's done. As we are going to remove the rendezvous assertion, it's not required any more to avoid spurious rendezvous completely. So, this comment isn't neccessary.
2660	Thanks for the explanation how to use barriers. You're right as we are going to remove the rendezvous assertion.

corvink mentioned this in D37417: vmm: remove unneccessary rendezvous assertion.Nov 18 2022, 11:28 AM

rebase
do not use any memory barriers

Harbormaster completed remote builds in B48436: Diff 113371.Nov 21 2022, 11:54 AM

gusev.vitaliy_gmail.com added inline comments.Nov 21 2022, 3:56 PM

sys/amd64/include/vmm.h
336–343	What if use additional check on 'rendezvous_func' ? It should be faster than use only CPU_ISSET() on every vmexit that doesn't break vmx/svm loop. Of course that optimisation makes sense if there are a lot of such vmexits. vcpu_rendezvous_pending() { return (((uintptr_t )(info->rptr)) != 0) && CPU_ISSET(vcpuid, (cpuset_t *)info->rptrset); } rptr -> rendezvous_func rptrset -> rendezvous_req_cpus

fix compile
update comment of vcpu_rendezvous_pending

Harbormaster completed remote builds in B48446: Diff 113399.Nov 22 2022, 8:09 AM

corvink added inline comments.Nov 22 2022, 8:24 AM

sys/amd64/include/vmm.h
336–343	Not sure if it's worth. CPU_ISSET is a single lookup (_cpuset[bitoffset] & bitmask). So, it should be fast. Well, I'm not sure if it's worth to commit this patch at all. Spurious rendezvous are rare and at the moment only occur when sending INIT/SIPI. OTOH, this patch is small, keeps the rendezvous simple and is a slight improvement. @jhb or @markj may have more experience on that.

jhb added inline comments.Nov 23 2022, 2:34 AM

sys/amd64/include/vmm.h
336–343	I don't think we need to overcomplicate the check with checking both things. Rendezvous should be rare. Modern guests should be using MSI interrupts and not I/O APIC interrupts in which case rendezvous will only occur during boot to start APs.
341	Perhaps more accurate to say the the races are harmless. They still exist, they are just all handled
343	We could also change the type of `info->rptr` to be `cpuset_t *` which would remove the need for casts. It's only used for this.

update comment
turn rptr into cpuset_t *

Harbormaster completed remote builds in B48463: Diff 113426.Nov 23 2022, 7:13 AM

Not sure if the current log message is the one you plan to use? It would be good to be clear that post-boot the only rendezvous used are for I/O APIC interrupts which should be rare in most guests.

zlei added a subscriber: zlei.Nov 29 2022, 7:39 AM

corvink edited the summary of this revision. (Show Details)Dec 2 2022, 1:28 PM

This revision was not accepted when it landed; it landed in state Needs Review.Feb 1 2023, 12:28 PM

Closed by commit rG892feec2211d: vmm: avoid spurious rendezvous (authored by corvink). · Explain Why

This revision was automatically updated to reflect the committed changes.

corvink added a commit: rG892feec2211d: vmm: avoid spurious rendezvous.

corvink added a commit: rG7148611e4f56: vmm: avoid spurious rendezvous.Feb 8 2023, 10:40 AM