Under certain tight race conditions, we found that the lack of a memory barrier in bhyve's virtio handling causes it to miss a NO_NOTIFY state transition on block devices, resulting in a guest stall. The investigation is recorded in OS-7613. While examining bhyve's use of barriers, we found one other problematic section, but it is only an issue on non-x86 ISAs with weaker memory ordering. This patch addresses that section as well, although it was never a problem on x86.
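For readers who have not met this class of race, here is a minimal sketch of the store-then-load pattern involved. It is not the code from the patch: the struct, the function names, and the use of C11 atomics are all assumptions made for illustration, and only the NO_NOTIFY flag itself comes from virtio. Each side stores to one shared field and then loads the other. Without a full fence between the store and the load, both sides can read stale values, so the device sees no new work while the guest still believes notifications are suppressed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define USED_F_NO_NOTIFY 1u	/* same meaning as virtio's VRING_USED_F_NO_NOTIFY */

/* Hypothetical, heavily simplified stand-in for the shared ring state. */
struct ring {
	_Atomic uint16_t used_flags;	/* written by device, read by guest */
	_Atomic uint16_t avail_idx;	/* written by guest, read by device */
	uint16_t last_avail;		/* device-private: next index to consume */
};

/*
 * Device side: re-enable guest notifications, then re-check for work.
 * Returns true if descriptors are pending and must be processed now.
 */
static bool
device_enable_kick_and_recheck(struct ring *r)
{
	uint16_t f = atomic_load_explicit(&r->used_flags, memory_order_relaxed);

	atomic_store_explicit(&r->used_flags, (uint16_t)(f & ~USED_F_NO_NOTIFY),
	    memory_order_relaxed);
	/* Full fence: the flag store must be visible before avail_idx is read. */
	atomic_thread_fence(memory_order_seq_cst);
	return (atomic_load_explicit(&r->avail_idx, memory_order_relaxed) !=
	    r->last_avail);
}

/*
 * Guest side: publish one descriptor, then decide whether to notify.
 * Returns true if the device must be kicked.
 */
static bool
guest_publish_and_should_kick(struct ring *r)
{
	uint16_t idx = atomic_load_explicit(&r->avail_idx, memory_order_relaxed);

	atomic_store_explicit(&r->avail_idx, (uint16_t)(idx + 1),
	    memory_order_release);
	/* Matching full fence: the index store must be visible before the flag is read. */
	atomic_thread_fence(memory_order_seq_cst);
	return ((atomic_load_explicit(&r->used_flags, memory_order_relaxed) &
	    USED_F_NO_NOTIFY) == 0);
}

/* Single-threaded walk through the protocol; returns 0 on success. */
int
ring_demo(void)
{
	struct ring r = { USED_F_NO_NOTIFY, 0, 0 };

	assert(!guest_publish_and_should_kick(&r));	/* kicks suppressed */
	assert(device_enable_kick_and_recheck(&r));	/* recheck finds the work */
	r.last_avail = 1;				/* device consumes it */
	assert(!device_enable_kick_and_recheck(&r));	/* ring is now empty */
	assert(guest_publish_and_should_kick(&r));	/* flag clear: guest kicks */
	return (0);
}
```

With the two fences in place, at least one side is guaranteed to observe the other's store, so the request is either picked up by the device's recheck or signalled by a guest kick. The walk-through in ring_demo() only exercises the protocol logic in one thread; it does not reproduce the race, which needs two CPUs and very tight timing.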
We have only observed this issue when guests ran a specific customer workload. Without the patch, the guests would hit the race and stall within hours of running. With the patch in place, the workload ran for days without issue. The reporter of PR https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=231117 is going to test as well.
After adding #include <machine/atomic.h>, it compiles fine. A virtual machine that used to crash every few hours has been stable for 3 days since switching to the patched bhyve. Can we have this merged? Thanks for your work!
FWIW, it happened on an AMD EPYC with 16 cores (32 threads).