vmm: Fix HLT loop while vcpu has requested virtual interrupts
ClosedPublic

Authored by gusev.vitaliy_gmail.com on Apr 17 2023, 5:19 PM.

Details

Summary

We have seen Linux VMs (kernel 5.4, 5.15, 6.x) get stuck, with the following messages in the guest's dmesg:

  1. BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 37s!
  2. watchdog: BUG: soft lockup - CPU#3 stuck for 26s! [kworker/3:3:2090493]
  3. rcu: INFO: rcu_sched detected stalls on CPUs/tasks:

The physical CPUs are Intel(R) Xeon(R) Gold 6248R with the "Virtual Interrupt Delivery" and "Process posted interrupts" features enabled.

Investigation showed that vmx_pending_intr returns 0 even when vmexit->u.hlt.intr_status was 0xfe, 0xfd, ... and pir_desc->pending was 1.
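As an illustration of why an intr_status of 0xfe matters, here is a minimal sketch of the priority comparison the SDM describes for recognizing a pending virtual interrupt: the requesting virtual interrupt (RVI) sits in the low byte of the guest interrupt status, and it is recognized when its priority class (bits 7:4) exceeds that of the processor priority. The names rvi_outranks_ppr and PRIO_CLASS_MASK are hypothetical and exist only for this sketch; this is not the code in vmx.c.

```c
#include <stdint.h>

/* Hypothetical name: mask selecting the priority class (bits 7:4). */
#define PRIO_CLASS_MASK 0xf0u

/*
 * Sketch only. Returns 1 when the RVI, taken from the low byte of the
 * guest interrupt status, has a higher priority class than the
 * processor priority, and 0 otherwise.
 */
static int
rvi_outranks_ppr(uint16_t intr_status, uint32_t ppr)
{
	uint8_t rvi_class = (uint8_t)(intr_status & PRIO_CLASS_MASK);
	uint8_t ppr_class = (uint8_t)(ppr & PRIO_CLASS_MASK);

	return (rvi_class > ppr_class);
}
```

With intr_status 0xfe and a processor priority of 0x20, the comparison is 0xf0 > 0x20, so the interrupt should be treated as pending and the vcpu should not stay halted.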

The function vmx_inject_pir has a comment about the situation where 'pending' is 1 and every pirval is zero:

* It is possible for pirval to be 0 here, even though the
* pending bit has been set. The scenario is:

This change corrects the initial fix, 02cc877968bbcd57695035c67114a67427f54549 ("Recognize a pending virtual interrupt while emulating the halt instruction"), so that it covers both cases: pending is 0 and pending is 1.
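The difference between the two cases can be shown with a toy model. This is a simplified sketch under stated assumptions, not the committed code: struct pir_model, highest_posted_vector, pending_intr_before and pending_intr_after are hypothetical names, and details of the real vmx_pending_intr (atomic loads, assertions, other special cases) are omitted. It only shows the ordering issue: if the RVI check runs only when pending is 0, a set pending bit with all-zero pir words hides a recognizable virtual interrupt.

```c
#include <stdint.h>

#define PRIO_CLASS_MASK 0xf0u	/* hypothetical: priority class, bits 7:4 */

/* Toy stand-in for a posted-interrupt descriptor (hypothetical layout). */
struct pir_model {
	uint64_t pir[4];	/* 256 posted-interrupt request bits */
	uint64_t pending;	/* nonzero when a posting was signalled */
};

/* Highest vector set in pir[], or -1 when every word is zero. */
static int
highest_posted_vector(const struct pir_model *pd)
{
	for (int i = 3; i >= 0; i--) {
		if (pd->pir[i] == 0)
			continue;
		for (int bit = 63; bit >= 0; bit--) {
			if (pd->pir[i] & ((uint64_t)1 << bit))
				return (i * 64 + bit);
		}
	}
	return (-1);
}

/* Model of the earlier behaviour: RVI is consulted only if pending == 0. */
static int
pending_intr_before(const struct pir_model *pd, uint16_t intr_status,
    uint8_t ppr)
{
	int vec;

	if (!pd->pending)
		return ((intr_status & PRIO_CLASS_MASK) >
		    (ppr & PRIO_CLASS_MASK));
	vec = highest_posted_vector(pd);
	if (vec < 0)
		return (0);	/* pending set, pir empty: interrupt missed */
	return ((vec & PRIO_CLASS_MASK) > (ppr & PRIO_CLASS_MASK));
}

/* Model of the corrected behaviour: RVI is consulted in both cases. */
static int
pending_intr_after(const struct pir_model *pd, uint16_t intr_status,
    uint8_t ppr)
{
	int vec;

	if ((intr_status & PRIO_CLASS_MASK) > (ppr & PRIO_CLASS_MASK))
		return (1);
	if (!pd->pending)
		return (0);
	vec = highest_posted_vector(pd);
	if (vec < 0)
		return (0);
	return ((vec & PRIO_CLASS_MASK) > (ppr & PRIO_CLASS_MASK));
}
```

In this model, a descriptor with pending set to 1, all pir words zero, intr_status 0xfe and a processor priority of 0x20 yields 0 from pending_intr_before and 1 from pending_intr_after, which corresponds to the HLT loop described in the summary.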

The same issue may also be the one mentioned here: debian-vm-freezes-after-several-hours

Sponsored by: vStack

Test Plan

Check several types of VM: Linux, Windows, FreeBSD.

Verify that the VM has idle time and that Linux VMs (RHEL9, OEL9, etc.) no longer show lockups or RCU stalls.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

This revision is now accepted and ready to land. Apr 18 2023, 6:11 AM

@jhb @markj Could you take a look? This fix could potentially be included in 13.2.

What exactly is the problematic scenario? How do we end up with a pending, undelivered interrupt after a vmexit? Presumably the guest must have enabled interrupts before executing HLT.

The change itself seems reasonable to me, but I'd like to understand it better.

sys/amd64/vmm/intel/vmx.c
3785

This is section 30.2.1 now.

> What exactly is the problematic scenario? How do we end up with a pending, undelivered interrupt after a vmexit? Presumably the guest must have enabled interrupts before executing HLT.
>
> The change itself seems reasonable to me, but I'd like to understand it better.

Look at vmx.c, line 4016.
It has a good explanation of how this can occur. Another example, one that does not involve posted interrupts, could be asynchronous changes between pending and pir[].

An interrupt can also remain undelivered when the conditions for "evaluation of pending interrupt" (section 30.2 of the SDM) do not hold.

> Presumably the guest must have enabled interrupts before executing HLT.

There is already a check for that; see vm_handle_hlt(). The problem is not that interrupts are disabled during HLT. They are enabled.

gusev.vitaliy_gmail.com edited the summary of this revision.

Correct the SDM reference: Section 29.2.1 --> Section 30.2.1

This revision now requires review to proceed. Apr 24 2023, 4:32 PM
This revision is now accepted and ready to land. Apr 25 2023, 1:52 PM