vmm: Fix HLT loop while vcpu has requested virtual interrupts
Closed, Public

Authored by gusev.vitaliy_gmail.com on Apr 17 2023, 5:19 PM.

Details

Reviewers

corvink
jhb
markj
rgrimes
grehan
tychon
mav

Group Reviewers

bhyve

Commits

rG743938876959: vmm: fix HLT loop while vcpu has requested virtual interrupts
rG0912408a281f: vmm: fix HLT loop while vcpu has requested virtual interrupts

Summary

We have seen Linux VMs (kernels 5.4, 5.15, 6.x) get stuck, with messages like these in the guest's dmesg:

BUG: workqueue lockup - pool cpus=1 node=0 flags=0x0 nice=0 stuck for 37s!
watchdog: BUG: soft lockup - CPU#3 stuck for 26s! [kworker/3:3:2090493]
rcu: INFO: rcu_sched detected stalls on CPUs/tasks:

The physical CPUs are Intel(R) Xeon(R) Gold 6248R with the "Virtual Interrupt Delivery" and "Process posted interrupts" features enabled.

Investigation showed that vmx_pending_intr returns 0 even when vmexit->u.hlt.intr_status was 0xfe, 0xfd, ... and pir_desc->pending was 1.

The function vmx_inject_pir has a comment about the situation where 'pending' is 1 and all pirval values are zero:

* It is possible for pirval to be 0 here, even though the
* pending bit has been set. The scenario is:
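
The quoted comment is cut off before the scenario itself. One interleaving that would leave software seeing the pending bit set with an empty pir[] is sketched below as a deterministic, single-threaded replay. The ordering of the steps is an assumption made for illustration, and every name (sim_pir_desc, sim_replay, the sender_* and receiver_* helpers) is hypothetical rather than taken from vmx.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a posted-interrupt descriptor. */
struct sim_pir_desc {
	uint64_t pir[4];	/* one request bit per vector (256 total) */
	bool pending;		/* "a notification is outstanding" flag */
};

/* Sender, step 1: request a vector by setting its bit. */
static void
sender_set_pir(struct sim_pir_desc *d, unsigned vec)
{
	assert(vec < 256);
	d->pir[vec / 64] |= (uint64_t)1 << (vec % 64);
}

/* Sender, step 2: announce that something was posted. */
static void
sender_set_pending(struct sim_pir_desc *d)
{
	d->pending = true;
}

/* Receiver, step 1: acknowledge the notification. */
static void
receiver_clear_pending(struct sim_pir_desc *d)
{
	d->pending = false;
}

/* Receiver, step 2: consume and clear every requested vector. */
static void
receiver_drain_pir(struct sim_pir_desc *d)
{
	for (int i = 0; i < 4; i++)
		d->pir[i] = 0;
}

/*
 * Replays one assumed interleaving: the receiver is already handling
 * an earlier notification when the sender posts a new vector.
 */
static struct sim_pir_desc
sim_replay(unsigned vec)
{
	struct sim_pir_desc d = { { 0, 0, 0, 0 }, true };

	receiver_clear_pending(&d);	/* receiver acknowledges */
	sender_set_pir(&d, vec);	/* sender requests the vector */
	receiver_drain_pir(&d);		/* receiver consumes it early */
	sender_set_pending(&d);		/* sender's flag arrives late */
	return (d);
}
```

Calling sim_replay(0x31) returns a descriptor whose pending flag is set while all four pir words are zero, which is the state the comment describes.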

This change corrects the initial fix, 02cc877968bbcd57695035c67114a67427f54549 ("Recognize a pending virtual interrupt while emulating the halt instruction"), so that it covers both cases: pending is 0 and pending is 1.
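
To make the two cases concrete, here is a minimal sketch of a wake-up check that looks at the requested virtual interrupt before it looks at the pending bit. It is a simplified model, not the committed code: the names (model_pir_desc, model_hlt_should_wake, MODEL_PRIO_CLASS) are hypothetical, and it assumes the low byte of the saved interrupt status holds the requested vector, whose upper nibble is compared against the upper nibble of the processor priority.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PRIO_CLASS 0xf0u	/* upper nibble = priority class */

/* Hypothetical stand-in for a posted-interrupt descriptor. */
struct model_pir_desc {
	uint64_t pir[4];	/* one request bit per vector */
	uint64_t pending;	/* nonzero when a notification is outstanding */
};

/* Returns the highest requested vector, or -1 if pir[] is empty. */
static int
model_highest_vector(const struct model_pir_desc *d)
{
	assert(d != NULL);
	for (int i = 3; i >= 0; i--) {
		for (int bit = 63; bit >= 0 && d->pir[i] != 0; bit--) {
			if ((d->pir[i] >> bit) & 1)
				return (i * 64 + bit);
		}
	}
	return (-1);
}

/* Should a vcpu that executed HLT be woken up? */
static bool
model_hlt_should_wake(uint16_t intr_status, uint8_t ppr,
    const struct model_pir_desc *d)
{
	unsigned rvi = intr_status & MODEL_PRIO_CLASS;
	unsigned prio = ppr & MODEL_PRIO_CLASS;
	int vec;

	/* Checked first, whether or not the pending bit is set. */
	if (rvi > prio)
		return (true);

	/* Nothing was posted since the last drain. */
	if (d->pending == 0)
		return (false);

	/* Otherwise a posted vector must outrank the processor priority. */
	vec = model_highest_vector(d);
	return (vec >= 0 && ((unsigned)vec & MODEL_PRIO_CLASS) > prio);
}
```

With pending set, an empty pir[], an interrupt status of 0xfe and a processor priority of 0x20, this model reports that the vcpu should wake. A check that consulted only pir[] once pending was set would report the opposite, which matches the loop described in the summary.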

This may also be the issue reported here: debian-vm-freezes-after-several-hours

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

gusev.vitaliy_gmail.com created this revision. Apr 17 2023, 5:19 PM

Herald added a reviewer: bhyve. Apr 17 2023, 5:19 PM

Herald added subscribers: bcran, imp.

gusev.vitaliy_gmail.com requested review of this revision. Apr 17 2023, 5:19 PM

afedorov added a subscriber: afedorov. Apr 17 2023, 7:17 PM

LGTM

This revision is now accepted and ready to land. Apr 18 2023, 6:11 AM

gusev.vitaliy_gmail.com edited the test plan for this revision. Apr 18 2023, 7:43 PM

@jhb @markj Could you take a look at this? This fix could potentially be included in 13.2.

up?

What exactly is the problematic scenario? How do we end up with a pending, undelivered interrupt after a vmexit? Presumably the guest must have enabled interrupts before executing HLT.

The change itself seems reasonable to me, but I'd like to understand it better.

sys/amd64/vmm/intel/vmx.c, line 3785: This is section 30.2.1 now.

In D39620#905476, @markj wrote:

What exactly is the problematic scenario? How do we end up with a pending, undelivered interrupt after a vmexit? Presumably the guest must have enabled interrupts before executing HLT.

The change itself seems reasonable to me, but I'd like to understand it better.

Look at vmx.c, line 4016.
It has a good explanation of how this can occur. Another example, one that does not involve posted interrupts, could arise from asynchronous changes between pending and pir[].

An interrupt can also remain undelivered when the conditions in "evaluation of pending interrupt" (section 30.2 of the SDM) are not met.

Presumably the guest must have enabled interrupts before executing HLT.

There is already a check for that. Look at vm_handle_hlt(). The problem is not that interrupts are disabled during HLT. They are enabled.

Correct the SDM reference: Section 29.2.1 --> Section 30.2.1

This revision now requires review to proceed. Apr 24 2023, 4:32 PM

gusev.vitaliy_gmail.com marked an inline comment as done. Apr 24 2023, 4:32 PM

markj accepted this revision as: markj. Apr 25 2023, 1:52 PM

This revision is now accepted and ready to land. Apr 25 2023, 1:52 PM

Closed by commit rG0912408a281f: vmm: fix HLT loop while vcpu has requested virtual interrupts (authored by gusev.vitaliy_gmail.com, committed by corvink). Apr 26 2023, 8:39 AM

This revision was automatically updated to reflect the committed changes.

corvink added a commit: rG0912408a281f: vmm: fix HLT loop while vcpu has requested virtual interrupts.

corvink added a commit: rG743938876959: vmm: fix HLT loop while vcpu has requested virtual interrupts. May 8 2023, 8:28 AM

Hi all,

I believe I am experiencing this bug on TrueNAS 13.0-U5.2 (not quite the newest, but almost).

I'm not very familiar with phabricator, so I'm unsure: is this fixed in FreeBSD 13.2 or only 14.0?

Is there any workaround until TrueNAS is updated to a newer FreeBSD?

Many thanks.

In D39620#978503, @sean_rogue-research.com wrote:

...
I'm not very familiar with phabricator, so I'm unsure: is this fixed in FreeBSD 13.2 or only 14.0?

stable/13 has this patch
releng/13.2 doesn't have this patch (yet).

Is there any workaround until TrueNAS is updated to a newer FreeBSD?

You have a reproducer on Linux, right? You could try a 6.x Linux kernel. Please report whether it helps.

stable/13 has this patch

OK great. TrueNAS will probably go there before they go to 14.

You have a reproducer on Linux, right? You could try a 6.x Linux kernel. Please report whether it helps.

Assuming the bug I see is indeed this bug, I reproduce in all 3 of my linux VMs, running Ubuntu 20.04 LTS and 22.04 LTS, which are all linux kernel 5.x. Looks like there is no LTS Ubuntu with kernel 6.x until upcoming 24.04. But 23.10 uses kernel 6.5, which I could try. Is kernel 6.x expected to fix something?

In D39620#978525, @sean_rogue-research.com wrote:

Assuming the bug I see is indeed this bug, I reproduce in all 3 of my linux VMs, running Ubuntu 20.04 LTS and 22.04 LTS, which are all linux kernel 5.x. Looks like there is no LTS Ubuntu with kernel 6.x until upcoming 24.04. But 23.10 uses kernel 6.5, which I could try. Is kernel 6.x expected to fix something?

It could be, but I would expect that the Linux 6.x kernel just changed behaviour a little, so the race that causes the issue is hit more rarely.

It could be, but I would expect that the Linux 6.x kernel just changed behaviour a little, so the race that causes the issue is hit more rarely.

OK, we reproduce easily with an Ubuntu 22.04.6 VM running gitlab. It freezes every few days. I'll try to update the server to 23.10 and see what happens.

And just to link together a few things, in case it helps others (we've been chasing this bug for months, part time):

Related TrueNAS ticket: https://ixsystems.atlassian.net/browse/NAS-122108
Similar old FreeBSD ticket: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=222916
Possibly related: https://reviews.freebsd.org/rG2c352feb3bf93c679f9e41d65bc8dc8394a4fdae
Possibly related: https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=275155
TrueNAS forum thread 1: https://www.truenas.com/community/threads/debian-vm-freezes-after-several-hours.106367/#post-742515
TrueNAS forum thread 2: https://www.truenas.com/community/threads/bhyve-with-ubuntu-19-04-keeps-locking-up.79191/page-3#post-757796

stable/13 has this patch
releng/13.2 doesn't have this patch (yet).

I'm not very familiar with FreeBSD's branching system... I see FreeBSD 13.3-RELEASE was released today, is this bug fix included?

In D39620#1008905, @sean_rogue-research.com wrote:

stable/13 has this patch
releng/13.2 doesn't have this patch (yet).

I'm not very familiar with FreeBSD's branching system... I see FreeBSD 13.3-RELEASE was released today, is this bug fix included?

Yes. releng/13.3 was branched from stable/13 just about a month ago, so it sure got it.

Revision Contents

Path: sys/amd64/vmm/intel/vmx.c
Size: 44 lines

Diff 121067
