Writes to the MSI and MSI-X capability registers and table entries
happen in vCPU threads as a result of MMIO or PCI config space
accesses, but triggering of interrupts can occur either on other vCPU
threads or in worker threads for block and network devices. Use a
mutex to ensure that each MSI or MSI-X interrupt generation uses a
consistent snapshot of an MSI-X table entry or MSI capability register
set.
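The locking scheme the summary describes can be sketched roughly as follows. This is a simplified illustration, not the actual patch: the struct layout and function names (`msi_cfgwrite`, `msi_generate`) are hypothetical, though `pi_intr_lock` mirrors the name used in the review comments below.

```c
#include <pthread.h>
#include <stdint.h>

/* Hypothetical, simplified view of bhyve's per-device interrupt state. */
struct pci_devinst {
	pthread_mutex_t	pi_intr_lock;	/* protects the cached message below */
	uint64_t	pi_msi_addr;	/* cached MSI address */
	uint32_t	pi_msi_data;	/* cached MSI data */
};

/* vCPU thread: a config-space write updates the cached message under the lock. */
static void
msi_cfgwrite(struct pci_devinst *pi, uint64_t addr, uint32_t data)
{
	pthread_mutex_lock(&pi->pi_intr_lock);
	pi->pi_msi_addr = addr;
	pi->pi_msi_data = data;
	pthread_mutex_unlock(&pi->pi_intr_lock);
}

/* Worker or vCPU thread: interrupt generation snapshots both fields together,
 * so it never sees a half-updated address/data pair. */
static void
msi_generate(struct pci_devinst *pi, uint64_t *addr, uint32_t *data)
{
	pthread_mutex_lock(&pi->pi_intr_lock);
	*addr = pi->pi_msi_addr;
	*data = pi->pi_msi_data;
	pthread_mutex_unlock(&pi->pi_intr_lock);
}
```

Note the lock only guarantees a consistent snapshot of the cached pair; as discussed later in this review, it cannot make a guest's multi-write update of the capability atomic.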
Details
- Reviewers
None
- Group Reviewers
bhyve
- not even compiled, much less tested, but a test candidate for issues reported after Linux changed its MSI writing
Diff Detail
- Repository
- rS FreeBSD src repository - subversion
- Lint
Lint Passed
- Unit
No Test Coverage
- Build Status
Buildable 30708 Build 28438: arc lint + arc unit
Event Timeline
usr.sbin/bhyve/pci_emul.c:894
> Maybe I am missing something, but where is the unlock for this? And will the recursion inside pci_lintr_update() below be fine?
usr.sbin/bhyve/pci_emul.c:894
> Nope, this was a rough-and-ready patch and I missed the unlock here and in the next function. I'll avoid the recursion as well.
I don't think the tread/twrite routines need any locking: the updates are plain memory operations. Modifying an MSI-X table entry always takes multiple writes, so the lock can't protect the update anyway.
I hit the build error below. After removing the assert, everything built cleanly. However, the Ubuntu 20.04 Server live installer still fell over with > 1 vCPU.
```
# uname -a
FreeBSD hostname 13.0-CURRENT FreeBSD 13.0-CURRENT #0 r360181: Wed Apr 22 13:11:36 AEST 2020 root@hostname:/usr/obj/usr/src/amd64.amd64/sys/GENERIC amd64
```

```
/usr/src/usr.sbin/bhyve/pci_emul.c:1609:9: error: implicit declaration of function 'pthread_mutex_isowned_np' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
        assert(pthread_mutex_isowned_np(&pi->pi_intr_lock));
               ^
/usr/src/usr.sbin/bhyve/pci_emul.c:1609:9: note: did you mean 'pthread_mutex_init'?
/usr/include/pthread.h:236:6: note: 'pthread_mutex_init' declared here
int pthread_mutex_init(pthread_mutex_t * _restrict _mutex,
    ^
1 error generated.
*** Error code 1
Stop.
```
usr.sbin/bhyve/pci_emul.c:39–40
> My build failed as well; I think we'll need #include <pthread_np.h> here.
Rather than locking, I'm thinking this might be the lack of PBA emulation for MSI-X (which in turn will require locking, though of a slightly different kind).
The relevant commit to the Linux kernel that breaks this replaces a single MSI-X table update with multiple updates in quick succession (Linux uses a multi-step process to migrate to a new IDT entry on the current CPU before changing CPUs). The code relies on checking the LAPIC IRR register, not the PBA, to determine whether it hit a race with an interrupt arriving after the first update and before the second. This is not a definite diagnosis, as I haven't reproduced it locally with tracing to see when interrupts are lost. It may be that tread/twrite don't need locking, since that race of interrupt delivery with updates is exactly what Linux is trying to protect against. FreeBSD masks each table entry before writing and unmasks it after to avoid this race.
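The mask-before-write pattern mentioned above can be sketched like this. It is an illustration of the technique, not FreeBSD's actual driver code; the struct loosely mirrors an MSI-X table entry (address, data, vector control with a per-entry mask bit).

```c
#include <stdint.h>

#define MSIX_ENTRY_MASKED	0x1u	/* bit 0 of vector control */

/* Simplified MSI-X table entry. */
struct msix_entry {
	uint64_t addr;
	uint32_t data;
	uint32_t vector_control;
};

/* Update the entry safely: mask first so an interrupt cannot be delivered
 * while the multi-field update is in flight, then unmask. */
static void
msix_update_entry(struct msix_entry *e, uint64_t new_addr, uint32_t new_data)
{
	e->vector_control |= MSIX_ENTRY_MASKED;	 /* no delivery from here on */
	e->addr = new_addr;			 /* multi-write update is now safe */
	e->data = new_data;
	e->vector_control &= ~MSIX_ENTRY_MASKED; /* re-enable delivery */
}
```

With this pattern the device can never observe a half-updated address/data pair, which is why MSI-X (with its mandatory per-entry masking) avoids the race that plain, non-maskable MSI cannot.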
Hmmm, re-reading the commit log at https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6f1a4891a5928a5969c87fa5a584844c983ec823, this seems to affect only MSI and not MSI-X, and there I think the main issue is the locking for generate vs. the config writes, since a config write to the address/data fields is not handled atomically. Actually, see my other comment. I haven't dug in, but given that the commit is specifically about moving an already-enabled MSI that doesn't support masking, I wonder if Linux is now writing only to the address/data fields and not the others.
usr.sbin/bhyve/pci_emul.c:953
> So now I wonder if there is an issue where we only update pi_msi.addr and pi_msi.msg_data when the control register is written, and not when the address or data registers are written?
- Add pthread_np.h.
- Don't lock table read/write.
- Recompute MSI message state for all writes to the MSI capability.
I do think Linux would avoid the song and dance if we were to implement support for MSI masking for the "plain" MSI capability. There is also the existing bug that we don't set a PBA bit and refire an MSI-X interrupt if it triggers while the table entry is masked. That would perhaps require locking for twrite at least, but not tread. Given that the updates to fields in the cap writes are generally non-atomic, I do think we probably need locking for those as well as for generate, so that generate always uses a consistent snapshot of the capability state.
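The missing PBA behaviour described above amounts to latching interrupts that arrive while an entry is masked and refiring them on unmask. A minimal sketch of that semantics (hypothetical names, not a bhyve patch):

```c
#include <stdbool.h>

/* Hypothetical per-vector state for sketching PBA semantics. */
struct msix_vec {
	bool masked;	/* per-entry mask bit from vector control */
	bool pending;	/* corresponding bit in the Pending Bit Array */
	int  fired;	/* count of interrupts actually delivered */
};

/* Device raises an interrupt: deliver if unmasked, else latch it in the PBA. */
static void
msix_trigger(struct msix_vec *v)
{
	if (v->masked)
		v->pending = true;
	else
		v->fired++;
}

/* Guest unmasks the entry: refire any pending interrupt and clear the PBA bit. */
static void
msix_unmask(struct msix_vec *v)
{
	v->masked = false;
	if (v->pending) {
		v->pending = false;
		v->fired++;
	}
}
```

Without the `pending` latch, an interrupt that fires while the entry is masked is simply lost, which matches the bug described above.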
Build: OK
Ubuntu 20.04 LTS Live Server install: OK, no crashing.
Other notes: No longer getting storage errors during the install.
The latest patch seems functional for me as well:
```
vmrun -m 4G -c 2 -E -d /dev/zvol/system/vms/ubuntu/disk0 -I ubuntu-20.04-desktop-amd64.iso -L 192.168.1.190 -T -i ubuntu
```
I was having trouble at first, but raising the RAM size from 1GB to 4GB did the trick.
When the offending commit in Linux was first brought to my attention, I reached out to the author of the commit for their input on what specifically we might need to look at. Here is the response I received this afternoon, for posterity:
I assume the way how that patch works around non-maskable MSI problems is clear from the changelog. If not, feel free to ask. My assumption vs. bhyve is (backed by the 'xx.x spurious vector' messages) that if the device raises an interrupt in an intermediate state, the hypervisor fails to register it in the IRR of the local APIC before the checks happen.

Assume the MSI message moves from CPU1 vector 50 to CPU2 vector 51. msi_set_affinity() is called with interrupts disabled on CPU1.

Step 1) Write vector 51 into the message
Step 2) Write APICID(CPU2) into the message
Step 3) Check whether vector 51 triggered on CPU1 (check APIC IRR). If set, retrigger vector 51 on CPU2 (IPI)

Thanks, Thomas
Relaying a couple more messages from Thomas:
I just had a quick look at your patch. AFAICT, up to now the write to the MSI config space was not serialized against interrupt delivery. But I don't see how that will cure the problem. That's what hit us on real hardware:

```
msi_msg->address_hi = MSI_ADDR_BASE_HI;  // constant
msi_msg->address_lo = MSI_ADDR_BASE_LO | DESTMODE | DESTID(dest_apicid);
msi_msg->data       = DATA_MODE | vector;
```

So let's assume all the constants are 0 and we use the same example, i.e. affinity changes from vector 50 CPU1 (apicid 1) to vector 51 CPU2 (apicid 2). The original message contains:

```
msi_msg->address_hi = 0;
msi_msg->address_lo = 1;   // apicid 1
msi_msg->data       = 50;  // vector 50
```

So now the new message has:

```
msi_msg->address_hi = 0;
msi_msg->address_lo = 2;   // apicid 2
msi_msg->data       = 51;  // vector 51
```

The write order is: address_lo, address_hi, data. So we get the following states:

step | address_lo | address_hi | data | notes
---|---|---|---|---
0 | 1 | 0 | 50 | original
1 | 2 | 0 | 50 | address_lo written
2 | 2 | 0 | 50 | address_hi written
3 | 2 | 0 | 51 | data written

That means an interrupt raised in the device between step #1 and step #3 will end up on CPU2 vector 50. But of course, as we are on CPU1, we can neither disable interrupts on CPU2 (to prevent immediate delivery) nor check the APIC IRR of CPU2. That means the interrupt will either end up as a spurious vector on CPU2, or, if vector 50 on CPU2 is assigned, it will be accounted as a spurious device interrupt on the device which has CPU2 vector 50 assigned to it.

With the new scheme we do:

step | address_lo | address_hi | data | notes
---|---|---|---|---
0 | 1 | 0 | 50 | original
1 | 1 | 0 | 50 | address_lo written
2 | 1 | 0 | 50 | address_hi written
3 | 1 | 0 | 51 | data written
4 | 2 | 0 | 51 | address_lo written
5 | 2 | 0 | 51 | address_hi written
6 | 2 | 0 | 51 | data written

So if the device raises an interrupt before step #3 it will go to the original vector 50 on CPU1. That's fine; we are prepared to handle that. If it raises an interrupt after step #3 and before step #6 it will go to the new vector 51 on CPU1.
As we have interrupts disabled on CPU1, we can check vector 51 on the local APIC IRR. If it's set, then the device raised an interrupt between step #3 and step #6 and we IPI to CPU2 vector 51 to replay it.

The locking change makes sense on its own, as the writes are 32-bit but you're reading 64-bit values when calling vm_lapic_msi(). While correct per se, I don't see how your locking change will help with the problem at hand. The 32-bit writes to the config space (MSI table) are trapped one by one, so after each write the mutex is dropped and the interrupt delivery will see the corresponding state. The new scheme has a slightly wider window than the old one because between step #3 and #4 there is more computation going on. So your MSI delivery

```
if (pci_msi_enabled(pi) && index < pci_msi_maxmsgnum(pi)) {
	vm_lapic_msi(pi->pi_vmctx, pi->pi_msi.addr,
	    pi->pi_msi.msg_data + index);
}
```

can observe the steps with or without the locking change, if I understand the code correctly.

Btw, MSI-X is not affected because MSI-X has mandatory masking per entry and we mask the entry before changing the affinity.

Hope that helps. Thanks, Thomas
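Thomas's state tables can be replayed in a toy model: three independent 32-bit writes with the delivery target sampled in between. The encoding below (apicid in `address_lo`, vector in `data`) is deliberately simplified to match his "constants are 0" example, not real MSI address/data layout.

```c
#include <stdint.h>

/* Toy MSI message: the same three fields Thomas describes. */
struct msi_msg {
	uint32_t address_lo;	/* encodes destination apicid (simplified) */
	uint32_t address_hi;
	uint32_t data;		/* encodes vector (simplified) */
};

/* Where would an interrupt raised right now be delivered? */
static void
msi_sample(const struct msi_msg *m, uint32_t *apicid, uint32_t *vector)
{
	*apicid = m->address_lo;
	*vector = m->data;
}
```

Under the old Linux scheme, sampling after `address_lo` is written but before `data` yields the misdirected combination (CPU2, stale vector 50) that bhyve cannot replay, since the IRR check happens on CPU1.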
And a follow up from him:
I just read through the comments there and as I don't have an account I use you as a proxy.

Linux masks as well _IF_ the table entry is MSI-X or if MSI has the maskable property set.

There are no other fields. The MSI message (ditto the MSI-X table entry) consists solely of address and data.

We cannot disable MSI across the update either, because devices can then reroute an interrupt to the INTx pin if they do not support INTx disable. That would just move the problem to an IOAPIC pin ...

We discussed our approach back and forth with hardware people and they confirmed that this is safe from a device POV.

Hope that helps.

Thomas
@freqlabs, it is not just locking; the other change, which is probably the actual fix (the locking only addresses a small race), is this: previously, if the Linux kernel wrote only to the Address or Data registers in the MSI (not MSI-X) capability without later writing to the MSI capability control register, bhyve failed to notice the updated Address/Data pair and kept sending interrupts to the old destination. This is because bhyve "caches" the Address and Data values internally in struct pci_devinst and was only refreshing that cache when the control register was written. The change to msicap_cfgwrite ensures that a write to any register in the MSI capability refreshes these cached values. I assume part of the change in the upstream Linux commit is that it now writes only the Address and Data registers without touching the control register.
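The fixed behaviour can be sketched like this. The struct, register offsets, and function body are illustrative only (loosely following the PCI MSI capability layout), not bhyve's actual msicap_cfgwrite; the point is the unconditional cache refresh.

```c
#include <stdint.h>

/* Illustrative offsets within the MSI capability (64-bit address capable). */
#define PCIR_MSI_CTRL		2
#define PCIR_MSI_ADDR		4
#define PCIR_MSI_ADDR_HIGH	8
#define PCIR_MSI_DATA		12

struct msi_state {
	uint32_t addr_lo, addr_hi;
	uint32_t data;
	uint64_t cached_addr;	/* what interrupt generation actually uses */
	uint32_t cached_data;
};

/* After the fix: a write to ANY register in the capability refreshes the
 * cached address/data pair, not just a write to the control register. */
static void
msicap_cfgwrite(struct msi_state *s, int off, uint32_t val)
{
	switch (off) {
	case PCIR_MSI_ADDR:      s->addr_lo = val; break;
	case PCIR_MSI_ADDR_HIGH: s->addr_hi = val; break;
	case PCIR_MSI_DATA:      s->data = val;    break;
	}
	/* Unconditional refresh: Linux may write only address/data and
	 * never touch the control register again. */
	s->cached_addr = ((uint64_t)s->addr_hi << 32) | s->addr_lo;
	s->cached_data = s->data;
}
```

With the old behaviour (refresh only on a `PCIR_MSI_CTRL` write), the guest's address/data updates would be stored in the registers but never propagated to the cached pair that generation reads, which is exactly the stale-destination symptom described above.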
And this is indeed the AHCI controller. MSI is being used in the known-working 19.10, but there is no migration going on there. Disabling the MSI capability on the controller and falling back to legacy interrupts allowed an 8-vCPU install to complete and also showed the interrupt being migrated to another vCPU during the install (from /proc/interrupts).
The MSI changes in this review seem the way forward.
I can split those out into a separate review and we can think about any locking separately. I still think there are some (small) race conditions around cfgcap writes that update multiple fields, but those races should be rare.
I've created a separate review for just the MSI change (D24593). I'll leave this one around to think about locking we may want after the MSI fix is tested and merged upstream.