Writes to the MSI and MSI-X capability registers and table entries
happen in vCPU threads as a result of MMIO or PCI config space
accesses, but triggering of interrupts can occur either on other vCPU
threads or in worker threads for block and network devices. Use a
mutex to ensure that each MSI or MSI-X interrupt generation uses a
consistent snapshot of an MSI-X table entry or MSI capability register
set.
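The locking scheme the summary describes can be sketched roughly as follows. This is a simplified illustration, not the actual patch: the struct layout and function names (`msi_cfgwrite`, `msi_generate`) are hypothetical, though `pi_intr_lock` mirrors the name used in the review comments below.

```c
#include <pthread.h>
#include <stdint.h>

/* Hypothetical, simplified view of bhyve's per-device interrupt state. */
struct pci_devinst {
	pthread_mutex_t	pi_intr_lock;	/* protects the cached message below */
	uint64_t	pi_msi_addr;	/* cached MSI address */
	uint32_t	pi_msi_data;	/* cached MSI data */
};

/* vCPU thread: a config-space write updates the cached message under the lock. */
static void
msi_cfgwrite(struct pci_devinst *pi, uint64_t addr, uint32_t data)
{
	pthread_mutex_lock(&pi->pi_intr_lock);
	pi->pi_msi_addr = addr;
	pi->pi_msi_data = data;
	pthread_mutex_unlock(&pi->pi_intr_lock);
}

/* Worker or vCPU thread: interrupt generation snapshots both fields together,
 * so it never sees a half-updated address/data pair. */
static void
msi_generate(struct pci_devinst *pi, uint64_t *addr, uint32_t *data)
{
	pthread_mutex_lock(&pi->pi_intr_lock);
	*addr = pi->pi_msi_addr;
	*data = pi->pi_msi_data;
	pthread_mutex_unlock(&pi->pi_intr_lock);
}
```

Note the lock only guarantees a consistent snapshot of the cached pair; as discussed later in this review, it cannot make a guest's multi-write update of the capability atomic.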
Details
- Reviewers
None
- Group Reviewers
bhyve
- not even compiled, much less tested, but a test candidate for issues reported after Linux changed its MSI writing
Diff Detail
- Repository
- rS FreeBSD src repository - subversion
- Lint
Lint Passed
- Unit
No Test Coverage
- Build Status
Buildable 30708 Build 28438: arc lint + arc unit
Event Timeline
usr.sbin/bhyve/pci_emul.c:894
> Maybe I am missing something, but where is the unlock for this? And will the recursion inside pci_lintr_update() below be fine?
usr.sbin/bhyve/pci_emul.c:894
> Nope, this was a rough-and-ready patch and I missed the unlock here and in the next function. I'll avoid the recursion as well.
I don't think the tread/twrite routines need any locking: the updates are plain memory operations. Modifying an MSI-X table entry always takes multiple writes, so the lock can't protect the update anyway.
I hit the build error below. After removing the assert, everything built cleanly. However, the Ubuntu 20.04 Server live installer still fell over with > 1 vCPU.
```
# uname -a
FreeBSD hostname 13.0-CURRENT FreeBSD 13.0-CURRENT #0 r360181: Wed Apr 22 13:11:36 AEST 2020 root@hostname:/usr/obj/usr/src/amd64.amd64/sys/GENERIC amd64
```

```
/usr/src/usr.sbin/bhyve/pci_emul.c:1609:9: error: implicit declaration of function 'pthread_mutex_isowned_np' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
        assert(pthread_mutex_isowned_np(&pi->pi_intr_lock));
               ^
/usr/src/usr.sbin/bhyve/pci_emul.c:1609:9: note: did you mean 'pthread_mutex_init'?
/usr/include/pthread.h:236:6: note: 'pthread_mutex_init' declared here
int pthread_mutex_init(pthread_mutex_t * _restrict _mutex,
    ^
1 error generated.
*** Error code 1
Stop.
```
usr.sbin/bhyve/pci_emul.c:39–40
> My build failed as well; I think we'll need #include <pthread_np.h> here.
Rather than locking, I'm thinking this might be the lack of PBA emulation for MSI-X (which in turn will require locking, though of a slightly different kind).
The relevant commit to the Linux kernel that breaks this replaces a single MSI-X table update with multiple updates in quick succession (Linux uses a multi-step process to migrate to a new IDT entry on the current CPU before changing CPUs). The code relies on checking the LAPIC IRR register, not the PBA, to determine whether it hit a race with an interrupt arriving after the first update and before the second. This is not a definite diagnosis, as I haven't reproduced it locally with tracing to see when interrupts are lost. It may be that tread/twrite don't need locking, since that race of interrupt delivery with updates is exactly what Linux is trying to protect against. FreeBSD masks each table entry before writing and unmasks it after to avoid this race.
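The mask-before-write pattern mentioned above can be sketched like this. It is an illustration of the technique, not FreeBSD's actual driver code; the struct loosely mirrors an MSI-X table entry (address, data, vector control with a per-entry mask bit).

```c
#include <stdint.h>

#define MSIX_ENTRY_MASKED	0x1u	/* bit 0 of vector control */

/* Simplified MSI-X table entry. */
struct msix_entry {
	uint64_t addr;
	uint32_t data;
	uint32_t vector_control;
};

/* Update the entry safely: mask first so an interrupt cannot be delivered
 * while the multi-field update is in flight, then unmask. */
static void
msix_update_entry(struct msix_entry *e, uint64_t new_addr, uint32_t new_data)
{
	e->vector_control |= MSIX_ENTRY_MASKED;	 /* no delivery from here on */
	e->addr = new_addr;			 /* multi-write update is now safe */
	e->data = new_data;
	e->vector_control &= ~MSIX_ENTRY_MASKED; /* re-enable delivery */
}
```

With this pattern the device can never observe a half-updated address/data pair, which is why MSI-X (with its mandatory per-entry masking) avoids the race that plain, non-maskable MSI cannot.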
Hmmm, re-reading the commit log at https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6f1a4891a5928a5969c87fa5a584844c983ec823, this seems to affect only MSI and not MSI-X, and there I think the main issue is the locking for generate vs. the config writes, since a config write to the address/data fields is not handled atomically. Actually, see my other comment. I haven't dug in, but given that the commit is specifically about moving an already-enabled MSI that doesn't support masking, I wonder if Linux is now writing only to the address/data fields and not the others.
usr.sbin/bhyve/pci_emul.c:953
> So now I wonder if there is an issue where we only update pi_msi.addr and pi_msi.msg_data when the control register is written, and not when the address or data registers are written?
- Add pthread_np.h.
- Don't lock table read/write.
- Recompute MSI message state for all writes to the MSI capability.
I do think Linux would avoid the song and dance if we were to implement support for MSI masking for the "plain" MSI capability. There is also the existing bug that we don't set a PBA bit and refire an MSI-X interrupt if it triggers while the table entry is masked. That would perhaps require locking for twrite at least, but not tread. Given that the updates to fields in the cap writes are generally non-atomic, I do think we probably need locking for those as well as for generate, so that generate always uses a consistent snapshot of the capability state.
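The missing PBA behaviour described above amounts to latching interrupts that arrive while an entry is masked and refiring them on unmask. A minimal sketch of that semantics (hypothetical names, not a bhyve patch):

```c
#include <stdbool.h>

/* Hypothetical per-vector state for sketching PBA semantics. */
struct msix_vec {
	bool masked;	/* per-entry mask bit from vector control */
	bool pending;	/* corresponding bit in the Pending Bit Array */
	int  fired;	/* count of interrupts actually delivered */
};

/* Device raises an interrupt: deliver if unmasked, else latch it in the PBA. */
static void
msix_trigger(struct msix_vec *v)
{
	if (v->masked)
		v->pending = true;
	else
		v->fired++;
}

/* Guest unmasks the entry: refire any pending interrupt and clear the PBA bit. */
static void
msix_unmask(struct msix_vec *v)
{
	v->masked = false;
	if (v->pending) {
		v->pending = false;
		v->fired++;
	}
}
```

Without the `pending` latch, an interrupt that fires while the entry is masked is simply lost, which matches the bug described above.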
Build: OK
Ubuntu 20.04 LTS Live Server install: OK, no crashing.
Other notes: No longer getting storage errors during the install.
The latest patch seems functional for me as well:
```
vmrun -m 4G -c 2 -E -d /dev/zvol/system/vms/ubuntu/disk0 -I ubuntu-20.04-desktop-amd64.iso -L 192.168.1.190 -T -i ubuntu
```
I was having trouble at first, but raising the RAM size from 1GB to 4GB did the trick.
When the offending commit in Linux was first brought to my attention, I reached out to the author of the commit for their input on what specifically we might need to look at. Here is the response I received this afternoon, for posterity:
I assume the way how that patch works around non-maskable MSI problems is clear from the changelog. If not, feel free to ask. My assumption vs. bhyve is (backed by the 'xx.x spurious vector' messages) that if the device raises an interrupt in an intermediate state, the hypervisor fails to register it in the IRR of the local APIC before the checks happen.

Assume the MSI message moves from CPU1 vector 50 to CPU2 vector 51. msi_set_affinity() is called with interrupts disabled on CPU1.

Step 1) Write vector 51 into the message
Step 2) Write APICID(CPU2) into the message
Step 3) Check whether vector 51 triggered on CPU1 (check APIC IRR). If set, retrigger vector 51 on CPU2 (IPI)

Thanks, Thomas
Relaying a couple more messages from Thomas:
I just had a quick look at your patch. AFAICT, up to now the write to the MSI config space was not serialized against interrupt delivery. But I don't see how that will cure the problem. That's what hit us on real hardware:

```
msi_msg->address_hi = MSI_ADDR_BASE_HI;  // constant
msi_msg->address_lo = MSI_ADDR_BASE_LO | DESTMODE | DESTID(dest_apicid);
msi_msg->data       = DATA_MODE | vector;
```

So let's assume all the constants are 0 and we use the same example, i.e. affinity changes from vector 50 CPU1 (apicid 1) to vector 51 CPU2 (apicid 2). The original message contains:

```
msi_msg->address_hi = 0;
msi_msg->address_lo = 1;   // apicid 1
msi_msg->data       = 50;  // vector 50
```

So now the new message has:

```
msi_msg->address_hi = 0;
msi_msg->address_lo = 2;   // apicid 2
msi_msg->data       = 51;  // vector 51
```

The write order is: address_lo, address_hi, data. So we get the following states:

step | address_lo | address_hi | data | notes
---|---|---|---|---
0 | 1 | 0 | 50 | original
1 | 2 | 0 | 50 | address_lo written
2 | 2 | 0 | 50 | address_hi written
3 | 2 | 0 | 51 | data written

That means an interrupt raised in the device between step #1 and step #3 will end up on CPU2 vector 50. But of course, as we are on CPU1, we can neither disable interrupts on CPU2 (to prevent immediate delivery) nor check the APIC IRR of CPU2. That means the interrupt will either end up as a spurious vector on CPU2, or, if vector 50 on CPU2 is assigned, it will be accounted as a spurious device interrupt on the device which has CPU2 vector 50 assigned to it.

With the new scheme we do:

step | address_lo | address_hi | data | notes
---|---|---|---|---
0 | 1 | 0 | 50 | original
1 | 1 | 0 | 50 | address_lo written
2 | 1 | 0 | 50 | address_hi written
3 | 1 | 0 | 51 | data written
4 | 2 | 0 | 51 | address_lo written
5 | 2 | 0 | 51 | address_hi written
6 | 2 | 0 | 51 | data written

So if the device raises an interrupt before step #3 it will go to the original vector 50 on CPU1. That's fine; we are prepared to handle that. If it raises an interrupt after step #3 and before step #6 it will go to the new vector 51 on CPU1.
As we have interrupts disabled on CPU1, we can check vector 51 on the local APIC IRR. If it's set, then the device raised an interrupt between step #3 and step #6 and we IPI to CPU2 vector 51 to replay it.

The locking change makes sense on its own, as the writes are 32-bit but you're reading 64-bit values when calling vm_lapic_msi(). While correct per se, I don't see how your locking change will help with the problem at hand. The 32-bit writes to the config space (MSI table) are trapped one by one, so after each write the mutex is dropped and the interrupt delivery will see the corresponding state. The new scheme has a slightly wider window than the old one because between step #3 and #4 there is more computation going on. So your MSI delivery

```
if (pci_msi_enabled(pi) && index < pci_msi_maxmsgnum(pi)) {
	vm_lapic_msi(pi->pi_vmctx, pi->pi_msi.addr,
	    pi->pi_msi.msg_data + index);
}
```

can observe the steps with or without the locking change, if I understand the code correctly.

Btw, MSI-X is not affected because MSI-X has mandatory masking per entry and we mask the entry before changing the affinity.

Hope that helps. Thanks, Thomas
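Thomas's state tables can be replayed in a toy model: three independent 32-bit writes with the delivery target sampled in between. The encoding below (apicid in `address_lo`, vector in `data`) is deliberately simplified to match his "constants are 0" example, not real MSI address/data layout.

```c
#include <stdint.h>

/* Toy MSI message: the same three fields Thomas describes. */
struct msi_msg {
	uint32_t address_lo;	/* encodes destination apicid (simplified) */
	uint32_t address_hi;
	uint32_t data;		/* encodes vector (simplified) */
};

/* Where would an interrupt raised right now be delivered? */
static void
msi_sample(const struct msi_msg *m, uint32_t *apicid, uint32_t *vector)
{
	*apicid = m->address_lo;
	*vector = m->data;
}
```

Under the old Linux scheme, sampling after `address_lo` is written but before `data` yields the misdirected combination (CPU2, stale vector 50) that bhyve cannot replay, since the IRR check happens on CPU1.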
And a follow up from him:
I just read through the comments there and as I don't have an account I use you as a proxy.

Linux masks as well _IF_ the table entry is MSI-X or if MSI has the maskable property set.

There are no other fields. The MSI message (ditto the MSI-X table entry) consists solely of address and data.

We cannot disable MSI across the update either, because devices can then reroute an interrupt to the INTx pin if they do not support INTx disable. That would just move the problem to an IOAPIC pin ...

We discussed our approach back and forth with hardware people and they confirmed that this is safe from a device POV.

Hope that helps.

Thomas
@freqlabs, it is not just locking; the other change, which is probably the actual fix (the locking only addresses a small race), is this: previously, if the Linux kernel wrote only to the Address or Data registers in the MSI (not MSI-X) capability without later writing to the MSI capability control register, bhyve failed to notice the updated Address/Data pair and kept sending interrupts to the old destination. This is because bhyve "caches" the Address and Data values internally in struct pci_devinst and was only refreshing that cache when the control register was written. The change to msicap_cfgwrite ensures that a write to any register in the MSI capability refreshes these cached values. I assume part of the change in the upstream Linux commit is that it now writes only the Address and Data registers without touching the control register.
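The fixed behaviour can be sketched like this. The struct, register offsets, and function body are illustrative only (loosely following the PCI MSI capability layout), not bhyve's actual msicap_cfgwrite; the point is the unconditional cache refresh.

```c
#include <stdint.h>

/* Illustrative offsets within the MSI capability (64-bit address capable). */
#define PCIR_MSI_CTRL		2
#define PCIR_MSI_ADDR		4
#define PCIR_MSI_ADDR_HIGH	8
#define PCIR_MSI_DATA		12

struct msi_state {
	uint32_t addr_lo, addr_hi;
	uint32_t data;
	uint64_t cached_addr;	/* what interrupt generation actually uses */
	uint32_t cached_data;
};

/* After the fix: a write to ANY register in the capability refreshes the
 * cached address/data pair, not just a write to the control register. */
static void
msicap_cfgwrite(struct msi_state *s, int off, uint32_t val)
{
	switch (off) {
	case PCIR_MSI_ADDR:      s->addr_lo = val; break;
	case PCIR_MSI_ADDR_HIGH: s->addr_hi = val; break;
	case PCIR_MSI_DATA:      s->data = val;    break;
	}
	/* Unconditional refresh: Linux may write only address/data and
	 * never touch the control register again. */
	s->cached_addr = ((uint64_t)s->addr_hi << 32) | s->addr_lo;
	s->cached_data = s->data;
}
```

With the old behaviour (refresh only on a `PCIR_MSI_CTRL` write), the guest's address/data updates would be stored in the registers but never propagated to the cached pair that generation reads, which is exactly the stale-destination symptom described above.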
And this is indeed the AHCI controller. MSI is being used in the known-working 19.10, but there is no migration going on there. Disabling the MSI capability on the controller and falling back to legacy interrupts allowed an 8-vCPU install to complete and also showed the interrupt being migrated to another vCPU during the install (from /proc/interrupts).
The MSI changes in this review seem the way forward.
I can split those out into a separate review and we can think about any locking separately. I still think there are some (small) race conditions around cfgcap writes that update multiple fields, but those races should be rare.
I've created a separate review for just the MSI change (D24593). I'll leave this one around to think about locking we may want after the MSI fix is tested and merged upstream.