x86/ioapic: Stick to cpu0 for I/O APIC pins on Hyper-V.
Closed, Public

Authored by sepherosa_gmail.com on Sep 19 2016, 6:12 AM.

Details

Summary

Interrupts can be lost on Hyper-V if an I/O APIC pin is reprogrammed to target an AP.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

sepherosa_gmail.com retitled this revision from to x86: Don't shuffle interrupts' target CPU on Hyper-V..
sepherosa_gmail.com updated this object.
sepherosa_gmail.com edited the test plan for this revision. (Show Details)

A couple of comments:

  1. The shuffle is going to go away (I plan to enable EARLY_AP_STARTUP by default in 12 and possibly 11.x).
  2. Is the underlying issue that you don't handle MSI-X interrupt migration correctly? If so, we have a workaround for that case for Xen that you could use which prevents all migration (not just the boot-time migration).
In D7949#165052, @jhb wrote:

A couple of comments:

  1. The shuffle is going to go away (I plan to enable EARLY_AP_STARTUP by default in 12 and possibly 11.x).
  2. Is the underlying issue that you don't handle MSI-X interrupt migration correctly? If so, we have a workaround for that case for Xen that you could use which prevents all migration (not just the boot-time migration).
  • On Hyper-V, MSI-X interrupts are always allocated after the APs have started (with the WIP Hyper-V PCI-bridge code for PCI passthrough/SR-IOV), so the reshuffle is not actually involved.
  • On Hyper-V, only atkbd and ata allocate interrupts from the PIC (let's ignore the fdc there :), and neither is performance critical at all.

The main issue we found is that when the reshuffle happened, CAM was issuing commands to the ata controller, and the ata interrupt was lost due to the reshuffle, which led to all kinds of weirdness later on (mainly because the disk itself is shared between the ata controller and the Hyper-V synthetic controller).

I also believe the reshuffle causes a lot of headaches for CAM (or anything that depends on interrupts but runs asynchronously at boot time), since various commands are being issued by CAM when the reshuffle happens. I once worked on a system w/ 30+ ahci controllers; w/o a specially customized 'sync' in CAM or disabling the reshuffle, losing interrupts (MSI in this case) was guaranteed.

I am not going to mess with CAM too much this time since, as you have noted, EARLY_AP will be enabled by default on 11 and 12. So I believe disabling the reshuffle on Hyper-V is the best approach (well, I also need to MFC it to 10).

So the odd thing there is that in theory the shuffle shouldn't happen during CAM device discovery. The normal device discovery happens during the interrupt hooks run at SI_SUB_INT_CONFIG_HOOKS and thread0 waits for any interrupt hooks to finish before proceeding. The CAM bus scans only clear their hook once the bus scan is finished. However, what may be happening is that GEOM decides to asynchronously kick off tasting for the various partition providers, etc. after the CAM bus scans have finished. Even then, interrupts still shouldn't be lost as interrupt migration for an I/O APIC pin is careful to first mask the pin, then update it with the new (CPU, IDT) pair. After that is done, the old (CPU, IDT) pair is released via apic_free_vector() which uses sched_bind() to move to the old CPU and leave interrupts disabled for at least one instruction to catch any "in-flight" interrupts on the old CPU.
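
As a rough illustration of the ordering described above (this is not the actual io_apic.c code), the sequence could be modelled in userspace C roughly as follows. All `model_*` names are hypothetical, and the explicit unmask step is an assumption; the description above only names the mask, update, and release steps.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical userspace model of one I/O APIC redirection entry. */
struct model_pin {
	bool	 masked;	/* true while delivery is suppressed */
	uint32_t cpu;		/* destination CPU */
	uint32_t vector;	/* IDT vector on the destination CPU */
};

/* Hypothetical record of the (CPU, vector) pair that was released. */
struct model_released {
	bool	 valid;
	uint32_t cpu;
	uint32_t vector;
};

/*
 * Retarget a pin in the order described above: mask, reprogram,
 * unmask (assumed), and only then release the old (CPU, vector) pair.
 */
static void
model_pin_migrate(struct model_pin *pin, uint32_t new_cpu,
    uint32_t new_vector, struct model_released *old)
{
	uint32_t old_cpu = pin->cpu;
	uint32_t old_vector = pin->vector;

	pin->masked = true;		/* 1. stop new deliveries */
	pin->cpu = new_cpu;		/* 2. program the new target... */
	pin->vector = new_vector;	/*    ...while the pin is masked */
	assert(pin->masked);		/* the update must never race delivery */
	pin->masked = false;		/* 3. resume delivery (assumption) */

	/* 4. The old pair is released last, after the pin points elsewhere. */
	old->valid = true;
	old->cpu = old_cpu;
	old->vector = old_vector;
}
```

In the real kernel the last step is where apic_free_vector() and sched_bind() come in; the model only records which pair would be freed.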

It might be worth adding a 'DELAY(100)' or so after the sched_bind() to give any in-flight interrupts more time to hit the destination CPU (though the sched_bind() is going to go through an IPI send/ack if we weren't already running on the old CPU).

So my concern is that if the I/O APIC emulation in Hyper-V doesn't handle shuffling, then that means that users using 'cpuset -x' to move interrupts around from the command line can also lose interrupts. In that case, I think we should disable interrupt migration on Hyper-V in general (not just in the shuffle). However, it would be helpful to know what is actually broken. Xen had a bug with MSI-X in particular such that older versions of Xen didn't handle writes to MSI-X table entries while MSI-X was enabled (but the entry was masked). That affected 'cpuset -x' as well as the boot round-robin, so for Xen we disable all MSI-X migration on hypervisors with the bug.

AFAIR, certain commands (write cache, read-ahead, etc.) are sent asynchronously after the disks are attached, i.e. after the config_hook has finished.

I will check with Dexuan about MSI-X migration, since PCI passthrough is kinda working (we get a working passthrough-ed ix w/ MSI-X). However, if I recall correctly, he told me MSI-X migration works w/ 'cpuset -x' (not under heavy load, though). So currently I believe we only need to prevent migration of line-based interrupts, which really go through the IOAPIC.

And we did notice that the IOAPIC on Hyper-V 'sometimes' does not behave 100% according to the spec.

Confirmed w/ Dexuan: 'cpuset -x' works for PCI passthrough'ed MSI-X, so we only need to disable line interrupt migration.

sepherosa_gmail.com retitled this revision from x86: Don't shuffle interrupts' target CPU on Hyper-V. to x86/ioapic: Stick to cpu0 for I/O APIC pins on Hyper-V..
sepherosa_gmail.com updated this object.
sepherosa_gmail.com edited edge metadata.

Please review the new patch. We now keep all IOAPIC pins on cpu0 on Hyper-V, which should have the same effect for Hyper-V as the previous patch.

sys/x86/x86/io_apic.c
416 ↗(On Diff #20733)

I think this should fail with an error instead, so that the user gets an error when using 'cpuset -x' rather than having the request silently ignored. This is similar to the 'msix_disable_migration' test we use for Xen.

/* Leave I/O APIC pins routed to the boot CPU on Hyper-V. */
if (vm_guest == VM_GUEST_HV)
    return (EINVAL);
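
To make the intended behaviour concrete, here is a minimal userspace sketch of such a guard. It is only a model under stated assumptions: `model_vm_guest`, `model_intpin`, and `model_assign_cpu` are hypothetical stand-ins and do not reproduce the kernel's real types or the actual function in io_apic.c.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel's vm_guest values. */
enum model_vm_guest { MODEL_GUEST_NONE, MODEL_GUEST_HV, MODEL_GUEST_OTHER };

/* Hypothetical model of an interrupt pin's routing state. */
struct model_intpin {
	uint32_t cpu;	/* current destination CPU */
};

/*
 * Model of an assign-CPU routine: refuse the request on Hyper-V so the
 * caller sees an error instead of a silently ignored migration.
 */
static int
model_assign_cpu(enum model_vm_guest guest, struct model_intpin *pin,
    uint32_t new_cpu)
{
	assert(pin != NULL);

	/* Leave the pin routed where it is on Hyper-V. */
	if (guest == MODEL_GUEST_HV)
		return (EINVAL);

	pin->cpu = new_cpu;
	return (0);
}
```

With this shape, a caller on Hyper-V gets EINVAL back and the pin stays on its original CPU, while other guests and bare metal can still migrate the pin.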

Disallow destination CPU changes for Hyper-V's I/O APIC. Suggested by: jhb

jhb edited edge metadata.
This revision is now accepted and ready to land. Sep 29 2016, 5:36 PM
This revision was automatically updated to reflect the committed changes.