Clang 10 -march=native kernels on znver1 emit BEXTR for APIC reads,
apparently. Decode and emulate the instruction.
Details
- Reviewers
grehan - Group Reviewers
bhyve - Commits
- rS360177: vmm(4): Decode and emulate BEXTR
Diff Detail
- Lint
Lint Passed - Unit
No Test Coverage - Build Status
Buildable 30555 Build 28300: arc lint + arc unit
Event Timeline
I’m torn. If it is actually fast on zen (and iirc, it is), it kinda makes sense. (On intel it is microcoded.) We (bhyve) pass through the BMI1 cpuid flag from host to guest. And it doesn’t use fpu registers, just GPRs.
I’m not 100% sure this implementation of the instruction is correct, though. My VM goes off the rails once the APIC code comes online (with the debug printing, it’s clearly hitting this path a lot; without debug prints, it spins all cores).
Does the mem / register access look plausibly correct? I did confirm gpr_map[] is needed, which seems a bit unfortunate given instruction encoding order. And multiple disassemblers confirmed the registers (reg=r14, rm=[rax], ~vvvv=ecx). My understanding is that gpa is precomputed for me and I don’t need to add displacement or compute it from regs. Is that right?
As a temporary optimization, I could hack the APIC code to disable BMI1 optimizations with some pragma and rebuild my VM; but that’s unappealing.
I'll have a closer look at the emulation. Might even break out Intel XED for some comparison testing.
My understanding is that gpa is precomputed for me and I don’t need to add displacement or compute it from regs. Is that right?
That's right.
I did some tests with this vs running BEXTR in user-space, and so far haven't been able to find any issues other than the above-mentioned one, which I'd imagine isn't hit by the APIC code.
What values are you seeing for start/len ?
sys/amd64/vmm/vmm_instruction_emul.c | ||
---|---|---|
1381 | The shift fails for the edge case of len = 64/start = 0. |
Thanks!
Thanks.
I did some tests with this vs running BEXTR in user-space, and so far haven't been able to find any issues other than the above-mentioned one, which I'd imagine isn't hit by the APIC code.
That's good, thanks.
What values are you seeing for start/len ?
I will have to add some debugging and get back to you. I know that the first access was a 32-bit one, so len would be at most 32.
sys/amd64/vmm/vmm_instruction_emul.c | ||
---|---|---|
1381 | Nice find, will fix. Not a case you would expect the compiler to emit BEXTR for, I think :-). |
Ok, in my testing, 100% of BEXTRs are 32-bit, start:16, len:8. This is the last few lines printed before what appears to be a poll-forever loop (3,565,378 instructions before I killed it to prevent the BEXTR log from getting too large):
MADT: Found IO APIC ID 0, Interrupt 0 at 0xfec00000 ioapic0: ver 0x11 maxredir 0x1f ioapic0: Routing external 8259A's -> intpin 0 MADT: Interrupt override: source 0, irq 2 ioapic0: Routing IRQ 0 -> intpin 2 MADT: Interrupt override: source 9, irq 9 ioapic0: intpin 9 trigger: level ioapic0: intpin 9 polarity: low lapic: Routing NMI -> LINT1 lapic: LINT1 trigger: edge lapic: LINT1 polarity: high ioapic0 <Version 1.1> irqs 0-31
The print is in ioapic_register() and immediately afterwards we do:
933 * Reprogram pins to handle special case pins (such as NMI and 934 * SMI) and disable normal pins until a handler is registered. 935 */ 936 intr_register_pic(&io->io_pic); 937 for (i = 0, pin = io->io_pins; i < io->io_numintr; i++, pin++) 938 ioapic_reprogram_intpin(&pin->io_intsrc);
Nothing in either of these functions obviously loops 3 million times, so maybe it is the caller (ACPI MADT), or something later in boot. Nothing in madt_setup_io looks loopy either, so not that.
Anyway, the value obtained from memory appears to be 0xffffffffffffffff every time. This might make sense if we aren't emulating PIC memory correctly in our userspace reads. I'm not sure if I need to use a different API than memread() or what. Will dig a little.
Next in boot we might expect to see:
ioapic0 <Version 2.0> irqs 0-23 on motherboard lapic: Divisor 2, Frequency 49887866 Hz lapic: deadline tsc mode, Frequency 2394617036 Hz cpu0 BSP: ID: 0x00000000 VER: 0x01060015 LDR: 0x00000001 DFR: 0x00000000 x2APIC: 1 lint0: 0x00010700 lint1: 0x00000400 TPR: 0x00000000 SVR: 0x000011ff timer: 0x000100ef therm: 0x00010000 err: 0x000000f0 pmc: 0x00010400 cmci: 0x000000f2 ...
(On the basis of https://dmesgd.nycbug.org/index.cgi?do=view&id=3980 .) No matter; I think the problem is the device memory emulation.
I'm not seeing how we ever register a MMIO region for LAPIC_PADDR or IOAPIC_PADDR. Do we?
I see; it seems like there is some additional instruction emulation specific to APIC accesses in sys/amd64/vmm/io, and those ranges aren't exposed to userspace's MMIO.
It's hard-coded as a special case in sys/amd64/vmm/vmm.c:vm_handle_inst_emul()
/* return to userland unless this is an in-kernel emulated device */ if (gpa >= DEFAULT_APIC_BASE && gpa < DEFAULT_APIC_BASE + PAGE_SIZE) { mread = lapic_mmio_read; mwrite = lapic_mmio_write;
Right: the in-kernel emulations aren't really set up for mixed-mode. For example, there are side-effects on reads to certain LAPIC registers.
From the testing I did with a hacked-up itest rig, the bextr code looks fine and could be dropped in to the kernel instruction emulation code. Any issues there could be picked up with KTR/Dtrace.
Yep. I've got portions of a patch to expose that up to userspace for the fallback emulation via a new ioctl and vmmapi, for D24464. Should have something soonish.
From the testing I did with a hacked-up itest rig, the bextr code looks fine and could be dropped in to the kernel instruction emulation code. Any issues there could be picked up with KTR/Dtrace.
Cool. I've got the len==64 fix in my local tree and will plan to go ahead and commit that. Thanks for all your help!