Page MenuHomeFreeBSD

vmm(4): Decode and emulate BEXTR
ClosedPublic

Authored by cem on Apr 17 2020, 4:46 AM.
Tags
None
Referenced Files
Unknown Object (File)
Mon, Jun 3, 12:32 AM
Unknown Object (File)
May 27 2024, 5:47 PM
Unknown Object (File)
Apr 15 2024, 2:58 AM
Unknown Object (File)
Feb 13 2024, 10:41 PM
Unknown Object (File)
Jan 6 2024, 11:11 PM
Unknown Object (File)
Jan 6 2024, 11:11 PM
Unknown Object (File)
Jan 6 2024, 11:11 PM
Unknown Object (File)
Jan 6 2024, 10:58 PM
Subscribers

Details

Reviewers
grehan
Group Reviewers
bhyve
Commits
rS360177: vmm(4): Decode and emulate BEXTR
Summary

Clang 10 -march=native kernels on znver1 emit BEXTR for APIC reads,
apparently. Decode and emulate the instruction.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

Clang 10 -march=native kernels on znver1 emit BEXTR for APIC reads

Ugh :(

This revision is now accepted and ready to land.Apr 17 2020, 5:14 AM

Clang 10 -march=native kernels on znver1 emit BEXTR for APIC reads

Ugh :(

I’m torn. If it is actually fast on zen (and iirc, it is), it kinda makes sense. (On intel it is microcoded.) We (bhyve) pass through the BMI1 cpuid flag from host to guest. And it doesn’t use fpu registers, just GPRs.

I’m not 100% sure this implementation of the instruction is correct, though. My VM goes off the rails once the APIC code comes online (with the debug printing, it’s clearly hitting this path a lot; without debug prints, it spins all cores).

Does the mem / register access look plausibly correct? I did confirm gpr_map[] is needed, which seems a bit unfortunate given instruction encoding order. And multiple disassemblers confirmed the registers (reg=r14, rm=[rax], ~vvvv=ecx). My understanding is that gpa is precomputed for me and I don’t need to add displacement or compute it from regs. Is that right?

As a temporary optimization, I could hack the APIC code to disable BMI1 optimizations with some pragma and rebuild my VM; but that’s unappealing.

I'll have a closer look at the emulation. Might even break out Intel XED for some comparison testing.

I'll have a closer look at the emulation. Might even break out Intel XED for some comparison testing.

Thanks!

My understanding is that gpa is precomputed for me and I don’t need to add displacement or compute it from regs. Is that right?

That's right.

I did some tests with this vs running BEXTR in user-space, and so far haven't been able to find any issues other than the above-mentioned one, which I'd imagine isn't hit by the APIC code.

What values are you seeing for start/len ?

sys/amd64/vmm/vmm_instruction_emul.c
1381 ↗(On Diff #70676)

The shift fails for the edge case of len = 64/start = 0.

I'll have a closer look at the emulation. Might even break out Intel XED for some comparison testing.

Thanks!

My understanding is that gpa is precomputed for me and I don’t need to add displacement or compute it from regs. Is that right?

That's right.

Thanks.

I did some tests with this vs running BEXTR in user-space, and so far haven't been able to find any issues other than the above-mentioned one, which I'd imagine isn't hit by the APIC code.

That's good, thanks.

What values are you seeing for start/len ?

I will have to add some debugging and get back to you. I know that the first access was a 32-bit one, so len would be at most 32.

sys/amd64/vmm/vmm_instruction_emul.c
1381 ↗(On Diff #70676)

Nice find, will fix.

Not a case you would expect the compiler to emit BEXTR for, I think :-).

Ok, in my testing, 100% of BEXTRs are 32-bit, start:16, len:8. This is the last few lines printed before what appears to be a poll-forever loop (3,565,378 instructions before I killed it to prevent the BEXTR log from getting too large):

MADT: Found IO APIC ID 0, Interrupt 0 at 0xfec00000
ioapic0: ver 0x11 maxredir 0x1f
ioapic0: Routing external 8259A's -> intpin 0
MADT: Interrupt override: source 0, irq 2
ioapic0: Routing IRQ 0 -> intpin 2
MADT: Interrupt override: source 9, irq 9
ioapic0: intpin 9 trigger: level
ioapic0: intpin 9 polarity: low
lapic: Routing NMI -> LINT1
lapic: LINT1 trigger: edge
lapic: LINT1 polarity: high
ioapic0 <Version 1.1> irqs 0-31

The print is in ioapic_register() and immediately afterwards we do:

933          * Reprogram pins to handle special case pins (such as NMI and
934          * SMI) and disable normal pins until a handler is registered.
935          */
936         intr_register_pic(&io->io_pic);
937         for (i = 0, pin = io->io_pins; i < io->io_numintr; i++, pin++)
938                 ioapic_reprogram_intpin(&pin->io_intsrc);

Nothing in either of these functions obviously loops 3 million times, so maybe it is the caller (ACPI MADT), or something later in boot. Nothing in madt_setup_io looks loopy either, so not that.

Anyway, the value obtained from memory appears to be 0xffffffffffffffff every time. This might make sense if we aren't emulating PIC memory correctly in our userspace reads. I'm not sure if I need to use a different API than memread() or what. Will dig a little.

Next in boot we might expect to see:

ioapic0 <Version 2.0> irqs 0-23 on motherboard
lapic: Divisor 2, Frequency 49887866 Hz
lapic: deadline tsc mode, Frequency 2394617036 Hz
cpu0 BSP:
     ID: 0x00000000   VER: 0x01060015 LDR: 0x00000001 DFR: 0x00000000 x2APIC: 1
  lint0: 0x00010700 lint1: 0x00000400 TPR: 0x00000000 SVR: 0x000011ff
  timer: 0x000100ef therm: 0x00010000 err: 0x000000f0 pmc: 0x00010400
   cmci: 0x000000f2
...

(On the basis of https://dmesgd.nycbug.org/index.cgi?do=view&id=3980 .) No matter; I think the problem is the device memory emulation.

I'm not seeing how we ever register a MMIO region for LAPIC_PADDR or IOAPIC_PADDR. Do we?

I see; it seems like there is some additional instruction emulation specific to APIC accesses in sys/amd64/vmm/io, and those ranges aren't exposed to userspace's MMIO.

In D24463#539489, @cem wrote:

I'm not seeing how we ever register a MMIO region for LAPIC_PADDR or IOAPIC_PADDR. Do we?

It's hard-coded as a special case in sys/amd64/vmm/vmm.c:vm_handle_inst_emul()

/* return to userland unless this is an in-kernel emulated device */
if (gpa >= DEFAULT_APIC_BASE && gpa < DEFAULT_APIC_BASE + PAGE_SIZE) {
        mread = lapic_mmio_read;
        mwrite = lapic_mmio_write;
In D24463#539490, @cem wrote:

I see; it seems like there is some additional instruction emulation specific to APIC accesses in sys/amd64/vmm/io, and those ranges aren't exposed to userspace's MMIO.

Right: the in-kernel emulations aren't really set up for mixed-mode. For example, there are side-effects on reads to certain LAPIC registers.

From the testing I did with a hacked-up itest rig, the bextr code looks fine and could be dropped in to the kernel instruction emulation code. Any issues there could be picked up with KTR/Dtrace.

cem marked an inline comment as done.Apr 21 2020, 8:58 PM
In D24463#539490, @cem wrote:

I see; it seems like there is some additional instruction emulation specific to APIC accesses in sys/amd64/vmm/io, and those ranges aren't exposed to userspace's MMIO.

Right: the in-kernel emulations aren't really set up for mixed-mode. For example, there are side-effects on reads to certain LAPIC registers.

Yep. I've got portions of a patch to expose that up to userspace for the fallback emulation via a new ioctl and vmmapi, for D24464. Should have something soonish.

From the testing I did with a hacked-up itest rig, the bextr code looks fine and could be dropped in to the kernel instruction emulation code. Any issues there could be picked up with KTR/Dtrace.

Cool. I've got the len==64 fix in my local tree and will plan to go ahead and commit that. Thanks for all your help!

This revision was automatically updated to reflect the committed changes.