Page MenuHomeFreeBSD

kinst: port to riscv
ClosedPublic

Authored by christos on Apr 30 2023, 2:17 PM.
Referenced Files
F108310231: D39884.id124166.diff
Thu, Jan 23, 6:37 PM
Unknown Object (File)
Wed, Jan 22, 6:41 AM
Unknown Object (File)
Sun, Jan 19, 3:26 AM
Unknown Object (File)
Sat, Jan 18, 3:49 PM
Unknown Object (File)
Wed, Jan 15, 1:56 PM
Unknown Object (File)
Thu, Jan 9, 5:17 PM
Unknown Object (File)
Thu, Jan 9, 5:16 PM
Unknown Object (File)
Thu, Jan 9, 5:05 PM
Subscribers

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

There are a very large number of changes, so older changes are hidden. Show Older Changes
sys/cddl/dev/kinst/riscv/kinst_isa.c
266
272

Where does that second breakpoint actually get written?

284

See my comment below about kinst_trampchunk_alloc(). On riscv, all kernel memory is mapped above KERNBASE, so the reference here to KERNBASE is a bit nonsensical.

479

I'd suggest llsc_block or so as a more accurate variable name.

523

I'd print a warning here if we terminate without encountering the SC instruction. That shouldn't happen, but it's not illegal in any sense.

534

The comment about amd64 in kinst_trampchunk_alloc() should be updated to reflect the handling of non-amd64 CPUs. The situation is simple for riscv: riscv's KERNBASE is just the base of the kernel's portion of the virtual memory map (not the case for amd64, which is special in that regard), and we don't particular care where in kernel memory trampolines are located.

jrtc27 added inline comments.
sys/cddl/dev/kinst/riscv/kinst_isa.c
472

Subword atomics have more than one instruction between the LR and SC, too, as does CAS's expansion. This is not robust.

sys/cddl/dev/kinst/riscv/kinst_isa.c
472

I don't quite follow - it looks to me as though the implementation handles a loop longer than one instruction. Certainly it would be a bug if it didn't.

483

Why are we subtracting previnstrsize here? Doesn't instr already point to the instruction we want to test?

This looks somewhat dodgy, if only because we're casting an unaligned pointer (uint8_t *) to an aligned pointer (uint32_t *).

484

I would suggest lifting the code which identifies LR/SC instructions into separate helper functions.

485

It makes more sense to test atomic first rather than last.

sys/cddl/dev/kinst/riscv/kinst_isa.c
460

Why uint8_t rather than char?

469

This may be unaligned

472

Never mind, I was skimming the code and hadn't got the right end of the stick

sys/cddl/dev/kinst/riscv/kinst_isa.c
272

The trampoline is initialized with EBREAKs, and because whenever we populate
the trampoline, we essentially write only the first 4 bytes of it, the
remaining 4 bytes will keep the EBREAK, so we only have to modify those first 4
in order to replace the copied instruction and the EBREAK stays in tact.

460

To be consistent with the use of uint*_t types in the rest of the code.

483

Why are we subtracting previnstrsize here? Doesn't instr already point to the instruction we want to test?

I'm testing the previous instruction, otherwise we'll exclude the instructions right before the SC one,
instead of excluding the SC as well.

This looks somewhat dodgy, if only because we're casting an unaligned pointer (uint8_t *) to an aligned pointer (uint32_t *).

Since we know that instructions are going to be either 2 or 4 bytes, is this really a problem?

523

I've got no particular objection to this, but what is the rationale behind it?

sys/cddl/dev/kinst/riscv/kinst_isa.c
272

It would be less fragile to assume that, and instead write the second EBREAK here explicitly, even if it's less redundant. We are otherwise relying on a debugging feature of code in a different file.

483

I'm testing the previous instruction, otherwise we'll exclude the instructions right before the SC one, instead of excluding the SC as well.

Hmm, ok. I find this rather hard to understand from reading the code.

Since we know that instructions are going to be either 2 or 4 bytes, is this really a problem?

I'm not sure. Is there a compressed variant of the SC instruction?

523

To potentially save time debugging in the case of a function with an unterminated LR/SC loop. In particular, a different function would presumably contain the corresponding SC instruction, and kinst will happily instrument it.

christos added inline comments.
sys/cddl/dev/kinst/riscv/kinst_isa.c
272

Alright, no problem.

483

Hmm, ok. I find this rather hard to understand from reading the code.

I know, I can add a comment explaining this.

I'm not sure. Is there a compressed variant of the SC instruction?

No.

523

Makes sense.

sys/cddl/dev/kinst/riscv/kinst_isa.c
483

That doesn’t mean the instruction is aligned to 32 bits

christos added inline comments.
sys/cddl/dev/kinst/riscv/kinst_isa.c
483

That doesn’t mean the instruction is aligned to 32 bits

I might be missing something. If it's not aligned, the CPU will raise a misaligned exception, so why do we need to care about this?

sys/cddl/dev/kinst/riscv/kinst_isa.c
483

When the C extension is present, 32-bit instructions need only be 16-bit aligned. Otherwise you’d have to nop-pad all over the place and seriously reduce the benefit of having compressed instructions.

christos marked an inline comment as done and an inline comment as not done.May 25 2023, 3:06 PM
christos added inline comments.
sys/cddl/dev/kinst/riscv/kinst_isa.c
483

I mean, is there a case where the current implementation might break? Can you share an example?

sys/cddl/dev/kinst/riscv/kinst_isa.c
483

Start with a 4-byte-aligned compressed instruction. A subsequent uncompressed instruction will not be 4-byte-aligned. But we are loading it as if it were. The right way to handle that is to copy the potentially unaligned instruction into a temporary variable on the stack, then test that temporary variable.

487

We don't have an equivalent to the amd64 code which tests whether the function starts by saving the frame pointer. For now I think we want such a test, similar to what FBT does.

549

Why do you need lrsc_once? Can't we simply test lrsc_block?

Use kinst_memcpy() and also add kinst_md_excluded().

Do not use kinst_memcpy outside of probe context.

christos added inline comments.
sys/cddl/dev/kinst/riscv/kinst_isa.c
549

lrsc_block stays true only if we do not hit an SC.

Check for the function prologue.

I've been doing some testing in QEMU again. I'm occasionally able to trigger random "impossible" panics by running dtrace -n 'kinst::vm_fault:' -q & and running various programs in the foreground. top(1), running tests, etc.. It's rather difficult to reproduce though. :(

sys/cddl/dev/kinst/riscv/kinst_isa.c
513

Sorry, I still don't understand why the test for SC has to be written this way. Why can't it look like this:

if (kinst_instr_is_lr(*insn)) {
    lrsc_block = true;
    goto next;
} else if (kinst_instr_is_sc(*insn)) {
    lrsc_block = false;
    goto next;
}

and at the end of the main loop, have:

if (lrsc_block)
    printf("warning: unterminated LL/SC block\n");

?

549

Yes, but that's the condition we're checking here...

I've been doing some testing in QEMU again. I'm occasionally able to trigger random "impossible" panics by running dtrace -n 'kinst::vm_fault:' -q & and running various programs in the foreground. top(1), running tests, etc.. It's rather difficult to reproduce though. :(

Fixing this doesn't fix the crashes, but kinst_trampoline_populate() is missing an i-fence.

christos added inline comments.
sys/cddl/dev/kinst/riscv/kinst_isa.c
513

This is indeed a better way to handle it, although it needed some modification to actually work. Thanks!

christos marked an inline comment as done.

Address comments.

christos retitled this revision from kinst: port to RISC-V to kinst: port to riscv.May 31 2023, 8:24 PM
sys/cddl/dev/kinst/riscv/kinst_isa.c
313

This test is subtly wrong. The code is testing whether SPIE was set at the time that the kernel executed the breakpoint. What we want to know is whether SIE - not SPIE - was set at the time that the kernel executed the breakpoint.

Suppose SIE is set, i.e., interrupts are enabled. Then, when the CPU executes a tracepoint, it will atomically clear SIE and set SPIE. So, since kinst_invop() wants to know if interrupts were enabled, it should be testing csr_read(sstatus) & SSTATUS_SPIE == 0. (And kinst_invop() should assert that csr_read(sstatus) & SSTATUS_SIE == 0.)

Testing frame->tf_sstatus tells you whether SPIE was set immediately before the breakpoint was executed. But that's not the information we're looking for.

325

If the bug above is corrected, this check will always be false. We execute trampolines with interrupts disabled, so when handling the second breakpoint, kinst_invop() will always be called with SPIE clear.

sys/cddl/dev/kinst/riscv/kinst_isa.c
313

Sorry: thinking a bit more, my report is incorrect. Upon an exception sstatus is saved by software, not by the CPU, so the saved SPIE reflects the correct value. Checking the current value of sstatus would work too, I think, but it wouldn't change anything.

christos marked 5 inline comments as done.

Address comments, fix undefined behavior when working with potentially unaligned pointers.

This provider can trigger what appears to be a QEMU/TCG bug. The test case:

  • boot a riscv VM with more than one vCPU
  • run dtrace -q -n 'kinst::riscv_cpu_intr:', i.e., trace all instructions in riscv_cpu_intr()

The kernel will usually crash within a minute, typically due to a page fault, though sometimes we also see freezes or symptoms of a lost interrupt (e.g., a thread waiting forever for an IPI to complete). Looking at the crashes, the trapping instruction is usually in a kinst trampoline, but the exception doesn't make sense; for instance, a page fault raised by an add. Or, some code crashes shortly after executing some tracepoints because its register file is corrupted somehow. When there is a single vCPU, the problem goes away. The problem is not tied to any particular probed instruction, but no crashes occur if there is only one tracepoint.

In the current design of kinst, there is one trampoline (executable memory used for single-stepping) per thread and per CPU, rather than one trampoline per probe. This means that trampolines are frequently being rewritten. In the test case above, only the per-CPU trampolines are used (since riscv_cpu_intr() is executed with interrupts disabled). This means that multiple CPUs are frequently writing code to a single page, which means that TCG is frequently trapping write accesses and rewriting blocks.

I eventually found that using one page per trampoline makes the problem go away even when multiple vCPUs are involved. From this I surmise that the problem is in QEMU rather than in kinst (not sure yet if this is riscv-specific). In particular, TCG handles self-modifying code by using mprotect() to trap writes, and it's easy to imagine races occurring if multiple CPUs are simultaneously writing/executing the same page. We need to dig into that, or change kinst's design to avoid tickling this problem, for instance by going back to per-probe trampolines[**]. But I think it's ok to land this patch in the meantime.

[**] This is actually easy to do on riscv since we already emulate all control flow instructions. All we need to do is 1) start allocating one trampoline per probe, 2) modify kinst_invop() to execute trampolines with interrupts disabled, 3) when trapping into kinst_invop() the second time, restore the old interrupt state. Then, in order to free a trampoline, we just need to raise an interrupt on all CPUs and wait for them to acknowledge it, i.e., call smp_rendezvous() with an empty handler.

sys/cddl/dev/kinst/riscv/kinst_isa.c
500

We should skip over ebreak and ebreak.c here. We should also skip over anything with the SYSTEM opcode prefix; those instructions modify system registers and can do things like enable or disable interrupts.

christos marked an inline comment as done.

Address comments, update LICENSE headers, split some changes to other PRs.

This revision is now accepted and ready to land.Jul 4 2023, 1:16 PM

Restore original trampoline size.

This revision now requires review to proceed.Jul 4 2023, 1:57 PM
This revision is now accepted and ready to land.Jul 4 2023, 2:37 PM
This revision now requires review to proceed.Jul 4 2023, 2:47 PM
This revision is now accepted and ready to land.Jul 4 2023, 2:48 PM
This revision was automatically updated to reflect the committed changes.

This provider can trigger what appears to be a QEMU/TCG bug. The test case:

  • boot a riscv VM with more than one vCPU
  • run dtrace -q -n 'kinst::riscv_cpu_intr:', i.e., trace all instructions in riscv_cpu_intr()

Seems I was a bit slow to report my findings, but I did observe some of the same kinds of panics on real hardware that I saw in QEMU. I will update to the committed version and provide a more detailed answer of what is still not working.

This provider can trigger what appears to be a QEMU/TCG bug. The test case:

  • boot a riscv VM with more than one vCPU
  • run dtrace -q -n 'kinst::riscv_cpu_intr:', i.e., trace all instructions in riscv_cpu_intr()

Seems I was a bit slow to report my findings, but I did observe some of the same kinds of panics on real hardware that I saw in QEMU. I will update to the committed version and provide a more detailed answer of what is still not working.

Oops, sorry, I hadn't realized you were still testing. Note that kinst.ko doesn't get loaded automatically on riscv for now, you have to manually load it.

This provider can trigger what appears to be a QEMU/TCG bug. The test case:

  • boot a riscv VM with more than one vCPU
  • run dtrace -q -n 'kinst::riscv_cpu_intr:', i.e., trace all instructions in riscv_cpu_intr()

Seems I was a bit slow to report my findings, but I did observe some of the same kinds of panics on real hardware that I saw in QEMU. I will update to the committed version and provide a more detailed answer of what is still not working.

Can you recompile kinst with this patch applied? https://reviews.freebsd.org/P592
If we run it with a single trampoline per page, the crashes seem to go away.

This provider can trigger what appears to be a QEMU/TCG bug. The test case:

  • boot a riscv VM with more than one vCPU
  • run dtrace -q -n 'kinst::riscv_cpu_intr:', i.e., trace all instructions in riscv_cpu_intr()

Seems I was a bit slow to report my findings, but I did observe some of the same kinds of panics on real hardware that I saw in QEMU. I will update to the committed version and provide a more detailed answer of what is still not working.

Can you recompile kinst with this patch applied? https://reviews.freebsd.org/P592
If we run it with a single trampoline per page, the crashes seem to go away.

Is it just a case of needing to smp_rendezvous a fence_i since it's not broadcast by the hardware like on arm64?

Is it just a case of needing to smp_rendezvous a fence_i since it's not broadcast by the hardware like on arm64?

arm64 has the same issue though.

This provider can trigger what appears to be a QEMU/TCG bug. The test case:

  • boot a riscv VM with more than one vCPU
  • run dtrace -q -n 'kinst::riscv_cpu_intr:', i.e., trace all instructions in riscv_cpu_intr()

Seems I was a bit slow to report my findings, but I did observe some of the same kinds of panics on real hardware that I saw in QEMU. I will update to the committed version and provide a more detailed answer of what is still not working.

Can you recompile kinst with this patch applied? https://reviews.freebsd.org/P592
If we run it with a single trampoline per page, the crashes seem to go away.

Is it just a case of needing to smp_rendezvous a fence_i since it's not broadcast by the hardware like on arm64?

I don't think so: in the problematic scenario we're writing instructions that will only be executed on the current CPU. In that case I'd expect a local fence.i to be sufficient.