⚙ D39884 kinst: port to riscv

markj added inline comments.May 24 2023, 3:35 PM

sys/cddl/dev/kinst/riscv/kinst_isa.c
266
272	Where does that second breakpoint actually get written?
284	See my comment below about kinst_trampchunk_alloc(). On riscv, all kernel memory is mapped above KERNBASE, so the reference here to KERNBASE is a bit nonsensical.
479	I'd suggest `llsc_block` or so as a more accurate variable name.
523	I'd print a warning here if we terminate without encountering the SC instruction. That shouldn't happen, but it's not illegal in any sense.
534	The comment about amd64 in kinst_trampchunk_alloc() should be updated to reflect the handling of non-amd64 CPUs. The situation is simple for riscv: riscv's KERNBASE is just the base of the kernel's portion of the virtual memory map (not the case for amd64, which is special in that regard), and we don't particular care where in kernel memory trampolines are located.

jrtc27 added a subscriber: jrtc27.May 24 2023, 3:39 PM

jrtc27 added inline comments.

sys/cddl/dev/kinst/riscv/kinst_isa.c
472	Subword atomics have more than one instruction between the LR and SC, too, as does CAS's expansion. This is not robust.

markj added inline comments.May 24 2023, 3:56 PM

sys/cddl/dev/kinst/riscv/kinst_isa.c
472	I don't quite follow - it looks to me as though the implementation handles a loop longer than one instruction. Certainly it would be a bug if it didn't.
483	Why are we subtracting `previnstrsize` here? Doesn't `instr` already point to the instruction we want to test? This looks somewhat dodgy, if only because we're casting an unaligned pointer (uint8_t ) to an aligned pointer (uint32_t ).
484	I would suggest lifting the code which identifies LR/SC instructions into separate helper functions.
485	It makes more sense to test `atomic` first rather than last.

jrtc27 added inline comments.May 24 2023, 4:04 PM

sys/cddl/dev/kinst/riscv/kinst_isa.c
460	Why uint8_t rather than char?
469	This may be unaligned
472	Never mind, I was skimming the code and hadn't got the right end of the stick

christos added inline comments.May 25 2023, 1:19 PM

sys/cddl/dev/kinst/riscv/kinst_isa.c
272	The trampoline is initialized with EBREAKs, and because whenever we populate the trampoline, we essentially write only the first 4 bytes of it, the remaining 4 bytes will keep the EBREAK, so we only have to modify those first 4 in order to replace the copied instruction and the EBREAK stays in tact.
460	To be consistent with the use of `uint*_t` types in the rest of the code.
483	Why are we subtracting previnstrsize here? Doesn't instr already point to the instruction we want to test? I'm testing the previous instruction, otherwise we'll exclude the instructions right before the SC one, instead of excluding the SC as well. This looks somewhat dodgy, if only because we're casting an unaligned pointer (uint8_t ) to an aligned pointer (uint32_t ). Since we know that instructions are going to be either 2 or 4 bytes, is this really a problem?
523	I've got no particular objection to this, but what is the rationale behind it?

markj added inline comments.May 25 2023, 1:32 PM

sys/cddl/dev/kinst/riscv/kinst_isa.c
272	It would be less fragile to assume that, and instead write the second EBREAK here explicitly, even if it's less redundant. We are otherwise relying on a debugging feature of code in a different file.
483	I'm testing the previous instruction, otherwise we'll exclude the instructions right before the SC one, instead of excluding the SC as well. Hmm, ok. I find this rather hard to understand from reading the code. Since we know that instructions are going to be either 2 or 4 bytes, is this really a problem? I'm not sure. Is there a compressed variant of the SC instruction?
523	To potentially save time debugging in the case of a function with an unterminated LR/SC loop. In particular, a different function would presumably contain the corresponding SC instruction, and kinst will happily instrument it.

christos marked 10 inline comments as done.May 25 2023, 1:39 PM

christos added inline comments.

sys/cddl/dev/kinst/riscv/kinst_isa.c
272	Alright, no problem.
483	Hmm, ok. I find this rather hard to understand from reading the code. I know, I can add a comment explaining this. I'm not sure. Is there a compressed variant of the SC instruction? No.
523	Makes sense.

jrtc27 added inline comments.May 25 2023, 2:36 PM

sys/cddl/dev/kinst/riscv/kinst_isa.c
483	That doesn’t mean the instruction is aligned to 32 bits

christos marked 3 inline comments as done.May 25 2023, 2:45 PM

christos added inline comments.

sys/cddl/dev/kinst/riscv/kinst_isa.c
483	That doesn’t mean the instruction is aligned to 32 bits I might be missing something. If it's not aligned, the CPU will raise a misaligned exception, so why do we need to care about this?

jrtc27 added inline comments.May 25 2023, 3:01 PM

sys/cddl/dev/kinst/riscv/kinst_isa.c
483	When the C extension is present, 32-bit instructions need only be 16-bit aligned. Otherwise you’d have to nop-pad all over the place and seriously reduce the benefit of having compressed instructions.

christos marked an inline comment as done and an inline comment as not done.May 25 2023, 3:06 PM

christos added inline comments.

sys/cddl/dev/kinst/riscv/kinst_isa.c
483	I mean, is there a case where the current implementation might break? Can you share an example?

christos marked 4 inline comments as done.May 25 2023, 3:33 PM

Address comments.

markj added inline comments.May 26 2023, 1:35 PM

sys/cddl/dev/kinst/riscv/kinst_isa.c
483	Start with a 4-byte-aligned compressed instruction. A subsequent uncompressed instruction will not be 4-byte-aligned. But we are loading it as if it were. The right way to handle that is to copy the potentially unaligned instruction into a temporary variable on the stack, then test that temporary variable.
487	We don't have an equivalent to the amd64 code which tests whether the function starts by saving the frame pointer. For now I think we want such a test, similar to what FBT does.
549	Why do you need `lrsc_once`? Can't we simply test `lrsc_block`?

Use kinst_memcpy() and also add kinst_md_excluded().

christos added a parent revision: D40284: kinst: fix memcpy() tracing crash.May 26 2023, 3:11 PM

christos removed a parent revision: D39229: kinst: add kinst_excluded().

Do not use kinst_memcpy outside of probe context.

christos removed a parent revision: D40284: kinst: fix memcpy() tracing crash.May 26 2023, 3:45 PM

christos marked an inline comment as done.May 30 2023, 3:28 PM

christos added inline comments.

sys/cddl/dev/kinst/riscv/kinst_isa.c
549	`lrsc_block` stays true only if we do not hit an SC.

Check for the function prologue.

christos added a child revision: D40337: kinst: port to arm64.May 30 2023, 3:40 PM

I've been doing some testing in QEMU again. I'm occasionally able to trigger random "impossible" panics by running dtrace -n 'kinst::vm_fault:' -q & and running various programs in the foreground. top(1), running tests, etc.. It's rather difficult to reproduce though. :(

sys/cddl/dev/kinst/riscv/kinst_isa.c
513	Sorry, I still don't understand why the test for SC has to be written this way. Why can't it look like this: if (kinst_instr_is_lr(insn)) { lrsc_block = true; goto next; } else if (kinst_instr_is_sc(insn)) { lrsc_block = false; goto next; } and at the end of the main loop, have: if (lrsc_block) printf("warning: unterminated LL/SC block\n"); ?
549	Yes, but that's the condition we're checking here...

In D39884#918563, @markj wrote:

I've been doing some testing in QEMU again. I'm occasionally able to trigger random "impossible" panics by running dtrace -n 'kinst::vm_fault:' -q & and running various programs in the foreground. top(1), running tests, etc.. It's rather difficult to reproduce though. :(

Fixing this doesn't fix the crashes, but kinst_trampoline_populate() is missing an i-fence.

christos marked 5 inline comments as done.May 31 2023, 7:31 PM

christos added inline comments.

sys/cddl/dev/kinst/riscv/kinst_isa.c
513	This is indeed a better way to handle it, although it needed some modification to actually work. Thanks!

Address comments.

christos retitled this revision from kinst: port to RISC-V to kinst: port to riscv.May 31 2023, 8:24 PM

markj added inline comments.Jun 25 2023, 10:15 PM

sys/cddl/dev/kinst/riscv/kinst_isa.c
313	This test is subtly wrong. The code is testing whether SPIE was set at the time that the kernel executed the breakpoint. What we want to know is whether SIE - not SPIE - was set at the time that the kernel executed the breakpoint. Suppose SIE is set, i.e., interrupts are enabled. Then, when the CPU executes a tracepoint, it will atomically clear SIE and set SPIE. So, since kinst_invop() wants to know if interrupts were enabled, it should be testing `csr_read(sstatus) & SSTATUS_SPIE == 0`. (And kinst_invop() should assert that `csr_read(sstatus) & SSTATUS_SIE == 0`.) Testing `frame->tf_sstatus` tells you whether SPIE was set immediately before the breakpoint was executed. But that's not the information we're looking for.
325	If the bug above is corrected, this check will always be false. We execute trampolines with interrupts disabled, so when handling the second breakpoint, kinst_invop() will always be called with SPIE clear.

markj added inline comments.Jun 25 2023, 10:39 PM

sys/cddl/dev/kinst/riscv/kinst_isa.c
313	Sorry: thinking a bit more, my report is incorrect. Upon an exception sstatus is saved by software, not by the CPU, so the saved SPIE reflects the correct value. Checking the current value of sstatus would work too, I think, but it wouldn't change anything.

Address comments, fix undefined behavior when working with potentially unaligned pointers.

This provider can trigger what appears to be a QEMU/TCG bug. The test case:

boot a riscv VM with more than one vCPU
run dtrace -q -n 'kinst::riscv_cpu_intr:', i.e., trace all instructions in riscv_cpu_intr()

The kernel will usually crash within a minute, typically due to a page fault, though sometimes we also see freezes or symptoms of a lost interrupt (e.g., a thread waiting forever for an IPI to complete). Looking at the crashes, the trapping instruction is usually in a kinst trampoline, but the exception doesn't make sense; for instance, a page fault raised by an add. Or, some code crashes shortly after executing some tracepoints because its register file is corrupted somehow. When there is a single vCPU, the problem goes away. The problem is not tied to any particular probed instruction, but no crashes occur if there is only one tracepoint.

In the current design of kinst, there is one trampoline (executable memory used for single-stepping) per thread and per CPU, rather than one trampoline per probe. This means that trampolines are frequently being rewritten. In the test case above, only the per-CPU trampolines are used (since riscv_cpu_intr() is executed with interrupts disabled). This means that multiple CPUs are frequently writing code to a single page, which means that TCG is frequently trapping write accesses and rewriting blocks.

I eventually found that using one page per trampoline makes the problem go away even when multiple vCPUs are involved. From this I surmise that the problem is in QEMU rather than in kinst (not sure yet if this is riscv-specific). In particular, TCG handles self-modifying code by using mprotect() to trap writes, and it's easy to imagine races occurring if multiple CPUs are simultaneously writing/executing the same page. We need to dig into that, or change kinst's design to avoid tickling this problem, for instance by going back to per-probe trampolines[**]. But I think it's ok to land this patch in the meantime.

[**] This is actually easy to do on riscv since we already emulate all control flow instructions. All we need to do is 1) start allocating one trampoline per probe, 2) modify kinst_invop() to execute trampolines with interrupts disabled, 3) when trapping into kinst_invop() the second time, restore the old interrupt state. Then, in order to free a trampoline, we just need to raise an interrupt on all CPUs and wait for them to acknowledge it, i.e., call smp_rendezvous() with an empty handler.

sys/cddl/dev/kinst/riscv/kinst_isa.c
500	We should skip over `ebreak` and `ebreak.c` here. We should also skip over anything with the SYSTEM opcode prefix; those instructions modify system registers and can do things like enable or disable interrupts.

Address comments, update LICENSE headers, split some changes to other PRs.

christos added a parent revision: D40783: dtraceall: remove kinst.Jul 4 2023, 9:39 AM

christos edited parent revisions, added: D40873: kinst: skip trampoline searching for riscv and arm64; removed: D40783: dtraceall: remove kinst.

Approved.

This revision is now accepted and ready to land.Jul 4 2023, 1:16 PM

Restore original trampoline size.

This revision now requires review to proceed.Jul 4 2023, 1:57 PM

markj accepted this revision.Jul 4 2023, 2:37 PM

This revision is now accepted and ready to land.Jul 4 2023, 2:37 PM

Retain CDDL license.

This revision now requires review to proceed.Jul 4 2023, 2:47 PM

markj accepted this revision.Jul 4 2023, 2:48 PM

This revision is now accepted and ready to land.Jul 4 2023, 2:48 PM

Closed by commit rG2d7bb03adb43: kinst: port to riscv (authored by christos). · Explain WhyJul 4 2023, 3:39 PM

This revision was automatically updated to reflect the committed changes.

christos added a commit: rG2d7bb03adb43: kinst: port to riscv.

In D39884#928340, @markj wrote:

This provider can trigger what appears to be a QEMU/TCG bug. The test case:

boot a riscv VM with more than one vCPU

run dtrace -q -n 'kinst::riscv_cpu_intr:', i.e., trace all instructions in riscv_cpu_intr()

Seems I was a bit slow to report my findings, but I did observe some of the same kinds of panics on real hardware that I saw in QEMU. I will update to the committed version and provide a more detailed answer of what is still not working.

In D39884#930225, @mhorne wrote:

In D39884#928340, @markj wrote:

This provider can trigger what appears to be a QEMU/TCG bug. The test case:

boot a riscv VM with more than one vCPU

run dtrace -q -n 'kinst::riscv_cpu_intr:', i.e., trace all instructions in riscv_cpu_intr()

Seems I was a bit slow to report my findings, but I did observe some of the same kinds of panics on real hardware that I saw in QEMU. I will update to the committed version and provide a more detailed answer of what is still not working.

Oops, sorry, I hadn't realized you were still testing. Note that kinst.ko doesn't get loaded automatically on riscv for now, you have to manually load it.

In D39884#930225, @mhorne wrote:

In D39884#928340, @markj wrote:

This provider can trigger what appears to be a QEMU/TCG bug. The test case:

boot a riscv VM with more than one vCPU

run dtrace -q -n 'kinst::riscv_cpu_intr:', i.e., trace all instructions in riscv_cpu_intr()

Seems I was a bit slow to report my findings, but I did observe some of the same kinds of panics on real hardware that I saw in QEMU. I will update to the committed version and provide a more detailed answer of what is still not working.

Can you recompile kinst with this patch applied? https://reviews.freebsd.org/P592
If we run it with a single trampoline per page, the crashes seem to go away.

In D39884#930234, @christos wrote:

In D39884#930225, @mhorne wrote:

In D39884#928340, @markj wrote:

This provider can trigger what appears to be a QEMU/TCG bug. The test case:

boot a riscv VM with more than one vCPU

run dtrace -q -n 'kinst::riscv_cpu_intr:', i.e., trace all instructions in riscv_cpu_intr()

Seems I was a bit slow to report my findings, but I did observe some of the same kinds of panics on real hardware that I saw in QEMU. I will update to the committed version and provide a more detailed answer of what is still not working.

Can you recompile kinst with this patch applied? https://reviews.freebsd.org/P592
If we run it with a single trampoline per page, the crashes seem to go away.

Is it just a case of needing to smp_rendezvous a fence_i since it's not broadcast by the hardware like on arm64?

In D39884#930242, @jrtc27 wrote:

Is it just a case of needing to smp_rendezvous a fence_i since it's not broadcast by the hardware like on arm64?

arm64 has the same issue though.

In D39884#930242, @jrtc27 wrote:

In D39884#930234, @christos wrote:

In D39884#930225, @mhorne wrote:

In D39884#928340, @markj wrote:

This provider can trigger what appears to be a QEMU/TCG bug. The test case:

boot a riscv VM with more than one vCPU

run dtrace -q -n 'kinst::riscv_cpu_intr:', i.e., trace all instructions in riscv_cpu_intr()

Seems I was a bit slow to report my findings, but I did observe some of the same kinds of panics on real hardware that I saw in QEMU. I will update to the committed version and provide a more detailed answer of what is still not working.

Can you recompile kinst with this patch applied? https://reviews.freebsd.org/P592
If we run it with a single trampoline per page, the crashes seem to go away.

Is it just a case of needing to smp_rendezvous a fence_i since it's not broadcast by the hardware like on arm64?

I don't think so: in the problematic scenario we're writing instructions that will only be executed on the current CPU. In that case I'd expect a local fence.i to be sufficient.

christos removed a child revision: D40337: kinst: port to arm64.Jul 10 2023, 8:43 PM

kinst: port to riscv
ClosedPublic
Actions

Details

Diff Detail

Event Timeline

Revision Contents
Changeset List

Diff 124173

sys/cddl/dev/kinst/riscv/kinst_isa.h

sys/cddl/dev/kinst/riscv/kinst_isa.c

sys/modules/dtrace/Makefile

kinst: port to riscvClosedPublicActions

Details

Diff Detail

Event Timeline

Revision ContentsChangeset List

Diff 124173

sys/cddl/dev/kinst/riscv/kinst_isa.h

sys/cddl/dev/kinst/riscv/kinst_isa.c

sys/modules/dtrace/Makefile

kinst: port to riscv
ClosedPublic
Actions

Revision Contents
Changeset List