Details
- Reviewers
markj mhorne gnn - Commits
- rG2d7bb03adb43: kinst: port to riscv
Diff Detail
- Repository
- rG FreeBSD src repository
- Lint
Lint Skipped - Unit
Tests Skipped
Event Timeline
sys/cddl/dev/kinst/riscv/kinst_isa.c | ||
---|---|---|
266 | ||
272 | Where does that second breakpoint actually get written? | |
284 | See my comment below about kinst_trampchunk_alloc(). On riscv, all kernel memory is mapped above KERNBASE, so the reference here to KERNBASE is a bit nonsensical. | |
479 | I'd suggest llsc_block or so as a more accurate variable name. | |
523 | I'd print a warning here if we terminate without encountering the SC instruction. That shouldn't happen, but it's not illegal in any sense. | |
534 | The comment about amd64 in kinst_trampchunk_alloc() should be updated to reflect the handling of non-amd64 CPUs. The situation is simple for riscv: riscv's KERNBASE is just the base of the kernel's portion of the virtual memory map (not the case for amd64, which is special in that regard), and we don't particular care where in kernel memory trampolines are located. |
sys/cddl/dev/kinst/riscv/kinst_isa.c | ||
---|---|---|
472 | Subword atomics have more than one instruction between the LR and SC, too, as does CAS's expansion. This is not robust. |
sys/cddl/dev/kinst/riscv/kinst_isa.c | ||
---|---|---|
472 | I don't quite follow - it looks to me as though the implementation handles a loop longer than one instruction. Certainly it would be a bug if it didn't. | |
483 | Why are we subtracting previnstrsize here? Doesn't instr already point to the instruction we want to test? This looks somewhat dodgy, if only because we're casting an unaligned pointer (uint8_t *) to an aligned pointer (uint32_t *). | |
484 | I would suggest lifting the code which identifies LR/SC instructions into separate helper functions. | |
485 | It makes more sense to test atomic first rather than last. |
sys/cddl/dev/kinst/riscv/kinst_isa.c | ||
---|---|---|
272 | The trampoline is initialized with EBREAKs, and because whenever we populate | |
460 | To be consistent with the use of uint*_t types in the rest of the code. | |
483 |
I'm testing the previous instruction, otherwise we'll exclude the instructions right before the SC one,
Since we know that instructions are going to be either 2 or 4 bytes, is this really a problem? | |
523 | I've got no particular objection to this, but what is the rationale behind it? |
sys/cddl/dev/kinst/riscv/kinst_isa.c | ||
---|---|---|
272 | It would be less fragile to assume that, and instead write the second EBREAK here explicitly, even if it's less redundant. We are otherwise relying on a debugging feature of code in a different file. | |
483 |
Hmm, ok. I find this rather hard to understand from reading the code.
I'm not sure. Is there a compressed variant of the SC instruction? | |
523 | To potentially save time debugging in the case of a function with an unterminated LR/SC loop. In particular, a different function would presumably contain the corresponding SC instruction, and kinst will happily instrument it. |
sys/cddl/dev/kinst/riscv/kinst_isa.c | ||
---|---|---|
483 | That doesn’t mean the instruction is aligned to 32 bits |
sys/cddl/dev/kinst/riscv/kinst_isa.c | ||
---|---|---|
483 |
I might be missing something. If it's not aligned, the CPU will raise a misaligned exception, so why do we need to care about this? |
sys/cddl/dev/kinst/riscv/kinst_isa.c | ||
---|---|---|
483 | When the C extension is present, 32-bit instructions need only be 16-bit aligned. Otherwise you’d have to nop-pad all over the place and seriously reduce the benefit of having compressed instructions. |
sys/cddl/dev/kinst/riscv/kinst_isa.c | ||
---|---|---|
483 | I mean, is there a case where the current implementation might break? Can you share an example? |
sys/cddl/dev/kinst/riscv/kinst_isa.c | ||
---|---|---|
483 | Start with a 4-byte-aligned compressed instruction. A subsequent uncompressed instruction will not be 4-byte-aligned. But we are loading it as if it were. The right way to handle that is to copy the potentially unaligned instruction into a temporary variable on the stack, then test that temporary variable. | |
487 | We don't have an equivalent to the amd64 code which tests whether the function starts by saving the frame pointer. For now I think we want such a test, similar to what FBT does. | |
549 | Why do you need lrsc_once? Can't we simply test lrsc_block? |
sys/cddl/dev/kinst/riscv/kinst_isa.c | ||
---|---|---|
549 | lrsc_block stays true only if we do not hit an SC. |
I've been doing some testing in QEMU again. I'm occasionally able to trigger random "impossible" panics by running dtrace -n 'kinst::vm_fault:' -q & and running various programs in the foreground. top(1), running tests, etc.. It's rather difficult to reproduce though. :(
sys/cddl/dev/kinst/riscv/kinst_isa.c | ||
---|---|---|
513 | Sorry, I still don't understand why the test for SC has to be written this way. Why can't it look like this: if (kinst_instr_is_lr(*insn)) { lrsc_block = true; goto next; } else if (kinst_instr_is_sc(*insn)) { lrsc_block = false; goto next; } and at the end of the main loop, have: if (lrsc_block) printf("warning: unterminated LL/SC block\n"); ? | |
549 | Yes, but that's the condition we're checking here... |
Fixing this doesn't fix the crashes, but kinst_trampoline_populate() is missing an i-fence.
sys/cddl/dev/kinst/riscv/kinst_isa.c | ||
---|---|---|
513 | This is indeed a better way to handle it, although it needed some modification to actually work. Thanks! |
sys/cddl/dev/kinst/riscv/kinst_isa.c | ||
---|---|---|
313 | This test is subtly wrong. The code is testing whether SPIE was set at the time that the kernel executed the breakpoint. What we want to know is whether SIE - not SPIE - was set at the time that the kernel executed the breakpoint. Suppose SIE is set, i.e., interrupts are enabled. Then, when the CPU executes a tracepoint, it will atomically clear SIE and set SPIE. So, since kinst_invop() wants to know if interrupts were enabled, it should be testing csr_read(sstatus) & SSTATUS_SPIE == 0. (And kinst_invop() should assert that csr_read(sstatus) & SSTATUS_SIE == 0.) Testing frame->tf_sstatus tells you whether SPIE was set immediately before the breakpoint was executed. But that's not the information we're looking for. | |
325 | If the bug above is corrected, this check will always be false. We execute trampolines with interrupts disabled, so when handling the second breakpoint, kinst_invop() will always be called with SPIE clear. |
sys/cddl/dev/kinst/riscv/kinst_isa.c | ||
---|---|---|
313 | Sorry: thinking a bit more, my report is incorrect. Upon an exception sstatus is saved by software, not by the CPU, so the saved SPIE reflects the correct value. Checking the current value of sstatus would work too, I think, but it wouldn't change anything. |
Address comments, fix undefined behavior when working with potentially unaligned pointers.
This provider can trigger what appears to be a QEMU/TCG bug. The test case:
- boot a riscv VM with more than one vCPU
- run dtrace -q -n 'kinst::riscv_cpu_intr:', i.e., trace all instructions in riscv_cpu_intr()
The kernel will usually crash within a minute, typically due to a page fault, though sometimes we also see freezes or symptoms of a lost interrupt (e.g., a thread waiting forever for an IPI to complete). Looking at the crashes, the trapping instruction is usually in a kinst trampoline, but the exception doesn't make sense; for instance, a page fault raised by an add. Or, some code crashes shortly after executing some tracepoints because its register file is corrupted somehow. When there is a single vCPU, the problem goes away. The problem is not tied to any particular probed instruction, but no crashes occur if there is only one tracepoint.
In the current design of kinst, there is one trampoline (executable memory used for single-stepping) per thread and per CPU, rather than one trampoline per probe. This means that trampolines are frequently being rewritten. In the test case above, only the per-CPU trampolines are used (since riscv_cpu_intr() is executed with interrupts disabled). This means that multiple CPUs are frequently writing code to a single page, which means that TCG is frequently trapping write accesses and rewriting blocks.
I eventually found that using one page per trampoline makes the problem go away even when multiple vCPUs are involved. From this I surmise that the problem is in QEMU rather than in kinst (not sure yet if this is riscv-specific). In particular, TCG handles self-modifying code by using mprotect() to trap writes, and it's easy to imagine races occurring if multiple CPUs are simultaneously writing/executing the same page. We need to dig into that, or change kinst's design to avoid tickling this problem, for instance by going back to per-probe trampolines[**]. But I think it's ok to land this patch in the meantime.
[**] This is actually easy to do on riscv since we already emulate all control flow instructions. All we need to do is 1) start allocating one trampoline per probe, 2) modify kinst_invop() to execute trampolines with interrupts disabled, 3) when trapping into kinst_invop() the second time, restore the old interrupt state. Then, in order to free a trampoline, we just need to raise an interrupt on all CPUs and wait for them to acknowledge it, i.e., call smp_rendezvous() with an empty handler.
sys/cddl/dev/kinst/riscv/kinst_isa.c | ||
---|---|---|
499 | We should skip over ebreak and ebreak.c here. We should also skip over anything with the SYSTEM opcode prefix; those instructions modify system registers and can do things like enable or disable interrupts. |
Seems I was a bit slow to report my findings, but I did observe some of the same kinds of panics on real hardware that I saw in QEMU. I will update to the committed version and provide a more detailed answer of what is still not working.
Oops, sorry, I hadn't realized you were still testing. Note that kinst.ko doesn't get loaded automatically on riscv for now, you have to manually load it.
Can you recompile kinst with this patch applied? https://reviews.freebsd.org/P592
If we run it with a single trampoline per page, the crashes seem to go away.
Is it just a case of needing to smp_rendezvous a fence_i since it's not broadcast by the hardware like on arm64?
I don't think so: in the problematic scenario we're writing instructions that will only be executed on the current CPU. In that case I'd expect a local fence.i to be sufficient.