
riscv: Permit spurious faults in kernel mode
Status: Closed (Public)

Authored by markj on Nov 20 2024, 6:23 PM.
Details

Summary

Right now, pmap_enter() does not issue an sfence.vma after overwriting
an invalid PTE, so the kernel can trigger a page fault when accessing a
freshly created mapping. In that case, pmap_fault() can handle the
exception, but we may panic before reaching it. Move the pmap_fault()
check earlier in the trap handler; this is consistent with arm64 and
ensures that we don't call vm_fault() etc. from a context where that's
not permitted.

Also fix a related bug: don't enable interrupts if they were disabled in
the context where the exception occurred.
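For reference, here is a heavily condensed sketch of the kernel-fault ordering described above. It is loosely modeled on page_fault_handler() in sys/riscv/riscv/trap.c; the function below and its argument list are hypothetical simplifications, not the actual diff.

static void
kernel_fault_sketch(struct trapframe *frame, vm_map_t map, vm_offset_t va,
    vm_prot_t ftype)
{
        /*
         * The related fix: only re-enable interrupts if the interrupted
         * context had them enabled, as recorded in sstatus.SPIE.
         */
        if ((frame->tf_sstatus & SSTATUS_SPIE) != 0)
                intr_enable();

        /*
         * Resolve a possible spurious fault first.  pmap_fault() only
         * re-examines the page tables and issues an sfence.vma, so it
         * is safe even in contexts where vm_fault() is not.
         */
        if (pmap_fault(map->pmap, va, ftype))
                return;

        /*
         * Only a real fault is an error in a critical section, since
         * vm_fault_trap() acquires locks and may sleep.
         */
        KASSERT(curthread->td_critnest == 0,
            ("kernel page fault in critical section"));

        if (vm_fault_trap(map, va, ftype, VM_FAULT_NORMAL, NULL, NULL) !=
            KERN_SUCCESS)
                panic("unhandled kernel page fault at %#lx", (u_long)va);
}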

Diff Detail

Repository: rG FreeBSD src repository

Event Timeline

markj requested review of this revision. Nov 20 2024, 6:23 PM

Presumably in practice the td_critnest check meant the interrupt enabling didn't matter, as spinlock_enter increments it? I assume there aren't many places where we disable interrupts without also being inside a critical section, so it would only be a problem if that sequence of code itself took a spurious fault.

Also, perhaps worth pointing out arm64 does an unconditional intr_enable for the kernel too...

And amd64 has:

if ((frame->tf_rflags & PSL_I) == 0) {
        /*
         * Buggy application or kernel code has disabled
         * interrupts and then trapped.  Enabling interrupts
         * now is wrong, but it is better than running with
         * interrupts disabled until they are accidentally
         * enabled later.
         */
        if (TRAPF_USERMODE(frame)) {
                uprintf(
                    "pid %ld (%s): trap %d (%s) "
                    "with interrupts disabled\n",
                    (long)curproc->p_pid, curthread->td_name, type,
                    trap_msg[type]);
        } else {
                switch (type) {
                case T_NMI:
                case T_BPTFLT:
                case T_TRCTRAP:
                case T_PROTFLT:
                case T_SEGNPFLT:
                case T_STKFLT:
                        break;
                default:
                        printf(
                            "kernel trap %d with interrupts disabled\n",
                            type);

                        /*
                         * We shouldn't enable interrupts while holding a
                         * spin lock.
                         */
                        if (td->td_md.md_spinlock_count == 0)
                                enable_intr();
                }
        }
}

So I think the expectation is really that this shouldn't happen. Whether that's true or not I don't know :)

> Presumably in practice the td_critnest check meant the interrupt enabling didn't matter, as spinlock_enter increments it? I assume there aren't many places where we disable interrupts without also being inside a critical section, so it would only be a problem if that sequence of code itself took a spurious fault.

spinlock_enter() increments td_critnest, but so does critical_enter(), which doesn't disable interrupts. The point of the critnest check is, I think, to make sure we don't do anything which might involve rescheduling the current thread. In particular, it's invalid to call vm_fault() (which acquires locks and might sleep) with critnest > 0 (which means, don't reschedule the current thread).

amd64 uses spinlock_count > 0 as a proxy for "are interrupts disabled?", which is close enough in practice, but I don't know why it doesn't just check IF.
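To make the distinction concrete, here is a condensed sketch of the two primitives (the real critical_enter() lives in the MI scheduler code and spinlock_enter() in riscv's machdep.c; barriers and scheduler bookkeeping are elided):

void
critical_enter(void)
{
        /* Forbid rescheduling of this thread; interrupts stay enabled. */
        curthread->td_critnest++;
}

void
spinlock_enter(void)
{
        struct thread *td = curthread;

        if (td->td_md.md_spinlock_count == 0) {
                /* Disable interrupts *and* enter a critical section. */
                td->td_md.md_saved_sstatus_ie = intr_disable();
                td->td_md.md_spinlock_count = 1;
                critical_enter();
        } else
                td->td_md.md_spinlock_count++;
}

So td_critnest > 0 does not imply that interrupts are disabled, while md_spinlock_count > 0 does, which is why amd64 can use the latter as a proxy.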

> Also, perhaps worth pointing out arm64 does an unconditional intr_enable for the kernel too...

Until recently yes, but not anymore.

Works for me on SiFive Premier P550 (no hacky TLB flush needed anymore). Thanks!

This revision is now accepted and ready to land. Nov 21 2024, 10:08 AM

Perhaps worth mentioning in the commit message that RISC-V is unusual in allowing the TLB to cache invalid PTEs

sys/riscv/riscv/trap.c:287

Should this not be before vm_fault_trap?

markj marked an inline comment as done.

Move the check to the correct place, i.e., before the vm_fault() call.

This revision now requires review to proceed. Nov 21 2024, 4:46 PM

> Perhaps worth mentioning in the commit message that RISC-V is unusual in allowing the TLB to cache invalid PTEs

I'm not sure that that's exactly what's happening - it might be that the core is caching translation structures (and this is certainly not specific to riscv) and requires an sfence.vma for PTE updates to be visible.

In fact, we might still want to issue a local sfence.vma when overwriting an invalid PTE in pmap_enter(), to avoid the overhead of spurious faults on first access. It's hard to say without some data on how common such faults are.
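If that optimization were pursued, it might look something like the sketch below. pmap_load_store(), PTE_V, and sfence_vma_page() are existing riscv names; the helper itself and its placement in pmap_enter() are hypothetical.

/* Hypothetical helper: install a leaf PTE, fencing if it was invalid. */
static void
pmap_store_fenced(pt_entry_t *l3, pt_entry_t new_l3, vm_offset_t va)
{
        pt_entry_t orig_l3;

        orig_l3 = pmap_load_store(l3, new_l3);
        if ((orig_l3 & PTE_V) == 0)
                /* Local flush only; no IPI or remote shootdown needed. */
                sfence_vma_page(va);
}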

>> Perhaps worth mentioning in the commit message that RISC-V is unusual in allowing the TLB to cache invalid PTEs

> I'm not sure that that's exactly what's happening - it might be that the core is caching translation structures (and this is certainly not specific to riscv) and requires an sfence.vma for PTE updates to be visible.

> In fact, we might still want to issue a local sfence.vma when overwriting an invalid PTE in pmap_enter(), to avoid the overhead of spurious faults on first access. It's hard to say without some data on how common such faults are.

I managed to count the number of calls to pmap_fault() from the moment the login prompt appears, through typing "root", to a fully loaded shell:
~1500 calls to pmap_fault() with the pmap_enter() patch and ~2000 without.
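For what it's worth, one simple way to gather such numbers is a read-only sysctl counter bumped in pmap_fault(); the instrumentation below is hypothetical and not part of the patch.

/* In sys/riscv/riscv/pmap.c (hypothetical instrumentation): */
static u_long pmap_fault_calls;
SYSCTL_ULONG(_vm_pmap, OID_AUTO, fault_calls, CTLFLAG_RD,
    &pmap_fault_calls, 0, "Number of pmap_fault() invocations");

/* At the top of pmap_fault(): */
        atomic_add_long(&pmap_fault_calls, 1);

The counter can then be sampled with "sysctl vm.pmap.fault_calls" before and after the scenario being measured.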

This revision was not accepted when it landed; it landed in state Needs Review. Tue, Dec 10, 3:10 PM
This revision was automatically updated to reflect the committed changes.