
riscv: Permit spurious faults in kernel mode
Status: Closed (Public)

Authored by markj on Nov 20 2024, 6:23 PM.
Details

Summary

Right now, pmap_enter() does not issue an sfence.vma after overwriting
an invalid PTE, so the kernel can trigger a page fault when accessing a
freshly created mapping. In that case, pmap_fault() can handle the
exception, but we may panic before reaching it. Move the pmap_fault()
check earlier in the trap handler; this is consistent with arm64 and
ensures that we don't call vm_fault() etc. from a context where that's
not permitted.

Also fix a related bug: don't enable interrupts if they were disabled in
the context where the exception occurred.
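For reference, here is a heavily condensed sketch of the kernel-fault ordering described above. It is loosely modeled on page_fault_handler() in sys/riscv/riscv/trap.c; the function below and its argument list are hypothetical simplifications, not the actual diff.

static void
kernel_fault_sketch(struct trapframe *frame, vm_map_t map, vm_offset_t va,
    vm_prot_t ftype)
{
        /*
         * The related fix: only re-enable interrupts if the interrupted
         * context had them enabled, as recorded in sstatus.SPIE.
         */
        if ((frame->tf_sstatus & SSTATUS_SPIE) != 0)
                intr_enable();

        /*
         * Resolve a possible spurious fault first.  pmap_fault() only
         * re-examines the page tables and issues an sfence.vma, so it
         * is safe even in contexts where vm_fault() is not.
         */
        if (pmap_fault(map->pmap, va, ftype))
                return;

        /*
         * Only a real fault is an error in a critical section, since
         * vm_fault_trap() acquires locks and may sleep.
         */
        KASSERT(curthread->td_critnest == 0,
            ("kernel page fault in critical section"));

        if (vm_fault_trap(map, va, ftype, VM_FAULT_NORMAL, NULL, NULL) !=
            KERN_SUCCESS)
                panic("unhandled kernel page fault at %#lx", (u_long)va);
}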

Diff Detail

Repository: rG FreeBSD src repository

Event Timeline

markj requested review of this revision. Nov 20 2024, 6:23 PM

Presumably in practice the td_critnest check meant the interrupt enabling didn't matter, as spinlock_enter increments it? I assume there aren't many places where we disable interrupts without also being inside a critical section, so it would only be a problem if that sequence of code itself took a spurious fault.

Also, perhaps worth pointing out arm64 does an unconditional intr_enable for the kernel too...

And amd64 has:

if ((frame->tf_rflags & PSL_I) == 0) {
        /*
         * Buggy application or kernel code has disabled
         * interrupts and then trapped.  Enabling interrupts
         * now is wrong, but it is better than running with
         * interrupts disabled until they are accidentally
         * enabled later.
         */
        if (TRAPF_USERMODE(frame)) {
                uprintf(
                    "pid %ld (%s): trap %d (%s) "
                    "with interrupts disabled\n",
                    (long)curproc->p_pid, curthread->td_name, type,
                    trap_msg[type]);
        } else {
                switch (type) {
                case T_NMI:
                case T_BPTFLT:
                case T_TRCTRAP:
                case T_PROTFLT:
                case T_SEGNPFLT:
                case T_STKFLT:
                        break;
                default:
                        printf(
                            "kernel trap %d with interrupts disabled\n",
                            type);

                        /*
                         * We shouldn't enable interrupts while holding a
                         * spin lock.
                         */
                        if (td->td_md.md_spinlock_count == 0)
                                enable_intr();
                }
        }
}

So I think the expectation is really that this shouldn't happen. Whether that's true or not I don't know :)

> Presumably in practice the td_critnest check meant the interrupt enabling didn't matter, as spinlock_enter increments it? I assume there aren't many places where we disable interrupts without also being inside a critical section, so it would only be a problem if that sequence of code itself took a spurious fault.

spinlock_enter() increments td_critnest, but so does critical_enter(), which doesn't disable interrupts. The point of the critnest check is, I think, to make sure we don't do anything which might involve rescheduling the current thread. In particular, it's invalid to call vm_fault() (which acquires locks and might sleep) with critnest > 0 (which means, don't reschedule the current thread).

amd64 uses spinlock_count > 0 as a proxy for "are interrupts disabled?", which is close enough in practice, but I don't know why it doesn't just check IF.
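To make the distinction concrete, here is a condensed sketch of the two primitives (the real critical_enter() lives in the MI scheduler code and spinlock_enter() in riscv's machdep.c; barriers and scheduler bookkeeping are elided):

void
critical_enter(void)
{
        /* Forbid rescheduling of this thread; interrupts stay enabled. */
        curthread->td_critnest++;
}

void
spinlock_enter(void)
{
        struct thread *td = curthread;

        if (td->td_md.md_spinlock_count == 0) {
                /* Disable interrupts *and* enter a critical section. */
                td->td_md.md_saved_sstatus_ie = intr_disable();
                td->td_md.md_spinlock_count = 1;
                critical_enter();
        } else
                td->td_md.md_spinlock_count++;
}

So td_critnest > 0 does not imply that interrupts are disabled, while md_spinlock_count > 0 does, which is why amd64 can use the latter as a proxy.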

> Also, perhaps worth pointing out arm64 does an unconditional intr_enable for the kernel too...

Until recently yes, but not anymore.

Works for me on SiFive Premier P550 (no hacky TLB flush needed anymore). Thanks!

This revision is now accepted and ready to land. Nov 21 2024, 10:08 AM

Perhaps worth mentioning in the commit message that RISC-V is unusual in allowing the TLB to cache invalid PTEs

sys/riscv/riscv/trap.c:287

Should this not be before vm_fault_trap?

markj marked an inline comment as done.

Move the check to the correct place, i.e., before the vm_fault() call.

This revision now requires review to proceed. Nov 21 2024, 4:46 PM

> Perhaps worth mentioning in the commit message that RISC-V is unusual in allowing the TLB to cache invalid PTEs

I'm not sure that that's exactly what's happening - it might be that the core is caching translation structures (and this is certainly not specific to riscv) and requires an sfence.vma for PTE updates to be visible.

In fact, we might still want to issue a local sfence.vma when overwriting an invalid PTE in pmap_enter(), to avoid the overhead of spurious faults on first access. It's hard to say without some data on how common such faults are.
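If that optimization were pursued, it might look something like the sketch below. pmap_load_store(), PTE_V, and sfence_vma_page() are existing riscv names; the helper itself and its placement in pmap_enter() are hypothetical.

/* Hypothetical helper: install a leaf PTE, fencing if it was invalid. */
static void
pmap_store_fenced(pt_entry_t *l3, pt_entry_t new_l3, vm_offset_t va)
{
        pt_entry_t orig_l3;

        orig_l3 = pmap_load_store(l3, new_l3);
        if ((orig_l3 & PTE_V) == 0)
                /* Local flush only; no IPI or remote shootdown needed. */
                sfence_vma_page(va);
}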

>> Perhaps worth mentioning in the commit message that RISC-V is unusual in allowing the TLB to cache invalid PTEs

> I'm not sure that that's exactly what's happening - it might be that the core is caching translation structures (and this is certainly not specific to riscv) and requires an sfence.vma for PTE updates to be visible.

> In fact, we might still want to issue a local sfence.vma when overwriting an invalid PTE in pmap_enter(), to avoid the overhead of spurious faults on first access. It's hard to say without some data on how common such faults are.

I managed to count the number of calls to pmap_fault() from the moment the login prompt appears, through typing "root", to a fully loaded shell:
~1500 calls to pmap_fault() with the pmap_enter() patch and ~2000 without.
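For what it's worth, one simple way to gather such numbers is a read-only sysctl counter bumped in pmap_fault(); the instrumentation below is hypothetical and not part of the patch.

/* In sys/riscv/riscv/pmap.c (hypothetical instrumentation): */
static u_long pmap_fault_calls;
SYSCTL_ULONG(_vm_pmap, OID_AUTO, fault_calls, CTLFLAG_RD,
    &pmap_fault_calls, 0, "Number of pmap_fault() invocations");

/* At the top of pmap_fault(): */
        atomic_add_long(&pmap_fault_calls, 1);

The counter can then be sampled with "sysctl vm.pmap.fault_calls" before and after the scenario being measured.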

This revision was not accepted when it landed; it landed in state Needs Review. Tue, Dec 10, 3:10 PM
This revision was automatically updated to reflect the committed changes.