Details

Reviewers

jrtc27
markj
mhorne

Group Reviewers

riscv

Commits

rG7a5ef20a977c: riscv/pmap.c: Handle spurious kernel page faults in critical sections

Summary

The Privileged ISA specification permits caching of invalid PTEs
12.2.1. Supervisor Memory-Management Fence Instruction), which may
result in a spurious page fault. Such faults are handled by 'pmap_fault'
which locks the kernel pmap before inspecting and possibly updating the
offending L2 entry.
Unfortunately, spurious faults may also occur in non-sleepable
contexts (e.g., while holding a spinlock) where any attempt to grab
the pmap lock will result in a kernel panic.

Fix this avoidable panic by preemptively flushing the appropriate TLB
entry when inserting new PTEs into the kernel pmap.

Test Plan

This type of spurious fault was consistently observed on each boot during my initial attempts to boot -CURRENT on a Banana Pi BPI-F3 board.
Every boot would inevitably hit the same panic in vtbuf_grow():

panic: acquiring blockable sleep lock with spinlock or critical section held (sleep mutex) pmap @ /usr/src/sys/riscv/riscv/pmap.c:2944
cpuid = 0
time = 1
KDB: stack backtrace:
db_trace_self() at db_trace_self
db_trace_self_wrapper() at db_trace_self_wrapper+0x36
kdb_backtrace() at kdb_backtrace+0x2c
vpanic() at vpanic+0x16e
panic() at panic+0x26
witness_checkorder() at witness_checkorder+0xb70
__mtx_lock_flags() at __mtx_lock_flags+0x86
pmap_fault() at pmap_fault+0x4e
page_fault_handler() at page_fault_handler+0x11e
do_trap_supervisor() at do_trap_supervisor+0x6c
cpu_exception_handler_supervisor() at cpu_exception_handler_supervisor+0x74
--- exception 15, tval = 0xffffffc05d4263b8
memmove() at memmove+0xd6
vtbuf_grow() at vtbuf_grow+0x256
vt_change_font() at vt_change_font+0xfc
vt_resize() at vt_resize+0x88
vt_upgrade() at vt_upgrade+0x56a
mi_startup() at mi_startup+0x1e6
va() at va+0x60
KDB: enter: panic
[ thread pid 0 tid 100000 ]
Stopped at      kdb_enter+0x3a: sd      zero,316(s1)
db>

vtbuf_grow holds a spinlock while growing the buffer and consistently triggers a spurious page fault while copying the old rows, resulting in the panic listed above.
Applying this patch and D57003 fixes the boot sequence:

VT(efifb): resolution 1920x1080
SBI: OpenSBI v1.3
SBI Specification Version: 1.0
CPU 0  : Vendor=SpacemiT Core=SpacemiT(R) X60 (Hart 0)
  marchid=0x8000000058000001, mimpid=0x1000000049772200
  MMU: 0x1<Sv39>
  ISA: 0x112f<Atomic,Compressed,Double,Float,Mult/Div>
  S-mode Extensions: 0x1f<Sstc,Svnapot,Svpbmt,Svinval,Sscofpmf>
[[snip]]
sbi0: <RISC-V Supervisor Binary Interface>
cpulist0: <Open Firmware CPU Group> on ofwbus0
cpu0: <Open Firmware CPU> on cpulist0
intc0: <RISC-V Local Interrupt Controller> on ofwbus0
sbi_ipi0: <RISC-V SBI Inter-Processor Interrupts> on sbi0
plic0: <RISC-V PLIC> mem 0xe0000000-0xe3ffffff irq 9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24 on simplebus0
timer0: <RISC-V Timer>
Timecounter "RISC-V Timecounter" frequency 24000000 Hz quality 1000
Event timer "RISC-V Eventtimer" frequency 24000000 Hz quality 1000
rcons0: <RISC-V console>
Timecounters tick every 1.000 msec
usb_needs_explore_all: no devclass
sbi_ipi0: using for IPIs
Release APs
[[snip]]
mountroot>

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

bnovkov created this revision.May 10 2026, 4:21 PM

Herald added a subscriber: imp. · View Herald TranscriptMay 10 2026, 4:21 PM

bnovkov requested review of this revision.May 10 2026, 4:21 PM

Harbormaster completed remote builds in B72974: Diff 177547.May 10 2026, 4:21 PM

bnovkov added a parent revision: D56924: riscv/pmap.c: Add a ddb command for inspecting PTE hierachy.May 10 2026, 4:21 PM

bnovkov edited the test plan for this revision. (Show Details)May 10 2026, 4:31 PM

The Vector spec does not require any ordering on the memory accesses performed, so a legal implementation could alternate which address faults and this would never make forward progress.

But also, I don't think we should be doing anything that would allow spurious kernel faults like this in the first place? For userspace, sure, but for the kernel?

This revision now requires changes to proceed.May 10 2026, 4:32 PM

jrtc27 added a reviewer: markj.May 10 2026, 4:33 PM

In D56925#1303888, @jrtc27 wrote:

The Vector spec does not require any ordering on the memory accesses performed, so a legal implementation could alternate which address faults and this would never make forward progress.

But also, I don't think we should be doing anything that would allow spurious kernel faults like this in the first place? For userspace, sure, but for the kernel?

Well, e.g., pmap_enter() doesn't issue sfence_vma() after installing the new PTE, so Bojan's explanation seems plausible. Does the panic go away if we do that instead, something like the patch below?

diff --git a/sys/riscv/riscv/pmap.c b/sys/riscv/riscv/pmap.c
index 1f3fd8c24b81..a74d99d5646b 100644
--- a/sys/riscv/riscv/pmap.c
+++ b/sys/riscv/riscv/pmap.c
@@ -3466,6 +3466,8 @@ pmap_enter(pmap_t pmap, vm_offset_t va, vm_page_t m, vm_prot_t prot,
                        vm_page_dirty(m);
        } else {
                pmap_store(l3, new_l3);
+               if (pmap == kernel_pmap)
+                       sfence_vma(); /* only updates the local TLB */
        }
 
 #if VM_NRESERVLEVEL > 0

Alternately, we can try to make pmap_fault() avoid acquiring the lock for the kernel pmap, see how it uses pmap_klookup() on arm64 to deal with a similar problem.

In D56925#1303888, @jrtc27 wrote:

The Vector spec does not require any ordering on the memory accesses performed, so a legal implementation could alternate which address faults and this would never make forward progress.

Thank you for pointing this out, I wasn't aware of that limitation.

In D56925#1303898, @markj wrote:

Well, e.g., pmap_enter() doesn't issue sfence_vma() after installing the new PTE, so Bojan's explanation seems plausible. Does the panic go away if we do that instead, something like the patch below?
Alternately, we can try to make pmap_fault() avoid acquiring the lock for the kernel pmap, see how it uses pmap_klookup() on arm64 to deal with a similar problem.

This does fix the vtbuf panic, but I immediately run into a similar thing when the APs are launched:

panic: mtx_lock() by idle thread 0xffffffc05ce01140 on mutex 0xffffffc000bd8c80 @ /usr/src/sys/riscv/riscv/pmap.c:2942
cpuid = 2
time = 1
KDB: stack backtrace:
db_trace_self() at db_trace_self
db_trace_self_wrapper() at db_trace_self_wrapper+0x36
kdb_backtrace() at kdb_backtrace+0x2c
vpanic() at vpanic+0x134
panic() at panic+0x26
$x() at $x+0x6e
pmap_fault() at pmap_fault+0x52
page_fault_handler() at page_fault_handler+0x11e
do_trap_supervisor() at do_trap_supervisor+0x6c
cpu_exception_handler_supervisor() at cpu_exception_handler_supervisor+0x74
--- exception 13, tval = 0xffffffc002719210
cpu_search_highest() at cpu_search_highest+0xdc
sched_ule_idletd() at sched_ule_idletd+0x154
fork_exit() at fork_exit+0x68
fork_trampoline() at fork_trampoline+0xa
Uptime: 1s
timeout stopping cpus
panic: mtx_lock() by idle thread 0xffffffc05ce

However even if it did fix those panics, I still don't think it would've been an adequate solution since that situation (i.e., spurious fault in a non-pmap lock-safe context) can still happen.
Thank you for suggesting pmap_klookup, I think something along those lines is the proper solution here, I'll see what I can do.

In D56925#1303962, @bnovkov wrote:
In D56925#1303888, @jrtc27 wrote:

The Vector spec does not require any ordering on the memory accesses performed, so a legal implementation could alternate which address faults and this would never make forward progress.

Thank you for pointing this out, I wasn't aware of that limitation.

In D56925#1303898, @markj wrote:

Well, e.g., pmap_enter() doesn't issue sfence_vma() after installing the new PTE, so Bojan's explanation seems plausible. Does the panic go away if we do that instead, something like the patch below?
Alternately, we can try to make pmap_fault() avoid acquiring the lock for the kernel pmap, see how it uses pmap_klookup() on arm64 to deal with a similar problem.

This does fix the vtbuf panic, but I immediately run into a similar thing when the APs are launched:
panic: mtx_lock() by idle thread 0xffffffc05ce01140 on mutex 0xffffffc000bd8c80 @ /usr/src/sys/riscv/riscv/pmap.c:2942
cpuid = 2
time = 1
KDB: stack backtrace:
db_trace_self() at db_trace_self
db_trace_self_wrapper() at db_trace_self_wrapper+0x36
kdb_backtrace() at kdb_backtrace+0x2c
vpanic() at vpanic+0x134
panic() at panic+0x26
$x() at $x+0x6e
pmap_fault() at pmap_fault+0x52
page_fault_handler() at page_fault_handler+0x11e
do_trap_supervisor() at do_trap_supervisor+0x6c
cpu_exception_handler_supervisor() at cpu_exception_handler_supervisor+0x74
--- exception 13, tval = 0xffffffc002719210
cpu_search_highest() at cpu_search_highest+0xdc
sched_ule_idletd() at sched_ule_idletd+0x154
fork_exit() at fork_exit+0x68
fork_trampoline() at fork_trampoline+0xa
Uptime: 1s
timeout stopping cpus
panic: mtx_lock() by idle thread 0xffffffc05ce
However even if it did fix those panics, I still don't think it would've been an adequate solution since that situation (i.e., spurious fault in a non-pmap lock-safe context) can still happen.
Thank you for suggesting pmap_klookup, I think something along those lines is the proper solution here, I'll see what I can do.

I'll point out though that there's a lot of complexity introduced by the possibility of transient faults on arbitrary kernel addresses. Like, if it's possible to take a transient fault while dereferencing curthread, life becomes tricky; see the uses of NO_PERTHREAD_SSP in arm64 platform code. So I do wonder if it wouldn't be better to just issue a TLB shootdown when a kernel_pmap PTE changes from invalid to valid.

Rework patch - flushing the TLB when inserting a new mapping fixes the original issue.

markj added inline comments.May 14 2026, 4:07 PM

sys/riscv/riscv/pmap.c
3504	If you're going with this approach, don't you need to do this on all harts?

bnovkov edited the test plan for this revision. (Show Details)May 14 2026, 4:09 PM

Address @markj 's comments.

sys/riscv/riscv/pmap.c
3504	hm, issuing a TLB shootdown is what I did initially until some premature optimization thoughts kicked in because that seemed a bit excessive. But, I think you're right, this should be done on all harts.

markj added inline comments.May 14 2026, 6:05 PM

sys/riscv/riscv/pmap.c
3504	Unfortunately there are other places that need to be updated: pmap_enter_l2(), pmap_enter_quick(), ??? pmap_qenter() already handles this. Thinking about this some more, I suspect that the problems with dereferencing curthread on arm64 aren't going to arise even if you solve this problem with a lockless lookup for the kernel pmap in the fault handler: pmap_activate() issues sfence_vma(), so when a new thread first runs, it'll have already issued a fence, which should be enough.

Address @markj 's comment.

The approach seems sensible to me, given the discussion above.

I think the same/similar short explanatory comment ought to be added to all instances of this handling, especially since the behaviour deviates from other pmap implementations.

Between you and markj I will trust that all relevant entry points are covered here.

markj added inline comments.Jun 4 2026, 3:33 PM

sys/riscv/riscv/pmap.c
545	Don't you need a fence here too?
3504	I meant that doing a lockless lookup from pmap_fault() is probably better IMO than annotating all of the places where we install a kernel PTE. The latter seems to be rather tricky, see my other comments.
3504	As I understand it, pmap_invalidate_page() will just issue `sfence.vma <va>` on all harts, and this only invalidates cached translation structures for leaf page table entries, not higher level entries which may have been installed as part of this operation.
3930	How can oldl3 be 0 here?

Address @markj's comments.

@markj after chasing down a few issues you pointed out, I've decided to opt for the pmap_klookup approach instead since I think it's a more straightforward solution.

The patch survived a full buildkernel run.

sys/riscv/riscv/pmap.c
3504	I meant that doing a lockless lookup from pmap_fault() is probably better IMO than annotating all of the places where we install a kernel PTE. The latter seems to be rather tricky, see my other comments. Ah, sorry, I didn't get that from your initial comment, I'd also prefer a pmap_klookup() equivalent.

markj added inline comments.Jun 12 2026, 12:06 AM

sys/riscv/riscv/pmap.c
2915	I would suggest making this look more like `pmap_pte()` on amd64 and arm64: return the PTE directly, or NULL if there is no valid PTE for the VA. Let the caller load the value.
2938	This comment could use more detail: we don't perform an sfence.vma after storing a PTE, so spurious faults are possible, and locking the kernel pmap is unsafe in this context; we might be in a critical section, or we might already hold the pmap lock. The pmap lock is a non-sleepable lock (it's just a mtx(9)), so the bit about a non-sleepable context doesn't quite make sense. (I realize that the witness message in the review description refers to the lock as a "sleep mutex", which is kind of confusing. That just means it isn't a spin mutex.) It's safe to look up the PTE without the pmap lock because kernel PTPs are always preallocated and never freed.
2947	It's possible for PTE_D to be unset too: if an exec args buffer is paged out, the mappings will be marked clean.

Address @markj 's comments.

markj added inline comments.Jun 14 2026, 4:30 PM

sys/riscv/riscv/pmap.c
2915	Having _unlocked in the name is kind of strange, since the pmap will only be unlocked if it's the kernel pmap. I think pmap_fault_lookup() is fine.
2951	The comment is just restating the assertion.
2955	Now this assertion is effectively stating that the PTE has both the R and W bits set. Why should we assert that? What if the mapping is read-only? I don't think it's useful to assert anything in particular. I think the control flow can be simpler: Is `(oldpte & PTE_KERN) == PTE_KERN`? If so, fence and return. This is not a spurious fault, acquire the pmap lock and try again to resolve the fault.

Simplify control flow

mhorne accepted this revision.Wed, Jun 24, 1:44 PM

mhorne added inline comments.

sys/riscv/riscv/pmap.c
2945	Maybe?

markj accepted this revision.Wed, Jun 24, 2:17 PM

This revision was not accepted when it landed; it landed in state Needs Review.Fri, Jul 10, 4:04 PM

Closed by commit rG7a5ef20a977c: riscv/pmap.c: Handle spurious kernel page faults in critical sections (authored by bnovkov). · Explain Why

This revision was automatically updated to reflect the committed changes.

bnovkov added a commit: rG7a5ef20a977c: riscv/pmap.c: Handle spurious kernel page faults in critical sections.

riscv/pmap.c: Avoid spurious kernel page faults when inserting new PTEs
ClosedPublic
Actions

Details

Diff Detail

Event Timeline

Revision Contents
Changeset List

Diff 181728

sys/riscv/riscv/pmap.c

riscv/pmap.c: Avoid spurious kernel page faults when inserting new PTEsClosedPublicActions

Details

Diff Detail

Event Timeline

Revision ContentsChangeset List

Diff 181728

sys/riscv/riscv/pmap.c

riscv/pmap.c: Avoid spurious kernel page faults when inserting new PTEs
ClosedPublic
Actions

Revision Contents
Changeset List