Page MenuHomeFreeBSD

riscv/pmap.c: Avoid spurious kernel page faults when inserting new PTEs
Needs ReviewPublic

Authored by bnovkov on Sun, May 10, 4:21 PM.
Tags
None
Referenced Files
F157043012: D56925.diff
Mon, May 18, 2:33 AM
Unknown Object (File)
Sun, May 17, 8:43 AM
Unknown Object (File)
Wed, May 13, 10:38 PM
Unknown Object (File)
Wed, May 13, 8:38 PM
Unknown Object (File)
Wed, May 13, 3:17 PM
Unknown Object (File)
Tue, May 12, 10:24 PM
Unknown Object (File)
Tue, May 12, 2:46 PM
Unknown Object (File)
Tue, May 12, 1:58 PM
Subscribers

Details

Reviewers
jrtc27
markj
Group Reviewers
riscv
Summary

The Privileged ISA specification permits caching of invalid PTEs
12.2.1. Supervisor Memory-Management Fence Instruction), which may
result in a spurious page fault. Such faults are handled by 'pmap_fault'
which locks the kernel pmap before inspecting and possibly updating the
offending L2 entry.
Unfortunately, spurious faults may also occur in non-sleepable
contexts (e.g., while holding a spinlock) where any attempt to grab
the pmap lock will result in a kernel panic.

Fix this avoidable panic by preemptively flushing the appropriate TLB
entry when inserting new PTEs into the kernel pmap.

Test Plan

This type of spurious fault was consistently observed on each boot during my initial attempts to boot -CURRENT on a Banana Pi BPI-F3 board.
Every boot would inevitably hit the same panic in vtbuf_grow():

panic: acquiring blockable sleep lock with spinlock or critical section held (sleep mutex) pmap @ /usr/src/sys/riscv/riscv/pmap.c:2944
cpuid = 0
time = 1
KDB: stack backtrace:
db_trace_self() at db_trace_self
db_trace_self_wrapper() at db_trace_self_wrapper+0x36
kdb_backtrace() at kdb_backtrace+0x2c
vpanic() at vpanic+0x16e
panic() at panic+0x26
witness_checkorder() at witness_checkorder+0xb70
__mtx_lock_flags() at __mtx_lock_flags+0x86
pmap_fault() at pmap_fault+0x4e
page_fault_handler() at page_fault_handler+0x11e
do_trap_supervisor() at do_trap_supervisor+0x6c
cpu_exception_handler_supervisor() at cpu_exception_handler_supervisor+0x74
--- exception 15, tval = 0xffffffc05d4263b8
memmove() at memmove+0xd6
vtbuf_grow() at vtbuf_grow+0x256
vt_change_font() at vt_change_font+0xfc
vt_resize() at vt_resize+0x88
vt_upgrade() at vt_upgrade+0x56a
mi_startup() at mi_startup+0x1e6
va() at va+0x60
KDB: enter: panic
[ thread pid 0 tid 100000 ]
Stopped at      kdb_enter+0x3a: sd      zero,316(s1)
db>

vtbuf_grow holds a spinlock while growing the buffer and consistently triggers a spurious page fault while copying the old rows, resulting in the panic listed above.
Applying this patch and D57003 fixes the boot sequence:

VT(efifb): resolution 1920x1080
SBI: OpenSBI v1.3
SBI Specification Version: 1.0
CPU 0  : Vendor=SpacemiT Core=SpacemiT(R) X60 (Hart 0)
  marchid=0x8000000058000001, mimpid=0x1000000049772200
  MMU: 0x1<Sv39>
  ISA: 0x112f<Atomic,Compressed,Double,Float,Mult/Div>
  S-mode Extensions: 0x1f<Sstc,Svnapot,Svpbmt,Svinval,Sscofpmf>
[[snip]]
sbi0: <RISC-V Supervisor Binary Interface>
cpulist0: <Open Firmware CPU Group> on ofwbus0
cpu0: <Open Firmware CPU> on cpulist0
intc0: <RISC-V Local Interrupt Controller> on ofwbus0
sbi_ipi0: <RISC-V SBI Inter-Processor Interrupts> on sbi0
plic0: <RISC-V PLIC> mem 0xe0000000-0xe3ffffff irq 9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24 on simplebus0
timer0: <RISC-V Timer>
Timecounter "RISC-V Timecounter" frequency 24000000 Hz quality 1000
Event timer "RISC-V Eventtimer" frequency 24000000 Hz quality 1000
rcons0: <RISC-V console>
Timecounters tick every 1.000 msec
usb_needs_explore_all: no devclass
sbi_ipi0: using for IPIs
Release APs
[[snip]]
mountroot>

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

jrtc27 requested changes to this revision.EditedSun, May 10, 4:32 PM
jrtc27 added a subscriber: jrtc27.

The Vector spec does not require any ordering on the memory accesses performed, so a legal implementation could alternate which address faults and this would never make forward progress.

But also, I don't think we should be doing anything that would allow spurious kernel faults like this in the first place? For userspace, sure, but for the kernel?

This revision now requires changes to proceed.Sun, May 10, 4:32 PM

The Vector spec does not require any ordering on the memory accesses performed, so a legal implementation could alternate which address faults and this would never make forward progress.

But also, I don't think we should be doing anything that would allow spurious kernel faults like this in the first place? For userspace, sure, but for the kernel?

Well, e.g., pmap_enter() doesn't issue sfence_vma() after installing the new PTE, so Bojan's explanation seems plausible. Does the panic go away if we do that instead, something like the patch below?

diff --git a/sys/riscv/riscv/pmap.c b/sys/riscv/riscv/pmap.c
index 1f3fd8c24b81..a74d99d5646b 100644
--- a/sys/riscv/riscv/pmap.c
+++ b/sys/riscv/riscv/pmap.c
@@ -3466,6 +3466,8 @@ pmap_enter(pmap_t pmap, vm_offset_t va, vm_page_t m, vm_prot_t prot,
                        vm_page_dirty(m);
        } else {
                pmap_store(l3, new_l3);
+               if (pmap == kernel_pmap)
+                       sfence_vma(); /* only updates the local TLB */
        }
 
 #if VM_NRESERVLEVEL > 0

Alternately, we can try to make pmap_fault() avoid acquiring the lock for the kernel pmap, see how it uses pmap_klookup() on arm64 to deal with a similar problem.

The Vector spec does not require any ordering on the memory accesses performed, so a legal implementation could alternate which address faults and this would never make forward progress.

Thank you for pointing this out, I wasn't aware of that limitation.

Well, e.g., pmap_enter() doesn't issue sfence_vma() after installing the new PTE, so Bojan's explanation seems plausible. Does the panic go away if we do that instead, something like the patch below?
Alternately, we can try to make pmap_fault() avoid acquiring the lock for the kernel pmap, see how it uses pmap_klookup() on arm64 to deal with a similar problem.

This does fix the vtbuf panic, but I immediately run into a similar thing when the APs are launched:

panic: mtx_lock() by idle thread 0xffffffc05ce01140 on mutex 0xffffffc000bd8c80 @ /usr/src/sys/riscv/riscv/pmap.c:2942
cpuid = 2
time = 1
KDB: stack backtrace:
db_trace_self() at db_trace_self
db_trace_self_wrapper() at db_trace_self_wrapper+0x36
kdb_backtrace() at kdb_backtrace+0x2c
vpanic() at vpanic+0x134
panic() at panic+0x26
$x() at $x+0x6e
pmap_fault() at pmap_fault+0x52
page_fault_handler() at page_fault_handler+0x11e
do_trap_supervisor() at do_trap_supervisor+0x6c
cpu_exception_handler_supervisor() at cpu_exception_handler_supervisor+0x74
--- exception 13, tval = 0xffffffc002719210
cpu_search_highest() at cpu_search_highest+0xdc
sched_ule_idletd() at sched_ule_idletd+0x154
fork_exit() at fork_exit+0x68
fork_trampoline() at fork_trampoline+0xa
Uptime: 1s
timeout stopping cpus
panic: mtx_lock() by idle thread 0xffffffc05ce

However even if it did fix those panics, I still don't think it would've been an adequate solution since that situation (i.e., spurious fault in a non-pmap lock-safe context) can still happen.
Thank you for suggesting pmap_klookup, I think something along those lines is the proper solution here, I'll see what I can do.

The Vector spec does not require any ordering on the memory accesses performed, so a legal implementation could alternate which address faults and this would never make forward progress.

Thank you for pointing this out, I wasn't aware of that limitation.

Well, e.g., pmap_enter() doesn't issue sfence_vma() after installing the new PTE, so Bojan's explanation seems plausible. Does the panic go away if we do that instead, something like the patch below?
Alternately, we can try to make pmap_fault() avoid acquiring the lock for the kernel pmap, see how it uses pmap_klookup() on arm64 to deal with a similar problem.

This does fix the vtbuf panic, but I immediately run into a similar thing when the APs are launched:

panic: mtx_lock() by idle thread 0xffffffc05ce01140 on mutex 0xffffffc000bd8c80 @ /usr/src/sys/riscv/riscv/pmap.c:2942
cpuid = 2
time = 1
KDB: stack backtrace:
db_trace_self() at db_trace_self
db_trace_self_wrapper() at db_trace_self_wrapper+0x36
kdb_backtrace() at kdb_backtrace+0x2c
vpanic() at vpanic+0x134
panic() at panic+0x26
$x() at $x+0x6e
pmap_fault() at pmap_fault+0x52
page_fault_handler() at page_fault_handler+0x11e
do_trap_supervisor() at do_trap_supervisor+0x6c
cpu_exception_handler_supervisor() at cpu_exception_handler_supervisor+0x74
--- exception 13, tval = 0xffffffc002719210
cpu_search_highest() at cpu_search_highest+0xdc
sched_ule_idletd() at sched_ule_idletd+0x154
fork_exit() at fork_exit+0x68
fork_trampoline() at fork_trampoline+0xa
Uptime: 1s
timeout stopping cpus
panic: mtx_lock() by idle thread 0xffffffc05ce

However even if it did fix those panics, I still don't think it would've been an adequate solution since that situation (i.e., spurious fault in a non-pmap lock-safe context) can still happen.
Thank you for suggesting pmap_klookup, I think something along those lines is the proper solution here, I'll see what I can do.

I'll point out though that there's a lot of complexity introduced by the possibility of transient faults on arbitrary kernel addresses. Like, if it's possible to take a transient fault while dereferencing curthread, life becomes tricky; see the uses of NO_PERTHREAD_SSP in arm64 platform code. So I do wonder if it wouldn't be better to just issue a TLB shootdown when a kernel_pmap PTE changes from invalid to valid.

bnovkov retitled this revision from riscv/pmap.c: Handle spurious kernel page faults in non-sleepable contexts to riscv/pmap.c: Avoid spurious kernel page faults when inserting new PTEs.
bnovkov edited the summary of this revision. (Show Details)
bnovkov edited the test plan for this revision. (Show Details)

Rework patch - flushing the TLB when inserting a new mapping fixes the original issue.

sys/riscv/riscv/pmap.c
3480

If you're going with this approach, don't you need to do this on all harts?

bnovkov marked an inline comment as done.

Address @markj 's comments.

sys/riscv/riscv/pmap.c
3480

hm, issuing a TLB shootdown is what I did initially until some premature optimization thoughts kicked in because that seemed a bit excessive.
But, I think you're right, this should be done on all harts.

sys/riscv/riscv/pmap.c
3480

Unfortunately there are other places that need to be updated: pmap_enter_l2(), pmap_enter_quick(), ???

pmap_qenter() already handles this.

Thinking about this some more, I suspect that the problems with dereferencing curthread on arm64 aren't going to arise even if you solve this problem with a lockless lookup for the kernel pmap in the fault handler: pmap_activate() issues sfence_vma(), so when a new thread first runs, it'll have already issued a fence, which should be enough.