arm64: Ditch arm64-specific unsound PCPU optimisation
ClosedPublic
Actions

Authored by jrtc27 on Apr 23 2026, 1:19 PM.

Details

Reviewers

andrew
manu
jhb

Group Reviewers

arm64
cheri

Commits

rG6311947d544a: arm64: Ditch arm64-specific unsound PCPU optimisation
rG9a0e8389269c: arm64: Ditch arm64-specific unsound PCPU optimisation
rGa176cf13531a: arm64: Ditch arm64-specific unsound PCPU optimisation
rG551d47c5677a: arm64: Ditch arm64-specific unsound PCPU optimisation

Summary

The current arm64 PCPU implementation uses a global register asm
variable to use x18, which we reserve with -ffixed-x18, from C. Inside a
critical_enter() or sched_pin(), it is vital that any PCPU reads use the
right PCPU pointer, as often the whole point of the critical_enter() or
sched_pin() is to ensure consistent PCPU use (e.g. for SMR it relies on
zpcpu giving the same SMR state). critical_enter() and sched_pin() both
include atomic_interrupt_fence(), i.e. asm volatile("" ::: "memory"),
barriers to ensure that memory accesses don't get moved by the compiler
outside the critical section, which on most architectures will also
order the read of the PCPU pointer itself (whether due to the read being
another asm volatile statement, or due to using a segment-relative
memory access as on x86). However, this approach on arm64 is in no sense
a memory access, and therefore the register access is not ordered with
respect to the the critical_enter() or sched_pin(), or more specifically
the curthread->td_critnest++ / curthread->td_pinned++ within.

In practice upstream today this works out ok because the read of x18 is
inlined into the actual PCPU_GET/ADD/SET memory accesses (i.e. you will
get something like ldr xN, [x18, #imm-or-xM] for PCPU_GET, etc.), and
since *that* instruction is ordered properly due to being a memory
access, the x18 ends up being read in the right place. However, that is
not in any way guaranteed, it just relies on the hope that compiler
optimisations will be perfect at inlining the use. Moreover, PCPU_PTR is
definitely not a memory access in this world, it's just pointer
arithmetic on x18, and so that has nothing ordering it. This can be
observed with the following test function compiled into the kernel:

void
pcpu_test(void)
{
        extern void __weak_symbol use_pcpu_ptr(void *);
        critical_enter();
        use_pcpu_ptr(PCPU_PTR(curthread));
        critical_exit();
}

Obviously, this is a bit contrived as you could just read curthread
directly via its atomic definition that bypasses any worries about PCPU
atomicity, but it illustrates the point. With the in-tree LLVM, this
ends up being compiled for me to:

paciasp
stp     x29, x30, [sp, #-0x10]!
mov     x29, sp
ldr     x8, [x18]
ldr     w9, [x8, #0x4fc]
mov     x0, x18
add     w9, w9, #0x1
str     w9, [x8, #0x4fc]
bl      use_pcpu_ptr
...

Note that, although the PCPU_PTR was within the critical section in the
C source, the read of x18 into x0, the argument register passed to
use_pcpu_ptr, has been hoisted to before the str, which is storing the
new, incremented, value of td_critnest to curthread, and so there is a
window within which we have to hope the thread is not preempted and
migrated to a different CPU, otherwise it will pass a pointer to the
wrong CPU's pc_curthread PCPU member.

Initially it would seem as though the solution to this would be to add
an additional barrier to critical_enter() / sched_pin() to ensure the
register reads could not be hoisted like this. However, I have not been
able to find a sequence that works reliably across both GCC and Clang,
independent of optimisation level. Using inline asm with x18 marked as a
clobber, using "=r"(pcpup), and using "+r"(pcpup) all run into various
issues; some combinations don't actually seem to be a barrier, and for
Clang at -O0 some combinations will actually generate writes to x18*, at
which point you then have to hope that the kernel is compiled with
optimisations, and that the redundant writes are optimised away such
that x18 is just passed through. But that just gets us back to hoping
optimisation works, which isn't a solution to the problem, it just
trades one point of fragility for another.

In talking to GCC developers, who seemed rather horrified by the
implications of trying to do this (which is effectively "register
volatile", a combination that's explicitly forbidden), we could not find
a solution to this, and so I have concluded that the only reliable to
have a sound PCPU implementation is to ditch this optimisation and
follow other non-x86 architectures in using inline asm in one form or
another; specifically, this adopts riscv's approach of just calling
get_pcpu(), which, curiously, was already implemented in inline asm here
on arm64, rather than reading pcpup.

Anyone who feels strongly enough about PCPU performance is welcome to
try to find a working approach, but such proposals should be heavily
scrutinised to be certain that they won't come back to bite us in
future. In particular, this caused a lot of problems downstream in
CheriBSD's experimental compartmentalised kernel, which is trialling
interposing on PCPU accesses in order to restrict access within
compartments. As a result, even PCPU_GET/SET/ADD can look like PCPU_PTR,
as they pass an opaque PCPU reference to wrapper functions, and so this
case gets hit all over the kernel, giving highly-confusing panics with
locks that aren't owned by the current CPU or SMR use allegedly not
within an smr_enter().

For "+r"(pcpup), Clang's initial code generation is to do:

mov xTtmp1, x18 mov x18, xTmp1 /* asm (empty) */ mov xTmp2, x18 mov x18, xTmp2

since its interpretation of what that means is "read the value of pcpup, and make sure that value is in x18 for the duration of the assembly due to the asm("x18") on pcpup", and similarly for the output side.

MFC after: 1 week

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

jrtc27 created this revision.Apr 23 2026, 1:19 PM

Herald added a reviewer: manu. · View Herald TranscriptApr 23 2026, 1:19 PM

Herald added subscribers: emaste, imp. · View Herald Transcript

jrtc27 requested review of this revision.Apr 23 2026, 1:19 PM

Herald added a reviewer: cheri. · View Herald TranscriptApr 23 2026, 1:19 PM

Harbormaster completed remote builds in B72468: Diff 176250.Apr 23 2026, 1:19 PM

jrtc27 edited the summary of this revision. (Show Details)Apr 23 2026, 1:34 PM

One nit: s/owned by the current CPU/owned by the current thread/ in the last paragraph. In general locking primitives in the kernel are "owned" by threads (we use curthread as the cookie) even though it is true that for spin mutexes they are defacto "owned" by CPUs as well.

I think the Alpha port might have also used this method for per-CPU access btw, but I think it is the only other architecture I can remember doing so: https://github.com/freebsd/freebsd-src/blob/stable/6/sys/alpha/include/pcpu.h#L45

Hmm, nope, I was wrong:

ia64 also declared a pcpu variable but used inline assembly for all accesses, though that is due to basically finding this same problem all over again: https://github.com/freebsd/freebsd-src/commit/e31ece45b7a40e790cea052a8e6765e2540eb29b

This revision is now accepted and ready to land.Apr 23 2026, 2:09 PM

andrew accepted this revision.Apr 27 2026, 11:42 AM

Closed by commit rG551d47c5677a: arm64: Ditch arm64-specific unsound PCPU optimisation (authored by jrtc27). · Explain WhyApr 27 2026, 12:54 PM

This revision was automatically updated to reflect the committed changes.

jrtc27 added a commit: rG551d47c5677a: arm64: Ditch arm64-specific unsound PCPU optimisation.

jrtc27 added a commit: rGa176cf13531a: arm64: Ditch arm64-specific unsound PCPU optimisation.Apr 30 2026, 5:43 PM

jrtc27 added a commit: rG9a0e8389269c: arm64: Ditch arm64-specific unsound PCPU optimisation.Apr 30 2026, 5:49 PM

jrtc27 added a commit: rG6311947d544a: arm64: Ditch arm64-specific unsound PCPU optimisation.Apr 30 2026, 5:54 PM