
x86: add a safe variant of MSR_OP_SCHED* operations
ClosedPublic

Authored by kib on Feb 2 2026, 12:01 AM.
Details

Summary
The modifier executes the operation using the rdmsr_safe()/wrmsr_safe()
functions instead of plain MSR reads and writes, and returns EFAULT if any
MSR access caused a #GP.

x86_msr_op(9): consistently return the value read from the MSR

If the operation is executed on more than one CPU, a random instance of
the read value is returned.

x86_msr_op_one(9): execute SCHED ops in critical section and with realtime prio

This is to ensure that the thread gets scheduled on the target CPUs, and
is not moved away from them while doing non-atomic operations.


x86: provide extended description for x86_msr_op(9)

Diff Detail

Repository
rG FreeBSD src repository

Event Timeline

kib edited the summary of this revision. (Show Details)
kib added reviewers: markj, olce.

finalize the change

Minor clarifications to the description.

I do not think that converting the comment into a man page is warranted.

Execute SCHED ops in critical section

kib edited the summary of this revision. (Show Details)

Fix locking error with sched_bind()

I'm again not sure if this is the direction we want to go. As x86_msr_op() is about a single (read or read-modify-write) operation, it has a very short execution time. Not only does using sched_bind() seem like a big hammer for that (it has higher overhead), but also, depending on the priority of the calling thread, execution might be delayed a significant amount of time (with a real-time thread already executing there, possibly indefinitely). To me, and IIRC in one of our discussions for markj@ as well, the direction was more to deprecate using sched_bind() for this purpose, eventually retiring all MSR_OP_SCHED_* variants, whereas this patch *adds* new ones.

And, for a safe variant to work with MSR_OP_RENDEZVOUS*, then there's no other choice than to be able to take a fault on MSR access while interrupts are disabled, so we are back to D54996's discussion. I'm not sure why you mentioned GSBASE in the discussion; I read some of the interrupt handling assembly code long ago, and although I don't recall well I thought that the dances around this were already performed there, before we even reach trap(). And why faulting from controlled addresses, as is the case for rdmsr_safe() or wrmsr_safe(), even with interrupts disabled, would be wrong?

This discussion is really wider than just the MSR_OP_* topic, as we are already using *msr_safe() in callbacks triggered by smp_rendezvous() in at least hwpstate_amd(4).

In addition, for the potential use cases I'm seeing in hwpstate_amd(4) (though I don't want to insist too much on this example, since we are using custom handlers which are probably not replaceable: we are rarely accessing a single MSR at a time, and when we modify a single one, we start from some cached value that we want to access from the target CPU; additionally, execution on some target CPU with interrupts disabled is also used as a means to atomically modify the softc, although a critical section would suffice), I'd like to:

  • Use the rdmsr_safe() and wrmsr_safe() variants always.
  • When specifying a read-modify-write operation, on failure, know if it's the read or only the write that failed (this may prove important for MSR that do not tolerate some values).

It's really a good idea that it is possible to retrieve the MSR value from before it is modified. It would be a requirement for hwpstate_amd(4).

I'm again not sure if this is the direction we want to go. As x86_msr_op() is about a single (read or read-modify-write) operation, it has a very short execution time. Not only does using sched_bind() seem like a big hammer for that (it has higher overhead), but also, depending on the priority of the calling thread, execution might be delayed a significant amount of time (with a real-time thread already executing there, possibly indefinitely). To me, and IIRC in one of our discussions for markj@ as well, the direction was more to deprecate using sched_bind() for this purpose, eventually retiring all MSR_OP_SCHED_* variants, whereas this patch *adds* new ones.

Yes, IMO it is almost always wrong to use sched_bind() for short sections of code. I lost count of the number of times I had to debug problems caused by this pattern, but the first one was related to commit e42412a96695d0e9116de5c6ac75823d2711218d. For similar reasons, quiesce_all_cpus() should never be used either.

In D54996 there is a claim that FreeBSD is a realtime kernel, but this is not true in general. If a user thread binds to a specific CPU, there is no guarantee it will make progress. So if we want to have sysctl handlers which toggle unsafe MSRs, then either:

  • we should tolerate #GP with interrupts disabled (which we already do I believe), or
  • we need the scheduler to provide a hard guarantee that a thread bound to a specific CPU will be scheduled "soon", even if there is some high-priority realtime-class thread running on the CPU
sys/x86/x86/cpu_machdep.c
234

I don't really understand--why not? Can't we handle #GP with interrupts disabled?

Well, I said that the kernel is real-time, not that the system is. One of the reasons why it is not is the frivolous use of sections with disabled interrupts, and I do not see why we should add more of them.
For instance, RENDEZVOUS, which disables interrupts, must wait for all CPUs to leave their sections of disabled interrupts, which accumulates and causes even longer delays.

If the concern with SCHED is that it might be problematic to get scheduled on all cores due to priorities, I can add the prio bump around it (and restore it afterward, of course).

sys/x86/x86/cpu_machdep.c
234

Generally pcb_onfault is obliterated by the code that sets it up. It just happens that xxmsr_safe() preserves it.
I do not want to give the impression that RENDEZVOUS is callable from any context.

kib edited the summary of this revision. (Show Details)

Bump prio around SCHED

In D55045#1263241, @kib wrote:

Well, I said that the kernel is real-time, not that the system is. One of the reasons why it is not is the frivolous use of sections with disabled interrupts, and I do not see why we should add more of them.

Well, it will happen either way, thread_lock()+sched_bind() also disables interrupts.

For instance, RENDEZVOUS, which disables interrupts, must wait for all CPUs to leave their sections of disabled interrupts, which accumulates and causes even longer delays.

Maybe a stupid question, but why does smp_rendezvous() need to keep interrupts disabled while waiting for the target cores to finish? That is, why is smp_ipi_mtx a spin mutex and not a regular mutex?

If the concern with SCHED is that it might be problematic to get scheduled on all cores due to priorities, I can add the prio bump around it (and restore it afterward, of course).

I think this is a prerequisite for SCHED to be useful in general. But to be clear I am concerned about the case where an ithread or higher priority thread is monopolizing the target core, so the priority bump needs to be sufficient to interrupt any kernel thread. In that case, what's the advantage of scheduling on the target core instead of using an IPI? Just that the initiator does not need to disable interrupts as well?

sys/x86/x86/cpu_machdep.c
234

Well, smp_rendezvous() already requires the caller to have interrupts enabled, so it already cannot be called from some contexts.

In D55045#1263241, @kib wrote:

For instance, RENDEZVOUS, which disables interrupts, must wait for all CPUs to leave their sections of disabled interrupts, which accumulates and causes even longer delays.

If we are targeting all CPUs, sending an IPI to all of them quickly becomes much more efficient than having a single thread bind itself successively to the different CPUs as the number of cores increases.

Currently, doing the latter is still better than sending an IPI in terms of "real-time" behavior because interrupts are indeed periodically re-enabled around x86_msr_op_one(), but that seems to hint at the need for a higher-level form of rendez-vous.

If the concern with SCHED is that it might be problematic to get scheduled on all cores due to priorities, I can add the prio bump around it (and restore it afterward, of course).

PI_REALTIME is perhaps too strong, PI_SOFT seems sufficient. Even with PI_REALTIME, you could be delayed indefinitely in theory, but then there's the more pressing problem of a high-priority interrupt thread gone mad.

Well, it will happen either way, thread_lock()+sched_bind() also disables interrupts.

Yes, but here (current version), interrupts are re-enabled for a while around x86_msr_op_one(), so you could take interrupts there.

I think this is a prerequisite for SCHED to be useful in general. But to be clear I am concerned about the case where an ithread or higher priority thread is monopolizing the target core, so the priority bump needs to be sufficient to interrupt any kernel thread. In that case, what's the advantage of scheduling on the target core instead of using an IPI? Just that the initiator does not need to disable interrupts as well?

It's not only the initiator, but also the fact that interrupts cannot be re-enabled until we return from the IPI.

What I think we basically need is some higher-level rendez-vous API. Perhaps we could have system-wide task queues, one per core, each with a single executor thread with a sufficiently high priority, to which we would post a task. Something in fact similar to software-interrupt threads. We shouldn't need to play with sched_bind() and sched_prio() to get that effect.

Currently, doing the latter is still better than sending an IPI in terms of "real-time" behavior because interrupts are indeed periodically re-enabled around x86_msr_op_one(), but that seems to hint at the need for a higher-level form of rendez-vous.

I'm not even sure about that now. Binding on each CPU will cause a thread switch, which disables interrupts, and we'll have two of these per core (one for switch-in and one for switch-out), which in terms of duration of disabled interrupts may well prove longer than just the rendez-vous IPI and the small payload of handling the MSR, perhaps even for MSRs that are long to operate on.

And if this advantage vanishes, there does not remain much point in keeping the sched_bind()-based variants.

sys/x86/x86/cpu_machdep.c
234

Generally pcb_onfault is obliterated by the code that sets it up. It just happens that xxmsr_safe() preserves it.

Why does that make any difference? AFAIK, there is never any nesting of functions that set pcb_onfault.

Currently, doing the latter is still better than sending an IPI in terms of "real-time" behavior because interrupts are indeed periodically re-enabled around x86_msr_op_one(), but that seems to hint at the need for a higher-level form of rendez-vous.

I'm not even sure about that now. Binding on each CPU will cause a thread switch, which disables interrupts, and we'll have two of these per core (one for switch-in and one for switch-out), which in terms of duration of disabled interrupts may well prove longer than just the rendez-vous IPI and the small payload of handling the MSR, perhaps even for MSRs that are long to operate on.

Yes, this is the point I was trying to make.

OK, I give up in order to make some progress. I still disagree with this approach.

sys/x86/x86/cpu_machdep.c
234

If you are using rendezvous, the target might be in pcb_onfault section.

kib edited the summary of this revision. (Show Details)

Keep interrupts disabled around ops for SCHED.
Allow SAFE variants of RENDEZVOUS.

Feel free to ignore this suggestion: rather than having MSR_OP_SAFE, I would instead introduce x86_msr_op_safe(). Then x86_msr_op() can keep the void return type.

Maybe a stupid question, but why does smp_rendezvous() need to keep interrupts disabled while waiting for the target cores to finish? That is, why is smp_ipi_mtx a spin mutex and not a regular mutex?

I am still not sure about this BTW. I tried testing a patch to change it to a regular mutex and don't see any problems.

sys/x86/x86/cpu_machdep.c
144

If a->error == 0 and two threads try to set it to a non-zero value with atomic_cmpset_int(), is it guaranteed that one of them will succeed? I think the answer is probably yes on x86, but I'm not sure in general.

This revision is now accepted and ready to land. Thu, Feb 12, 2:29 PM

I'm not even sure about that now. Binding on each CPU will cause a thread switch, which disables interrupts, and we'll have two of these per core (one for switch-in and one for switch-out), which in terms of duration of disabled interrupts may well prove longer than just the rendez-vous IPI and the small payload of handling the MSR, perhaps even for MSRs that are long to operate on.

Yes, this is the point I was trying to make.

Given your formulation, I was wondering if you had missed the fact that interrupts were re-enabled temporarily around the MSR operation, i.e., between two successive sched_bind().

Consequently, re-enabling interrupts between two sched_bind() can be an advantage over sending an IPI in terms of minimum latency to interrupt if "half" a single bind (i.e., max of duration on the initial CPU and duration on the target CPU) is shorter than the whole process of sending an IPI, but even with that "half" part I doubt it.

kib marked an inline comment as done. Thu, Feb 12, 3:04 PM

Feel free to ignore this suggestion: rather than having MSR_OP_SAFE, I would instead introduce x86_msr_op_safe(). Then x86_msr_op() can keep the void return type.

I prefer to expose one function. There is no harm in returning an unused value.

Maybe a stupid question, but why does smp_rendezvous() need to keep interrupts disabled while waiting for the target cores to finish? That is, why is smp_ipi_mtx a spin mutex and not a regular mutex?

I am still not sure about this BTW. I tried testing a patch to change it to a regular mutex and don't see any problems.

As an immediate issue, using a sleeping mutex would allow the main thread to migrate, which makes the proposition of 'IPI all CPUs but itself' somewhat weak.

sys/x86/x86/cpu_machdep.c
144

I see two interpretations for your question.

First, on all arches, all atomics performed on the same location have a global order.

Second, the C standard defines weak and strong variants of CAS. I believe all our arches provide the strong version; e.g., arm64 loops if the store-release fails for !LSE. For LSE, the wording matches the Intel description: if the comparison succeeded, the write occurs.

olce added inline comments.
sys/x86/x86/cpu_machdep.c
144

Cannot be the case on x86. But if that's the case on other architectures, shouldn't we fix it directly in atomic_cmpset_int() (though perhaps the problem is then how to do that, and whether one size fits all)? atomic(9) mentions such a possibility for atomic_fcmpset_int() but seems unclear about atomic_cmpset_int(), although technically I don't think these should be any different.

kib marked 2 inline comments as done. Thu, Feb 12, 3:23 PM
kib added inline comments.
sys/x86/x86/cpu_machdep.c
144

fcmpset is weak, e.g. on arm64.

In D55045#1263800, @kib wrote:

I prefer to expose one function. There is no harm in returning an unused value.

Mmm, if there's only one function returning a value, there's always the suspicion that this value should be tested in all cases. Exposing two with different signatures avoids this problem. That seems better in general.

In D55045#1263800, @kib wrote:

Feel free to ignore this suggestion: rather than having MSR_OP_SAFE, I would instead introduce x86_msr_op_safe(). Then x86_msr_op() can keep the void return type.

I prefer to expose one function. There is no harm in returning an unused value.

The point is to avoid ambiguity about whether it's necessary to check the return value.

kib marked an inline comment as done. Thu, Feb 12, 3:29 PM
sys/x86/x86/cpu_machdep.c
144

And it is explicitly noted in the atomic.9 man page.

sys/x86/x86/cpu_machdep.c
144

I've just checked, and you're right. I find this discrepancy between atomic_cmpset_int() and atomic_fcmpset_int() rather strange and error-prone.

sys/x86/x86/cpu_machdep.c
144

This is what I was referring to initially, the fact that atomic_cmpset_int() and atomic_fcmpset_int() are different in the manual page.