Details

Reviewers

andrew
markj
manu
jhb
kib
alfredo

Group Reviewers

arm64
PowerPC

Commits

rG589aed00e36c: sched: separate out schedinit_ap()

Summary

schedinit_ap() sets up an AP for a later call to sched_throw(NULL).

Currently, ULE sets up some pcpu bits and fixes the idlethread lock with
a call to sched_throw(NULL); this creates a window where curthread is
setup but it has the wrong td_lock. Typical platform AP startup
procedure looks something like:

Setup curthread
... other stuff, including cpu_initclocks_ap()
Signal smp_started
sched_throw(NULL) to enter the scheduler

cpu_initclocks_ap() may have callouts to process (e.g., nvme) and
attempt to sched_add() for this AP, but this attempt fails because
of the noted violated assumption leading to locking heartburn in
sched_setpreempt().

Interrupts are still disabled until cpu_throw() so we're not really at
risk of being preempted -- just let the scheduler in on it a little
earlier.

Test Plan

non-arm64 platforms are untested...

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

kevans created this revision.Nov 2 2021, 8:09 AM

Herald added a reviewer: manu. · View Herald TranscriptNov 2 2021, 8:09 AM

Herald added subscribers: emaste, imp. · View Herald Transcript

kevans requested review of this revision.Nov 2 2021, 8:09 AM

Harbormaster completed remote builds in B42513: Diff 97848.Nov 2 2021, 8:09 AM

kevans mentioned this in D32743: nvme: avoid callout_reset_on in early boot.Nov 2 2021, 8:11 AM

kevans added a reviewer: arm64.Nov 2 2021, 8:31 AM

I wonder if there's a generic way we could handle this as a quick check shows the same issue on other architectures when we don't have EARLY_AP_STARTUP.

This revision is now accepted and ready to land.Nov 2 2021, 12:39 PM

I'm sorry, but I still don't understand the problem. Which scheduler data structures are not initialized at SI_SUB_SMP? Note that "at init_secondary() time" is ambiguous: APs are started fairly early during boot, but they won't call cpu_initclocks_ap() until aps_ready is set, which happens at SI_SUB_SMP, i.e., quite late, and a point where it's perfectly safe to schedule other threads. But we shouldn't be switching to other threads unless the idlethread's spinlock count drops to 0, which shouldn't happen until we reach fork_exit().

In D32797#740166, @markj wrote:

I'm sorry, but I still don't understand the problem. Which scheduler data structures are not initialized at SI_SUB_SMP? Note that "at init_secondary() time" is ambiguous: APs are started fairly early during boot, but they won't call cpu_initclocks_ap() until aps_ready is set, which happens at SI_SUB_SMP, i.e., quite late, and a point where it's perfectly safe to schedule other threads. But we shouldn't be doing that unless the idlethread's spinlock count drops to 0, which shouldn't happen until we reach fork_exit().

Sorry, the description ended up being really bad/inaccurate. I'll rewrite it, but here's the breakdown:

init_secondary() -> cpu_initclocks_ap() -> handleevents() -> callout_process() -> swi_sched() -> intr_event_schedule_thread() () -> sched_add() -> ... -> sched_setpreempt() -> asserts on curthread->td_lock *before* init_secondary has called sched_throw(NULL)
curthread == idlethread at that time
idlethread->td_lock isn't setup until later, at https://cgit.freebsd.org/src/tree/sys/kern/sched_ule.c#n2986

(edit: looking at it again, I think I missed something still... it made a lot more sense in the middle of the night)

In D32797#740185, @kevans wrote:

In D32797#740166, @markj wrote:

I'm sorry, but I still don't understand the problem. Which scheduler data structures are not initialized at SI_SUB_SMP? Note that "at init_secondary() time" is ambiguous: APs are started fairly early during boot, but they won't call cpu_initclocks_ap() until aps_ready is set, which happens at SI_SUB_SMP, i.e., quite late, and a point where it's perfectly safe to schedule other threads. But we shouldn't be doing that unless the idlethread's spinlock count drops to 0, which shouldn't happen until we reach fork_exit().

Sorry, the description ended up being really bad/inaccurate. I'll rewrite it, but here's the breakdown:

init_secondary() -> cpu_initclocks_ap() -> handleevents() -> callout_process() -> swi_sched() -> intr_event_schedule_thread() () -> sched_add() -> ... -> sched_setpreempt() -> asserts on curthread->td_lock *before* init_secondary has called sched_throw(NULL)

curthread == idlethread at that time

idlethread->td_lock isn't setup until later, at https://cgit.freebsd.org/src/tree/sys/kern/sched_ule.c#n2986

(edit: looking at it again, I think I missed something still... it made a lot more sense in the middle of the night)

That's interesting. A couple of observations:

init_secondary() sets curthread to the AP's idle thread. The idle threads are created in idle_setup(). When creating a new thread, we _do_set td_lock, in sched_fork_thread(). So init_secondary() should be running with an initialized td_lock. But, it's the wrong lock initially: ULE uses per-CPU locks and the new thread's lock is inherited from the creating thread's. When the BSP creates idle threads, they'll all have td_lock set to CPU 0's lock. This gets corrected later as you noted, but not soon enough.
Nothing actually acquires curthread's lock in the call path you described. The code is assuming that curthread's td_lock is the same as td's td_lock. Note in particular that sched_pickcpu() will just return the current CPU since smp_started is still false when cpu_initclocks_ap() is called. So there is a window where curthread's td_lock is wrong, and that violates the assumption of sched_add().

So I guess the question is, do we want to fix up scheduler state earlier in init_secondary(), or go with an approach like the one in this patch?

In D32797#740186, @markj wrote:

In D32797#740185, @kevans wrote:

In D32797#740166, @markj wrote:

I'm sorry, but I still don't understand the problem. Which scheduler data structures are not initialized at SI_SUB_SMP? Note that "at init_secondary() time" is ambiguous: APs are started fairly early during boot, but they won't call cpu_initclocks_ap() until aps_ready is set, which happens at SI_SUB_SMP, i.e., quite late, and a point where it's perfectly safe to schedule other threads. But we shouldn't be doing that unless the idlethread's spinlock count drops to 0, which shouldn't happen until we reach fork_exit().

Sorry, the description ended up being really bad/inaccurate. I'll rewrite it, but here's the breakdown:

init_secondary() -> cpu_initclocks_ap() -> handleevents() -> callout_process() -> swi_sched() -> intr_event_schedule_thread() () -> sched_add() -> ... -> sched_setpreempt() -> asserts on curthread->td_lock *before* init_secondary has called sched_throw(NULL)

curthread == idlethread at that time

idlethread->td_lock isn't setup until later, at https://cgit.freebsd.org/src/tree/sys/kern/sched_ule.c#n2986

(edit: looking at it again, I think I missed something still... it made a lot more sense in the middle of the night)

That's interesting. A couple of observations:

init_secondary() sets curthread to the AP's idle thread. The idle threads are created in idle_setup(). When creating a new thread, we _do_set td_lock, in sched_fork_thread(). So init_secondary() should be running with an initialized td_lock. But, it's the wrong lock initially: ULE uses per-CPU locks and the new thread's lock is inherited from the creating thread's. When the BSP creates idle threads, they'll all have td_lock set to CPU 0's lock. This gets corrected later as you noted, but not soon enough.

Nothing actually acquires curthread's lock in the call path you described. The code is assuming that curthread's td_lock is the same as td's td_lock. Note in particular that sched_pickcpu() will just return the current CPU since smp_started is still false when cpu_initclocks_ap() is called. So there is a window where curthread's td_lock is wrong, and that violates the assumption of sched_add().

So I guess the question is, do we want to fix up scheduler state earlier in init_secondary(), or go with an approach like the one in this patch?

Let's pull in, maybe, @jhb and @kib as well, since other platforms (including x86) with EARLY_AP_STARTUP unset are also open to it.

Adding a sched_init_ap() that basically does the td == NULL branch (leaving the td != NULL branch) of sched_throw() seems like it'd be sufficient and, if I'm understanding this right, allow us to just handle the pending nvme callout without an initial trip through the idle thread? That'd be less ugly, too.

In D32797#740218, @kevans wrote:

Adding a sched_init_ap() that basically does the td == NULL branch (leaving the td != NULL branch) of sched_throw() seems like it'd be sufficient and, if I'm understanding this right, allow us to just handle the pending nvme callout without an initial trip through the idle thread? That'd be less ugly, too.

Sorry, last comment for now... this also does what I want, if that's a little more palatable: https://people.freebsd.org/~kevans/sched.diff -- then we have no problem scheduling the callout and it's a little more elegant, while being much less invasive looking. Only arm64 done here, but I can fix up the other archs if that's the route we want to go.

In D32797#740258, @kevans wrote:

In D32797#740218, @kevans wrote:

Adding a sched_init_ap() that basically does the td == NULL branch (leaving the td != NULL branch) of sched_throw() seems like it'd be sufficient and, if I'm understanding this right, allow us to just handle the pending nvme callout without an initial trip through the idle thread? That'd be less ugly, too.

Sorry, last comment for now... this also does what I want, if that's a little more palatable: https://people.freebsd.org/~kevans/sched.diff -- then we have no problem scheduling the callout and it's a little more elegant, while being much less invasive looking. Only arm64 done here, but I can fix up the other archs if that's the route we want to go.

I like this approach. In general I would put this call immediately after the code in init_secondary() which sets curthread.

Switch to initializing these bits of ULE state earlier instead -- phab metadata
fix incoming

This revision now requires review to proceed.Nov 2 2021, 6:30 PM

Herald added a subscriber: jhibbits. · View Herald TranscriptNov 2 2021, 6:30 PM

Harbormaster completed remote builds in B42527: Diff 97877.Nov 2 2021, 6:30 PM

kevans retitled this revision from arm64: only call cpu_initclocks_ap() after SMP setup to sched: separate out schedinit_ap().Nov 2 2021, 6:31 PM

kevans edited the summary of this revision. (Show Details)

kevans edited the test plan for this revision. (Show Details)

kevans added reviewers: PowerPC, jhb, kib.

jhibbits added inline comments.Nov 2 2021, 6:46 PM

sys/powerpc/aim/mp_cpudep.c
138	Typo here and below?

markj accepted this revision.Nov 2 2021, 6:47 PM

markj added inline comments.

sys/powerpc/aim/mp_cpudep.c
138
sys/powerpc/booke/mp_cpudep.c
89

This revision is now accepted and ready to land.Nov 2 2021, 6:47 PM

Fix typos in powerpc

This revision now requires review to proceed.Nov 2 2021, 6:52 PM

Harbormaster completed remote builds in B42529: Diff 97880.Nov 2 2021, 6:52 PM

kevans added inline comments.Nov 2 2021, 6:52 PM

sys/powerpc/aim/mp_cpudep.c
138	sigh yeah, sorry. :-( powerpc had thrown me off a little bit because I had to dig around a little more to make sure I didn't miss something other than these two spots and braino'd it.

markj added inline comments.Nov 2 2021, 7:03 PM

sys/kern/sched_ule.c
1747	You might consider having a comment here to explain why this is needed before sched_throw(NULL).

kevans added inline comments.Nov 2 2021, 7:32 PM

sys/powerpc/aim/mp_cpudep.c
39	sigh botched again, needs sys/sched.h here and below... tinderbox running.

Added comment, missing includes, & tinderboxed

Harbormaster completed remote builds in B42531: Diff 97890.Nov 2 2021, 9:45 PM

markj accepted this revision.Nov 2 2021, 9:58 PM

This revision is now accepted and ready to land.Nov 2 2021, 9:58 PM

If you are adding the stuff to be called from initsecondary(), can we try to remove td == NULL code from sched_throw() at all, moving it all to schedinit_ap()?

sys/kern/sched_ule.c
2996	I suspect this note would be more appropriate in the comment above schedinit_ap()
2997–2998	I think 'the correct' part of the comment is not quite needed anymore. Perhaps the whole part about lock can be dropped, and 'correct spinlock nesting' moved just to spinlock_exit() line.
3005	I suspect that this LOCKPTR_ASSERT() is now valid always.

In D32797#740443, @kib wrote:

If you are adding the stuff to be called from initsecondary(), can we try to remove td == NULL code from sched_throw() at all, moving it all to schedinit_ap()?

We can't correct the spinlock nesting from schedinit_ap() because we can't hold the sched lock from there all the way to sched_throw() without recursing on it in between if there's a callout pending. We can't really drop it completely or anything that uses spinlocks between the two risks being preempted prematurely upon dropping the lock. =(

Add TDQ_SELF() comment to the general schedinit_ap() note, and drop the bits
about 'the correct lock'.

This revision now requires review to proceed.Nov 3 2021, 6:56 AM

Harbormaster completed remote builds in B42539: Diff 97904.Nov 3 2021, 6:56 AM

kevans added inline comments.Nov 3 2021, 6:57 AM

sys/kern/sched_ule.c
3005	`td == NULL` in the other path, though.

kib accepted this revision.Nov 3 2021, 8:21 AM

This revision is now accepted and ready to land.Nov 3 2021, 8:21 AM

In D32797#740218, @kevans wrote:

Adding a sched_init_ap() that basically does the td == NULL branch (leaving the td != NULL branch) of sched_throw() seems like it'd be sufficient and, if I'm understanding this right, allow us to just handle the pending nvme callout without an initial trip through the idle thread?

That's right. The choosethread() call in sched_throw() ought to select the softclock thread that was scheduled when calling cpu_initclocks_ap(), unless there's an even higher priority thread waiting to run.

I boot-tested this patch and no regression found on:

QEMU VM POWER9 with hardware acceleration (hw.ncpu: 144)
bare metal POWER9 hw.ncpu: 16

Closed by commit rG589aed00e36c: sched: separate out schedinit_ap() (authored by kevans). · Explain WhyNov 3 2021, 8:55 PM

This revision was automatically updated to reflect the committed changes.

kevans added a commit: rG589aed00e36c: sched: separate out schedinit_ap().

sched: separate out schedinit_ap()
ClosedPublic
Actions

Details

Diff Detail

Event Timeline

Revision Contents
Changeset List

Diff 97969

sys/arm/arm/mp_machdep.c

sys/arm64/arm64/mp_machdep.c

sys/kern/sched_4bsd.c

sys/kern/sched_ule.c

sys/mips/mips/mp_machdep.c

sys/powerpc/aim/mp_cpudep.c

sys/powerpc/booke/mp_cpudep.c

sys/riscv/riscv/mp_machdep.c

sys/sys/sched.h

sys/x86/x86/mp_x86.c

sched: separate out schedinit_ap()ClosedPublicActions

Details

Diff Detail

Event Timeline

Revision ContentsChangeset List

Diff 97969

sys/arm/arm/mp_machdep.c

sys/arm64/arm64/mp_machdep.c

sys/kern/sched_4bsd.c

sys/kern/sched_ule.c

sys/mips/mips/mp_machdep.c

sys/powerpc/aim/mp_cpudep.c

sys/powerpc/booke/mp_cpudep.c

sys/riscv/riscv/mp_machdep.c

sys/sys/sched.h

sys/x86/x86/mp_x86.c

sched: separate out schedinit_ap()
ClosedPublic
Actions

Revision Contents
Changeset List