Page MenuHomeFreeBSD

sched: separate out schedinit_ap()
ClosedPublic

Authored by kevans on Nov 2 2021, 8:09 AM.
Tags
None
Referenced Files
Unknown Object (File)
Sat, Mar 30, 3:55 PM
Unknown Object (File)
Mar 18 2024, 10:33 PM
Unknown Object (File)
Mar 5 2024, 2:57 AM
Unknown Object (File)
Feb 6 2024, 9:51 AM
Unknown Object (File)
Jan 27 2024, 3:09 AM
Unknown Object (File)
Jan 27 2024, 3:09 AM
Unknown Object (File)
Jan 27 2024, 3:09 AM
Unknown Object (File)
Jan 27 2024, 3:09 AM

Details

Summary

schedinit_ap() sets up an AP for a later call to sched_throw(NULL).

Currently, ULE sets up some pcpu bits and fixes the idlethread lock with
a call to sched_throw(NULL); this creates a window where curthread is
setup but it has the wrong td_lock. Typical platform AP startup
procedure looks something like:

  • Setup curthread
  • ... other stuff, including cpu_initclocks_ap()
  • Signal smp_started
  • sched_throw(NULL) to enter the scheduler

cpu_initclocks_ap() may have callouts to process (e.g., nvme) and
attempt to sched_add() for this AP, but this attempt fails because
of the noted violated assumption leading to locking heartburn in
sched_setpreempt().

Interrupts are still disabled until cpu_throw() so we're not really at
risk of being preempted -- just let the scheduler in on it a little
earlier.

Test Plan

non-arm64 platforms are untested...

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

kevans requested review of this revision.Nov 2 2021, 8:09 AM

I wonder if there's a generic way we could handle this as a quick check shows the same issue on other architectures when we don't have EARLY_AP_STARTUP.

This revision is now accepted and ready to land.Nov 2 2021, 12:39 PM

I'm sorry, but I still don't understand the problem. Which scheduler data structures are not initialized at SI_SUB_SMP? Note that "at init_secondary() time" is ambiguous: APs are started fairly early during boot, but they won't call cpu_initclocks_ap() until aps_ready is set, which happens at SI_SUB_SMP, i.e., quite late, and a point where it's perfectly safe to schedule other threads. But we shouldn't be switching to other threads unless the idlethread's spinlock count drops to 0, which shouldn't happen until we reach fork_exit().

I'm sorry, but I still don't understand the problem. Which scheduler data structures are not initialized at SI_SUB_SMP? Note that "at init_secondary() time" is ambiguous: APs are started fairly early during boot, but they won't call cpu_initclocks_ap() until aps_ready is set, which happens at SI_SUB_SMP, i.e., quite late, and a point where it's perfectly safe to schedule other threads. But we shouldn't be doing that unless the idlethread's spinlock count drops to 0, which shouldn't happen until we reach fork_exit().

Sorry, the description ended up being really bad/inaccurate. I'll rewrite it, but here's the breakdown:

  • init_secondary() -> cpu_initclocks_ap() -> handleevents() -> callout_process() -> swi_sched() -> intr_event_schedule_thread() () -> sched_add() -> ... -> sched_setpreempt() -> asserts on curthread->td_lock *before* init_secondary has called sched_throw(NULL)
  • curthread == idlethread at that time
  • idlethread->td_lock isn't setup until later, at https://cgit.freebsd.org/src/tree/sys/kern/sched_ule.c#n2986

(edit: looking at it again, I think I missed something still... it made a lot more sense in the middle of the night)

I'm sorry, but I still don't understand the problem. Which scheduler data structures are not initialized at SI_SUB_SMP? Note that "at init_secondary() time" is ambiguous: APs are started fairly early during boot, but they won't call cpu_initclocks_ap() until aps_ready is set, which happens at SI_SUB_SMP, i.e., quite late, and a point where it's perfectly safe to schedule other threads. But we shouldn't be doing that unless the idlethread's spinlock count drops to 0, which shouldn't happen until we reach fork_exit().

Sorry, the description ended up being really bad/inaccurate. I'll rewrite it, but here's the breakdown:

  • init_secondary() -> cpu_initclocks_ap() -> handleevents() -> callout_process() -> swi_sched() -> intr_event_schedule_thread() () -> sched_add() -> ... -> sched_setpreempt() -> asserts on curthread->td_lock *before* init_secondary has called sched_throw(NULL)
  • curthread == idlethread at that time
  • idlethread->td_lock isn't setup until later, at https://cgit.freebsd.org/src/tree/sys/kern/sched_ule.c#n2986

(edit: looking at it again, I think I missed something still... it made a lot more sense in the middle of the night)

That's interesting. A couple of observations:

  • init_secondary() sets curthread to the AP's idle thread. The idle threads are created in idle_setup(). When creating a new thread, we _do_set td_lock, in sched_fork_thread(). So init_secondary() should be running with an initialized td_lock. But, it's the wrong lock initially: ULE uses per-CPU locks and the new thread's lock is inherited from the creating thread's. When the BSP creates idle threads, they'll all have td_lock set to CPU 0's lock. This gets corrected later as you noted, but not soon enough.
  • Nothing actually acquires curthread's lock in the call path you described. The code is assuming that curthread's td_lock is the same as td's td_lock. Note in particular that sched_pickcpu() will just return the current CPU since smp_started is still false when cpu_initclocks_ap() is called. So there is a window where curthread's td_lock is wrong, and that violates the assumption of sched_add().

So I guess the question is, do we want to fix up scheduler state earlier in init_secondary(), or go with an approach like the one in this patch?

I'm sorry, but I still don't understand the problem. Which scheduler data structures are not initialized at SI_SUB_SMP? Note that "at init_secondary() time" is ambiguous: APs are started fairly early during boot, but they won't call cpu_initclocks_ap() until aps_ready is set, which happens at SI_SUB_SMP, i.e., quite late, and a point where it's perfectly safe to schedule other threads. But we shouldn't be doing that unless the idlethread's spinlock count drops to 0, which shouldn't happen until we reach fork_exit().

Sorry, the description ended up being really bad/inaccurate. I'll rewrite it, but here's the breakdown:

  • init_secondary() -> cpu_initclocks_ap() -> handleevents() -> callout_process() -> swi_sched() -> intr_event_schedule_thread() () -> sched_add() -> ... -> sched_setpreempt() -> asserts on curthread->td_lock *before* init_secondary has called sched_throw(NULL)
  • curthread == idlethread at that time
  • idlethread->td_lock isn't setup until later, at https://cgit.freebsd.org/src/tree/sys/kern/sched_ule.c#n2986

(edit: looking at it again, I think I missed something still... it made a lot more sense in the middle of the night)

That's interesting. A couple of observations:

  • init_secondary() sets curthread to the AP's idle thread. The idle threads are created in idle_setup(). When creating a new thread, we _do_set td_lock, in sched_fork_thread(). So init_secondary() should be running with an initialized td_lock. But, it's the wrong lock initially: ULE uses per-CPU locks and the new thread's lock is inherited from the creating thread's. When the BSP creates idle threads, they'll all have td_lock set to CPU 0's lock. This gets corrected later as you noted, but not soon enough.
  • Nothing actually acquires curthread's lock in the call path you described. The code is assuming that curthread's td_lock is the same as td's td_lock. Note in particular that sched_pickcpu() will just return the current CPU since smp_started is still false when cpu_initclocks_ap() is called. So there is a window where curthread's td_lock is wrong, and that violates the assumption of sched_add().

So I guess the question is, do we want to fix up scheduler state earlier in init_secondary(), or go with an approach like the one in this patch?

Let's pull in, maybe, @jhb and @kib as well, since other platforms (including x86) with EARLY_AP_STARTUP unset are also open to it.

Adding a sched_init_ap() that basically does the td == NULL branch (leaving the td != NULL branch) of sched_throw() seems like it'd be sufficient and, if I'm understanding this right, allow us to just handle the pending nvme callout without an initial trip through the idle thread? That'd be less ugly, too.

Adding a sched_init_ap() that basically does the td == NULL branch (leaving the td != NULL branch) of sched_throw() seems like it'd be sufficient and, if I'm understanding this right, allow us to just handle the pending nvme callout without an initial trip through the idle thread? That'd be less ugly, too.

Sorry, last comment for now... this also does what I want, if that's a little more palatable: https://people.freebsd.org/~kevans/sched.diff -- then we have no problem scheduling the callout and it's a little more elegant, while being much less invasive looking. Only arm64 done here, but I can fix up the other archs if that's the route we want to go.

Adding a sched_init_ap() that basically does the td == NULL branch (leaving the td != NULL branch) of sched_throw() seems like it'd be sufficient and, if I'm understanding this right, allow us to just handle the pending nvme callout without an initial trip through the idle thread? That'd be less ugly, too.

Sorry, last comment for now... this also does what I want, if that's a little more palatable: https://people.freebsd.org/~kevans/sched.diff -- then we have no problem scheduling the callout and it's a little more elegant, while being much less invasive looking. Only arm64 done here, but I can fix up the other archs if that's the route we want to go.

I like this approach. In general I would put this call immediately after the code in init_secondary() which sets curthread.

Switch to initializing these bits of ULE state earlier instead -- phab metadata
fix incoming

This revision now requires review to proceed.Nov 2 2021, 6:30 PM
kevans retitled this revision from arm64: only call cpu_initclocks_ap() after SMP setup to sched: separate out schedinit_ap().Nov 2 2021, 6:31 PM
kevans edited the summary of this revision. (Show Details)
kevans edited the test plan for this revision. (Show Details)
kevans added reviewers: PowerPC, jhb, kib.
sys/powerpc/aim/mp_cpudep.c
138

Typo here and below?

markj added inline comments.
sys/powerpc/aim/mp_cpudep.c
138
sys/powerpc/booke/mp_cpudep.c
89
This revision is now accepted and ready to land.Nov 2 2021, 6:47 PM
kevans marked 3 inline comments as done.
kevans edited the summary of this revision. (Show Details)

Fix typos in powerpc

This revision now requires review to proceed.Nov 2 2021, 6:52 PM
sys/powerpc/aim/mp_cpudep.c
138

*sigh* yeah, sorry. :-( powerpc had thrown me off a little bit because I had to dig around a little more to make sure I didn't miss something other than these two spots and braino'd it.

sys/kern/sched_ule.c
1747

You might consider having a comment here to explain why this is needed before sched_throw(NULL).

sys/powerpc/aim/mp_cpudep.c
39

*sigh* botched again, needs sys/sched.h here and below... tinderbox running.

Added comment, missing includes, & tinderboxed

This revision is now accepted and ready to land.Nov 2 2021, 9:58 PM

If you are adding the stuff to be called from initsecondary(), can we try to remove td == NULL code from sched_throw() at all, moving it all to schedinit_ap()?

sys/kern/sched_ule.c
2996

I suspect this note would be more appropriate in the comment above schedinit_ap()

2997–2998

I think 'the correct' part of the comment is not quite needed anymore. Perhaps the whole part about lock can be dropped, and 'correct spinlock nesting' moved just to spinlock_exit() line.

3005

I suspect that this LOCKPTR_ASSERT() is now valid always.

In D32797#740443, @kib wrote:

If you are adding the stuff to be called from initsecondary(), can we try to remove td == NULL code from sched_throw() at all, moving it all to schedinit_ap()?

We can't correct the spinlock nesting from schedinit_ap() because we can't hold the sched lock from there all the way to sched_throw() without recursing on it in between if there's a callout pending. We can't really drop it completely or anything that uses spinlocks between the two risks being preempted prematurely upon dropping the lock. =(

kevans marked 2 inline comments as done.

Add TDQ_SELF() comment to the general schedinit_ap() note, and drop the bits
about 'the correct lock'.

This revision now requires review to proceed.Nov 3 2021, 6:56 AM
sys/kern/sched_ule.c
3005

td == NULL in the other path, though.

This revision is now accepted and ready to land.Nov 3 2021, 8:21 AM

Adding a sched_init_ap() that basically does the td == NULL branch (leaving the td != NULL branch) of sched_throw() seems like it'd be sufficient and, if I'm understanding this right, allow us to just handle the pending nvme callout without an initial trip through the idle thread?

That's right. The choosethread() call in sched_throw() ought to select the softclock thread that was scheduled when calling cpu_initclocks_ap(), unless there's an even higher priority thread waiting to run.

alfredo added a subscriber: alfredo.

I boot-tested this patch and no regression found on:

  • QEMU VM POWER9 with hardware acceleration (hw.ncpu: 144)
  • bare metal POWER9 hw.ncpu: 16
This revision was automatically updated to reflect the committed changes.