Page MenuHomeFreeBSD

Avoid having PowerMacs ending up with stuck-sleeping threads: force some boot-time TB value relationships across sockets/cores.
AbandonedPublic

Authored by markmi_dsl-only.net on Jan 26 2020, 11:04 PM.
Referenced Files
Unknown Object (File)
Sun, Dec 15, 5:10 PM
Unknown Object (File)
Dec 5 2024, 4:50 AM
Unknown Object (File)
Nov 21 2024, 12:44 PM
Unknown Object (File)
Nov 21 2024, 12:43 PM
Unknown Object (File)
Nov 21 2024, 12:23 PM
Unknown Object (File)
Nov 13 2024, 8:34 PM
Unknown Object (File)
Nov 6 2024, 7:22 PM
Unknown Object (File)
Oct 2 2024, 8:34 PM

Details

Reviewers
jhibbits
nwhitehorn
Group Reviewers
PowerPC
Summary

Explicitly initialize the bsp's TB value early on. For PowerMacs with multiple sockets or cores only, have machdep_ap_bootstrap for each ap make a round trip with the bsp to get a current bsp's TB value and then shift the ap's time so that the midpoint of the shifted round tip interval approximately matches the TB value the bsp supplied. The round trips and data transfers use a memory protocol.

For non-PowerMacs, do the old way for the ap's.

Test Plan

I've been running based on the patches being invovled since late 2019-May or so: PowerMac11,2 with 2 sockets, 2 cores each, and PowerMac3,6 with 2 sockets, 1 core each. No examples of 1 socket with 2 cores, unfortunately. No examples of powerpc64 or powerpc non-PowerMacs either. I do use the same source tree to build for amd64, aarch64, and armv7 and they have been working too. But the patch material is extracted from a source tree that has more local changes: the testing has not been with only these changes.

I've not had any examples of the stuck-sleeping issue since adopting these changes. Nor have I had new problems from the changes.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

Note: I was not (and am not) style(9) literate when I did the investigation that eventually stabilized on this code. My information-handling capacity was limited. I fully expect this will need to be dealt with at some point. But I'd prefer to first have the technique(s) used be thought to be stable before focusing on the other type of knowledge.

I'm also okay if my material inspires something and the fix does not end up being my code.

FYI: The submittal was based on a head -r356426 context.

If there are non-PowerMac powerpc platforms that get the threads-stuck-sleeping
problem, then this type of solution may be insufficient for it to generalize for the specific
problem. For example, it is not obvious how accurate such would end up for a context
where NUMA can be relevant. I do not have a context for investigating anything that
modern.

Also, I was only targeting being sufficiently accurate across sockets/cores to avoid the
threads-stuck-sleeping problem, given that FreeBSD can not reasonably require arbitrary
accuracy for that. There could be applications where the scale of accuracy achieved is
insufficient for other reasons. If later a more accurate technique was to be implemented,
I'd suggest replacing this one.

My own evidence of PowerMac sufficiency for my intended purpose of avoiding the
threads-stuck-sleeping problem for multi-socket or mult-core PowerMacs spans
only what I have access to:

PowerMac11,2 2.5 GHz PowerPC 970MP 2 sockets, 2 cores each
PowerMac7,2 2.0 GHz PowerPC 970 2 sockets, 1 core each
PowerMac3,6 1.42 GHz PowerPC 7455 2 sockets, 1 core each

I've no evidence yet of insufficiency for these. All of them previously had
threads-stuck-sleeping examples. I'm hoping others can provide evidence one
way or the other for other multi-core or multi-socket PowerMac examples.

If this does not work as well for the fuller range PowerMacs as I expect,
there would still be the question of if it is appropriate as a partial/temporary
improvement vs. all(?) multi-socket (and multi-core?) PowerMacs having the
problem frequently. It would likely take significant time to get an alternative
ready. (For me there would be a fair amount of research involved to even get
started and the mean time per week applied would not be large as things
are.)

Yeah, I used to see this too (on either power9 or amigaone x5000, can't remember which.) IIRC, the problem is that if the tb is significantly unsynced, the callwheel algorithm falls over and misses wakeups for things.

Yeah, I used to see this too (on either power9 or amigaone x5000, can't remember which.) IIRC, the problem is that if the tb is significantly unsynced, the callwheel algorithm falls over and misses wakeups for things.

I'd only seen PowerMac reports. "used to": has it been fixed for where you saw it?
(If yes, any non-PowerMac failures currently?)

If it still can be a problem outside PowerMacs, then an "alternate_timebase_sync_style= 1;"
would need to happen in some additional context(s) or in a wider context to test
technique-sufficiency on those. But I do not have such a test context in my environment.

Yea, the end result is that sleepq_timeout ends up doing:

if (td->td_sleeptimo > sbinuptime() || td->td_sleeptimo == 0) {
        /*
         * The thread does not want a timeout (yet).
         */
} else . . .

and that leads to never trying again in the overall call structure: no
rescheduling. But simply rescheduling likely would just repeat the
process(?). (This is from memory of investigations back in early 2019.
Limited memory-accuracy is possible and the terminology might not
be the best.)

It makes sense to me that FreeBSD does require some scale of matching
in order to operate, although not arbitrarily small. The scale likely is hardware
speed dependent and time tick interval dependent. I never found anything
with explicit, documented criteria to be met for FreeBSD to be known to
work for these sorts of issues.

The changes you made belong entirely in the platform's sync, nowhere else. Each platform has its own way of doing timebase synchronization.

sys/powerpc/include/cpufunc.h
158

Why?

sys/powerpc/powerpc/mp_machdep.c
72

Everything in here belongs in platform_powermac.c:powermac_smp_timebase_sync()

sys/powerpc/include/cpufunc.h
158

I just avoided forward declaring two static __inline routines or moving the placement of 2 routines to earlier. Other organizations are fine with me if you have a preference.

sys/powerpc/powerpc/mp_machdep.c
72

That spans both the cpu_mp_unleash content changes and the machdep_ap_bootstrap content changes? If so, I'll have to figure out how that fits for this technique (if it even can).

I organized things where one ap at a time was being brought up, basically no interactions possible across aps.

sys/powerpc/powerpc/mp_machdep.c
72

Just an FYI: I finally remembered that back in 2019-Mar/Apr you had me try a
patch for powermac_smp_timebase_sync but it turned out that the use of the
routine was more complicated/odder/in-more-places and the patch would not
work. I think that this is part of what left the impression with me that
powermac_smp_timebase_sync was a more complicated context to deal with
that I was far less likely to get working and reasonably tested fairly quickly,
based on my limited knowledge. I went a different direction at the time.

I'm not claiming that makes the direction I went the appropriate long-term one,
but my knowledge base may make my moving to powermac_smp_timebase_sync
more complicated and time consuming and I may not be able to as reasonably
test the resulting code.

What we really want to achieve is a better timebase sync mechanism. This can only be achieved by disabling the timebase, rendezvousing, and re-enabling it. Adding a drift into the generic code for a specific platform is not the correct solution.

I do intend to shortly (next few days) clean up the one additional platform_timebase_sync() user (cpu_sleep()) to not use it, and handle it directly, since the code in cpu_sleep() is actually mpc74xx-specific, it's not generic at all. So once that's made explicit, a proper rendezvous sync should be done in platform_powermac.c instead.

You can get very close by mimicking what I do in platform_mpc85xx.c, by using atomics to handle the rendezvous, and doing a multi-stage rendezvous.

sys/powerpc/powerpc/machdep.c
288

It's already initialized by firmware. No need to also initialize it here.

sys/powerpc/powerpc/mp_machdep.c
105

Did you check this? It's plausible that the compiler could optimize out the pointers and treat it as a temporary. In fact, I just checked the asm generated for clock.c, and it does elide memory accesses, leading to the following:

1: mftbu %r4
mftb %r5
mftb %r6
cmplw %r4, %r6
bne 1b

So this is unnecessary.

sys/powerpc/powerpc/machdep.c
288

Okay. I'll drop the mttb(0) when I change things.

sys/powerpc/powerpc/mp_machdep.c
105

I had examples where the code generated had indirections to memory. That is what prompted the specific coding that should guarantee the lack of such.

I do not remember the details of the context from about a year ago. But anything that reduces the optimization level in use could potentially have been an example context, I suppose.

What we really want to achieve is a better timebase sync mechanism. This can only be achieved by disabling the timebase, rendezvousing, and re-enabling it. Adding a drift into the generic code for a specific platform is not the correct solution.

Speaking in terms of the mpc85xx code: At this point, I've no clue how to find the information for setting up various(?) alternatives to the mpc85xx_rcpm_freeze_timebase/mpc85xx_guts_freeze_timebase implementations as far as what each would need to do or how many alternatives would be needed to cover everything. Also, I've only got access to some old dual-socket PowerMac G4s and G5s to confirm any tentative understanding with.

sys/powerpc/powerpc/machdep.c
288

I do have the old Mac OS X Internals book around. It indicates on page 192 that "it is the operating system's responsibility to initialize the TB". But it is not explicit about the BSP vs. the APs where it makes the claim --or about different aspects of initialize. The context used in the book is the 970FX which has 2 major modes of operation for the TB: tied to processor clock frequency vs. tied to rising edge of TBEN. That may be why there is the reference. Apple seems to have used the rising edge of TBEN to avoid variability base on processor frequency by providing a fixed frequency to TBEN and configuring to use that mode.

Speaking in terms of the mpc85xx code: At this point, I've no clue how to find the information for setting up various(?) alternatives to the mpc85xx_rcpm_freeze_timebase/mpc85xx_guts_freeze_timebase implementations as far as what each would need to do or how many alternatives would be needed to cover everything. Also, I've only got access to some old dual-socket PowerMac G4s and G5s to confirm any tentative understanding with.

You can look in Linux's arch/powerpc/platform/powermac/smp.c, and arch/powerpc/kernel/smp-tbsync.c for how Linux does it. Their generic tbsync handshake is similar to yours.

sys/powerpc/powerpc/machdep.c
288

Looking at the xnu-1504.15.3 sources, it restores the TB value via mttb family instructions after hibernate_machine_init() because that could take minutes and timeouts should not fire yet for that OS (according to the comments). It saves the TB value before PE_cpu_machine_quiesce(...) is called. The general cpu_init restores non-zero (64-bit) TB values as well. May be this area is what the book was referring to for the wording that I quoted.

I do not see anything analogous to my mttb(0) use that you commented on.

sys/powerpc/powerpc/machdep.c
288

I looked at xnu and found a hook for plugging in a time_base_enable routine (via a function pointer in a field of a struct), the hook being conditional on the pointer being non-NULL and having a false/true indication parameter for disable vs. enable. But I found nothing ever plugged in for ppc: apparently always NULL. As far as I can tell, they did not disable the timebases for their TB synchronizations, tolerating the resulting variability.

. . .
I do intend to shortly (next few days) clean up the one additional platform_timebase_sync() user (cpu_sleep()) to not use it, and handle it directly, since the code in cpu_sleep() is actually mpc74xx-specific, it's not generic at all. So once that's made explicit, a proper rendezvous sync should be done in platform_powermac.c instead.
. . .

Looking shows 4 uses of platform_smp_timebase_sync in the -r358132 powerpc code from svn:

static void cpu_mp_unleash(void *dummy) does: platform_smp_timebase_sync(ap_timebase, 0)
       void machdep_ap_bootstrap(void)  does: platform_smp_timebase_sync(ap_timebase, 1)
       void cpu_sleep()                 does: platform_smp_timebase_sync(timebase, 0)
       void cpudep_ap_setup()           does: platform_smp_timebase_sync(0, 1)

That last has a comment reporting: /* The following is needed for restoring from sleep. */

Are you also removing the usage from cpudep_ap_setup(), not just the usage from cpu_sleep()?

. . .
You can look in Linux's arch/powerpc/platform/powermac/smp.c, and arch/powerpc/kernel/smp-tbsync.c for how Linux does it. Their generic tbsync handshake is similar to yours.

Taking a quick look, I do not see myself attempting freeze_timebase support for any platforms that I do not have around to test. I do not have much around compared to the variety that exist.

Even for covering freeze_timebase and its use for just the 3 types of 2-socket PowerMacs that I (sometimes) have access to, it seems more likely that I'd first cover not having free_timebase support (the fallback). Then, one by one, I'd add a type-specific freeze_timebase and its use for one type of PowerMac context that I have around (at the time): a sequence of updates instead of one with everything. (One point of my general "fallback" technique is that the math for contexts with end round trip time == start round trip time effectively copies the bsp time reported to the ap over to the ap unchanged. A simple copy operation for contexts known to be that way would be an optimization instead of a fundamental change.)

A question is if the fallback code should be for everything missing freeze_timebase support vs. closer to my limited range of test contexts. If "everything", then I think testing temporary variant(s) of the existing code with the alternate_timebase_sync_style=1 example(s) placed to allow the test context(s) should happen first. There is not much point in a rewrite effort that uses the fallback if the fallback technique proves insufficient to avoid the (temporarily) stuck sleeping issue as generally as required --or is problematical in some other way for contexts that I do not have access to examples of. More likely this specific submittal would be abandoned for such problems, pending an alternate fallback technique being found.

FYI: Both of the "PowerMac11,2 with 2 sockets, 2 cores each" have died of cooling system failures. (I can sometimes boot one, but it is problematical to do much with without overheating.) So, now the only G5 I now have access to for general testing is a "2 socket/1-core-each PowerMac7,2" . I still have access to the two PowerMac3,6 with "2 sockets, 1 core each" G4 examples. (But there are other major FreeBSD kernel issues for 32-bit PowerMacs, even for single socket ones: it gradually clears some user-process pages to zero.)

I no longer have access to any operational old-PowerMacs. (And never had access to any other example types of PowerPC/Power hardware.) As such, I'll no longer be involved in powerpc64 or 32-bit powerpc activities.