
x86: Implement MWAIT support for stopping a CPU
Closed, Public

Authored by cem on May 1 2019, 11:40 PM.
Details

Summary

IPI_STOP is used after a panic or when ddb is entered manually. This approach
allows CPUs that support the feature to sleep in a low-power state instead of
spinning. It is perhaps especially useful in oversubscribed VM
environments, and is safe to use even if the panic/ddb thread is not the
BSP.

It can be tuned/sysctled with "machdep.stop_mwait."

Submitted by: Anton Rang <rang AT acm.org>
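
For illustration, the core of the change is the wait loop in cpustop_handler(): instead of spinning, a capable CPU arms MONITOR on a per-CPU line and sleeps in MWAIT until the restarting CPU writes that line. A minimal sketch of the shape it takes (identifiers such as use_mwait and stop_state come from the review below; the committed code may differ in detail):

    /* Sketch: stop-side wait loop; names follow the review below. */
    while (!CPU_ISSET(cpu, &started_cpus)) {
        if (use_mwait) {
            cpu_monitor(&mb->stop_state, 0, 0);    /* arm MONITOR on our line */
            if (atomic_load_int(&mb->stop_state) ==
                MONITOR_STOPSTATE_STOPPED)
                cpu_mwait(0, MWAIT_C1);            /* sleep until the line is written */
        } else
            ia32_pause();                          /* fallback: plain spin */
    }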

Test Plan

Unfortunately, bhyve does not implement the MON extensions, and I don't have
another crash box prepared to test this at the moment. However, in hypervisors
that support the MON extensions (at least VMware), it can be verified by
entering ddb on the guest and confirming that only one core spins, instead of N.

Diff Detail

Repository
rS FreeBSD src repository - subversion

Event Timeline

sys/x86/x86/mp_x86.c
1425 ↗(On Diff #56938)

I do not understand this range_can_be_monitored() business. Why don't you use the normal per-CPU pc_monitorbuf for this purpose? To reuse the store to started_cpus as the wakeup signal?

I do not think that adding a write to pc_monitorbuf is complicated.

1432 ↗(On Diff #56938)

Unfortunately, there are CPU bugs where writes do not wake up MWAIT; see cpu_idle_apl31_workaround in cpu_machdep.c. You must not use MWAIT when this is the case.

Thanks!

sys/x86/x86/mp_x86.c
1425 ↗(On Diff #56938)

To reuse the store to started_cpus as the wakeup signal?

Yes, I believe that's the idea. I didn't know about pc_monitorbuf. Your idea would be similar to cpu_idle_mwait / cpu_idle_wakeup?

1432 ↗(On Diff #56938)

Do you think other machines might have such errata in the future (suggesting a more generic name, like broken_mwait)? E.g., it seems like we also disable idle_mwait on Ryzen due to errata.

sys/x86/x86/mp_x86.c
1425 ↗(On Diff #56938)

cpu_idle_mwait() does a lot of unneeded things for the ddb context; e.g., there is no point in ssb/ibrs tweaking. cpu_idle_wakeup() falls back to sending an IPI if the write does not work, which is, again, not suitable for ddb. If you decide that mwait can be used, simply write to each pc_monitorbuf.
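
In code, the suggestion amounts to something like this on the restart side (a sketch, assuming the monitorbuf field names that emerge later in this review):

    /* Sketch: wake each restarted CPU by writing its monitored line. */
    CPU_FOREACH(cpu) {
        if (!CPU_ISSET(cpu, &map))
            continue;                    /* this CPU is not being restarted */
        atomic_store_int(&pcpu_find(cpu)->pc_monitorbuf.stop_state,
            MONITOR_STOPSTATE_RUNNING);  /* the store wakes the MWAITing CPU */
    }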

1432 ↗(On Diff #56938)

I do not mind renaming it, as long as my Apollo Lake continues to boot.

cem planned changes to this revision. May 2 2019, 6:09 PM
cem added inline comments.
sys/x86/x86/mp_x86.c
1425 ↗(On Diff #56938)

Yep, understood. That was just the only existing consumer of pc_monitorbuf I could find.

I guess we should avoid trampling on cpu_idle_mwait's state (e.g., if the user enters DDB outside of panic). Would it make sense to have an MD struct for the pc_monitorbuf layout? I'll make a sketch of the idea.
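
The sketch that came out of this discussion, close to what was ultimately committed for amd64, keeps the two users on separate fields of one padded buffer:

    struct monitorbuf {
        int idle_state;    /* Used by cpu_idle_mwait. */
        int stop_state;    /* Used by cpustop_handler. */
        char padding[128 - (2 * sizeof(int))];
    };
    _Static_assert(sizeof(struct monitorbuf) == 128, "2x cache line size");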

1432 ↗(On Diff #56938)

Ok, perfect.

cem marked 4 inline comments as done.
  • Monitor PCPU pc_monitorbuf rather than started_cpus global
  • Write to pc_monitorbuf during restart_cpus()
  • Detect and do not monitor/mwait on CPUs with mwait errata (Apollo Lake, Ryzen)
sys/kern/subr_smp.c
347 ↗(On Diff #56988)

I suggest finally doing it and splitting the function into x86 and non-x86 versions; perhaps the x86 one should live in cpu_machdep.c.

sys/x86/include/x86_smp.h
71 ↗(On Diff #56988)

I think this should go directly into machine/pcpu.h, and the MD pcpu should provide the typed member for pc_monitorbuf (with the padding added to the structure). It would remove all the casts.
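
That is, instead of a raw byte array plus casts, the MD pcpu.h carries the typed, aligned member directly. A sketch of the idea:

    /* In machine/pcpu.h (sketch; other MD fields elided). */
    #define PCPU_MD_FIELDS                                               \
        struct monitorbuf pc_monitorbuf __aligned(128); /* cache line */ \
        /* ... remaining MD fields unchanged ... */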

sys/x86/x86/cpu_machdep.c
741 ↗(On Diff #56988)

You left the apl31_workaround so that Apollo Lake can still use mwait, right? I believe that the name 'mwait_wakeup_broken' is not telling enough; it only designates mwait as broken for cpustop_handler, so why not call it mwait_cpustop_broken (or any variation)?

sys/x86/x86/mp_x86.c
1414 ↗(On Diff #56988)

(cpu_feature2 & CPUID2_MON) != 0

Thanks!

sys/kern/subr_smp.c
347 ↗(On Diff #56988)

I would prefer not to split this into N MD copies, if possible. Future changes are clearer and easier to apply to a single MI copy than to N MD copies, at least for me, even with some ifdefs.

(I don't know a lot about X86 suspend — are the parts that are X86 today going to be x86-specific forever, or are they potentially shared with any other arch implementing suspend?)

Maybe the monitor wakeup logic could be moved to an MD routine that most archs #define out? Then generic_restart_cpus gains only one line of code, and it needn't be under #ifdef X86. It is plausible that another arch may implement a similar primitive.
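
A hypothetical shape for that idea (the hook name here is invented for illustration; the review ultimately went with an #ifdef X86 split instead):

    /* Hypothetical: archs without a monitor-based stop define this away. */
    #ifndef CPU_STOP_WAKEUP
    #define CPU_STOP_WAKEUP(cpu)    do { } while (0)
    #endif

    /* In generic_restart_cpus(), one line and no #ifdef at the call site: */
    CPU_FOREACH(cpu)
        if (CPU_ISSET(cpu, &map))
            CPU_STOP_WAKEUP(cpu);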

sys/x86/include/x86_smp.h
71 ↗(On Diff #56988)

Sure, that seems reasonable to me. Will do!

sys/x86/x86/cpu_machdep.c
741 ↗(On Diff #56988)

You left the apl31_workaround so that Apollo Lake can still use mwait, right?

Yep.

I believe that the name 'mwait_wakeup_broken' is not telling enough; it only designates mwait as broken for cpustop_handler, so why not call it mwait_cpustop_broken (or any variation)?

I don't feel strongly about the name; I chose "wakeup broken" because that seems to be the broader category than just cpustop. E.g., on Apollo Lake it seems like the wakeup half of mwait is broken, which is why the scheduler needs to wake the MWAITing idle thread with a preemption instead of just relying on the pc_monitorbuf write. So sure, we have a workaround when interrupts are available, but MWAIT wakeup is still unreliable (broken).

I'm happy to change it to the more specific mwait_cpustop_broken if you prefer, though.
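
With the rename, the stop-time decision reduces to a guard along these lines (a sketch using the names settled on here):

    /* Sketch: only use MWAIT if the tunable allows it, the CPU advertises
     * MONITOR/MWAIT, and no known wakeup erratum applies. */
    use_mwait = (stop_mwait && (cpu_feature2 & CPUID2_MON) != 0 &&
        !mwait_cpustop_broken);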

sys/x86/x86/mp_x86.c
1414 ↗(On Diff #56988)

I always hear from JHB that (foo & bitmask) is acceptable for boolean conditions, but I'm happy to make the change.

cem marked 3 inline comments as done.
  • Rename "mwait doesn't wakeup on write" variable as requested
  • Use != 0 for bitwise -> boolean conversion, per style(9)
  • Move monitorbuf structure definition into pcpu pc_monitorbuf definition, keeping 128 bytes total size with padding.
  • Drop casts no longer needed by moving monitorbuf into MD pcpu.h

I suggest committing the definition of struct monitorbuf, without the stop_state member, and the related simplifications now.

sys/amd64/include/pcpu.h
51 ↗(On Diff #57016)

There is no need for a tab after '}'.

Also, I am not sure why declaring a non-anonymous structure nested is better than providing its definition standalone.

sys/kern/subr_smp.c
391 ↗(On Diff #57016)

This is a redundant check; CPU_FOREACH() already does exactly the same test.

347 ↗(On Diff #56988)

It only requires two copies now, not N: one for x86, one for everything else. I even think now that it can be left where it is, bracketed with #if X86 ... #else ... #endif.

In D20135#434059, @kib wrote:

I suggest committing the definition of struct monitorbuf, without the stop_state member, and the related simplifications now.

Ok, will do.

sys/amd64/include/pcpu.h
51 ↗(On Diff #57016)

Will fix the tab.

I don't have any preference about where the structure is defined; I am happy to move it out of the macro.

sys/kern/subr_smp.c
391 ↗(On Diff #57016)

They're slightly different: map contains only the CPUs to be restarted, whereas CPU_FOREACH() iterates over all cores in the system.
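
For context, CPU_FOREACH() as defined in sys/sys/smp.h at the time only filters out absent CPU IDs, so the CPU_ISSET(cpu, &map) test does real work:

    #define CPU_FOREACH(i)                          \
        for ((i) = 0; (i) <= mp_maxid; (i)++)       \
            if (!CPU_ABSENT((i)))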

347 ↗(On Diff #56988)

Ok

This revision was not accepted when it landed; it landed in state Needs Review. May 4 2019, 5:35 PM
This revision was automatically updated to reflect the committed changes.
cem marked an inline comment as done.

Reopening -- the first patch was the non-functional change kib@ suggested should go in first.

cem marked 2 inline comments as done.
  • Rebase patch on r347129
  • Separate generic_restart_cpus into less granular !X86 and X86 versions with more duplication
  • Moved the monitor stop_state STOPPED store earlier in cpustop_handler so that it orders correctly with stopped_cpus in the case that a remote CPU immediately restarts the local CPU after issuing a stop.
sys/kern/subr_smp.c
391 ↗(On Diff #57042)

Why is the MFENCE needed? The only relevant reads are below, in 'wait for each to clear its bit', which are harmless if reordered.

sys/kern/subr_smp.c
391 ↗(On Diff #57042)

wmb() is SFENCE, not MFENCE, no?

I believe I took the idea from the part of cpu_reset() which wakes and transfers the shutdown process to the BSP.

Maybe it can be removed here.
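
For reference, the amd64 machine/atomic.h fence mappings of that era (quoted here for the reader; not part of this diff):

    #define mb()    __asm __volatile("mfence;" : : : "memory")
    #define wmb()   __asm __volatile("sfence;" : : : "memory")
    #define rmb()   __asm __volatile("lfence;" : : : "memory")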

sys/kern/subr_smp.c
391 ↗(On Diff #57042)

Yes, it is SFENCE, which makes it completely useless.

cem marked 2 inline comments as done.

Remove unneeded SFENCE.

I do not see any use of wmb() in cpu_reset() either. Digging into history shows that wmb() was added in r222813 as a replacement for atomic_store_int_rel().

This revision is now accepted and ready to land. May 4 2019, 7:37 PM

Thanks!

In D20135#434160, @kib wrote:

I do not see any use of wmb() in cpu_reset() either. Digging into history shows that wmb() was added in r222813 as a replacement for atomic_store_int_rel().

Initially in r221499 on the project branch, yeah. Maybe the use could be converted into CPUSET_ZERO(); CPUSET_SET_ATOMIC(0, &started_cpus); instead, if an atomic set of the bit is needed. But maybe the atomic write is not needed at all and we can simply drop the wmb() in cpu_reset().

I think it is an unrelated change, so I'm going to leave cpu_reset alone for now, unless you object.

This revision was automatically updated to reflect the committed changes.
In D20135#434163, @cem wrote:

Maybe the use could be converted into CPUSET_ZERO(); CPUSET_SET_ATOMIC(0, &started_cpus); instead, if an atomic set of the bit is needed. But maybe the atomic write is not needed at all and we can simply drop the wmb() in cpu_reset().

I think it is an unrelated change, so I'm going to leave cpu_reset alone for now, unless you object.

You do not need any atomics there. The 2011-era atomic_store_int_rel() did have a CPU barrier, so when the atomic was replaced by a non-atomic SET(), the barrier was added after it to compensate. Later work clarified the semantics of _rel(), and on x86 the ISA already provides ordering guarantees for stores (x86-TSO), so our atomic_store_int_rel() is a plain, barrier-less store. For the same reason, the wmb() there does nothing.