
amd64: Ensure that the state of the switched-out thread is fully flushed
ClosedPublic

Authored by kib on Oct 13 2019, 2:04 PM.

Details

Summary

so it is visible to a CPU which might pick up the thread for execution.

On Intel, all writes, including those held in write-combining store buffers, are flushed by a locked operation, so the thread lock taken during the switch should be enough. For some AMD models, the APM is self-contradictory: one place states that a locked operation is enough, but the description of CLFLUSH explicitly says that on models without CLFLUSHOPT only MFENCE would do it.

Diff Detail

Repository
rS FreeBSD src repository
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

kib created this revision.Oct 13 2019, 2:04 PM
cem accepted this revision.Oct 14 2019, 6:10 PM
cem added a subscriber: cem.
This revision is now accepted and ready to land.Oct 14 2019, 6:11 PM
jhb added a subscriber: jhb.Oct 14 2019, 6:17 PM

Does i386 require a similar fix (or do we not care enough about i386 to bother?)

kib added a comment.Oct 14 2019, 6:23 PM
In D22007#481089, @jhb wrote:

Does i386 require a similar fix (or do we not care enough about i386 to bother?)

Of course it does, but there are additional complications, like the absence of SSE2. So I want to reach agreement for amd64 first.

cem added inline comments.Oct 14 2019, 6:26 PM
sys/amd64/amd64/cpu_switch.S
62–63 ↗(On Diff #63210)

Do we need a fence here as well?

kib added inline comments.Oct 14 2019, 6:42 PM
sys/amd64/amd64/cpu_switch.S
62–63 ↗(On Diff #63210)

I do not think so. The called thread context is thrown out; it is the code's responsibility to ensure that all effects are stable before the thread exits.

alc added a comment.Oct 14 2019, 8:15 PM

I'm confused as to why this is necessary. pmap_activate_sw() performs a serializing instruction, specifically, a move to cr3. And, Section 8.2.5 of Volume 3 says,

Program synchronization can also be carried out with serializing instructions (see Section 8.3). These instructions
are typically used at critical procedure or task boundaries to force completion of all previous instructions before a
jump to a new section of code or a context switch occurs. Like the I/O and locking instructions, the processor waits
until all previous instructions have been completed and all buffered writes have been drained to memory before
executing the serializing instruction.

And, the next paragraph discusses the use of fences, e.g., sfence, instead of serializing instructions, e.g., cpuid.

kib added a comment.EditedOct 14 2019, 8:27 PM
In D22007#481114, @alc wrote:

I'm confused as to why this is necessary. pmap_activate_sw() performs a serializing instruction, specifically, a move to cr3. And, Section 8.2.5 of Volume 3 says,

Program synchronization can also be carried out with serializing instructions (see Section 8.3). These instructions
are typically used at critical procedure or task boundaries to force completion of all previous instructions before a
jump to a new section of code or a context switch occurs. Like the I/O and locking instructions, the processor waits
until all previous instructions have been completed and all buffered writes have been drained to memory before
executing the serializing instruction.

And, the next paragraph discusses the use of fences, e.g., sfence, instead of serializing instructions, e.g., cpuid.

If old and new vmspaces are same, then pmap_activate_sw() does nothing.
Would you prefer to have an SFENCE on the oldpmap == pmap check?
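The case described here can be sketched in user-space C. This is only a model under stated assumptions, not the FreeBSD code: struct pmap_model, load_cr3_model() and activate_sw_model() are hypothetical names, the %cr3 reload is represented by a counter, and a portable C11 sequentially consistent fence stands in for whichever fence instruction the kernel would use.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's struct pmap; not the FreeBSD type. */
struct pmap_model {
	int id;
};

static int cr3_loads_model;	/* modeled "mov to %cr3" executions */
static int fences_model;	/* explicit fences issued on the same-pmap path */

/* Models the %cr3 reload, which is a serializing instruction on real hardware. */
static void
load_cr3_model(const struct pmap_model *pmap)
{
	assert(pmap != NULL);
	cr3_loads_model++;
}

/*
 * Models the switch path: when the old and new pmaps are the same, no
 * %cr3 reload happens, so nothing on this path serializes and an
 * explicit fence is issued instead.
 */
static void
activate_sw_model(const struct pmap_model *oldpmap,
    const struct pmap_model *pmap)
{
	if (oldpmap == pmap) {
		atomic_thread_fence(memory_order_seq_cst);
		fences_model++;
		return;
	}
	load_cr3_model(pmap);
}
```

Calling activate_sw_model() with the same pointer twice takes the fence path; calling it with two different pmaps takes the modeled %cr3 reload path and issues no explicit fence.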

alc added a comment.Oct 14 2019, 9:46 PM

Intel SDM states that interrupts and exceptions flush store buffers, but this is said in context of WB memory.

Which section?

cem added a comment.EditedOct 14 2019, 11:42 PM
In D22007#481143, @alc wrote:

Intel SDM states that interrupts and exceptions flush store buffers, but this is said in context of WB memory.

Which section?

Vol. 3A, Chapter 11 "MEMORY CACHE CONTROL" (and specifically for store buffers in §11.10 "STORE BUFFER").

The processor ensures that ... the contents of the store buffer are always drained to memory in the following situations:
• When an exception or interrupt is generated.

But the intro says this:

Write Combining (WC) ... If the WC buffer is partially filled, the writes may be delayed until the next occurrence of a serializing event; such as, an SFENCE or MFENCE instruction, CPUID execution, a read or write to uncached memory, an interrupt occurrence, or a LOCK instruction execution.

The list is oddly specific and does not include 'mov to control registers,' but the "such as" language and cpuid example suggest to me that setting cr3 may be sufficient (if, as Konstantin points out, we actually did so).

(Note that in Table 11-1 "Write Combining buffers" are distinct from ordinary "store buffers;" Core 2, e.g., has 8 WC entries and 20 store buffer entries.)

Also... only stores seem to be reliably ordered for NT / WC operations on modern Intel: see long history of errata for store-load barriers on WC/NT operations here: https://stackoverflow.com/a/50279772 (on Skylake, both lock; add ... and mfence seem to be broken as barriers for store-load ordering of NT/WC stores!).

kib added a comment.Oct 15 2019, 7:11 AM

I remembered why I did not want to put the SFENCE instruction in pmap_activate_sw(). The descriptions of e.g. CLFLUSHOPT (as well as e.g. AMD CLZERO) explicitly state that SFENCE is required; they do not mention serializing instructions.

cem added a comment.Oct 15 2019, 4:40 PM
In D22007#481240, @kib wrote:

I remembered why I did not want to put the SFENCE instruction in pmap_activate_sw(). The descriptions of e.g. CLFLUSHOPT (as well as e.g. AMD CLZERO) explicitly state that SFENCE is required; they do not mention serializing instructions.

Vol 2A, p.3-144:

Executions of the CLFLUSHOPT instruction are ordered with respect to fence instructions and to locked read-modify-write instructions; they are also ordered with respect to the following accesses to the cache line being invalidated: older writes and older executions of CLFLUSH. They are not ordered with respect to writes, executions of CLFLUSH that access other cache lines, or executions of CLFLUSHOPT regardless of cache line; to enforce CLFLUSHOPT ordering with any write, CLFLUSH, or CLFLUSHOPT operation, software can insert an SFENCE instruction between CLFLUSHOPT and that operation.

The "can" language in that last sentence is unclear to me. It seems possible that some other serializing instruction might be adequate, although I agree it is unclear why those others are not spelled out in the list above.

I find the CLWB language a little more precise (p. 3-149):

CLWB instruction is ordered only by store-fencing operations. For example, software can use an SFENCE, MFENCE, XCHG, or LOCK-prefixed instructions to ensure that previous stores are included in the write-back.

Ok, are serializing events not store-fencing operations? Or is Intel just providing examples of performant ways to order these operations rather than an exhaustive list?

Unfortunately, my NDA AMD docs are only for first-gen Zen which does not provide any CLFLUSHOPT or CLWB, so I only have the newer public docs for Zen 2.

The (public) #56305 "Software Optimization Guide for AMD Family 17h Model 30h and Greater" document gives a table of events that "complete" a pending write-combining cache line, including: IO r/w; any serializing instruction, including MOV CRx; Locks; UC reads; buffer full; sfence/mfence; and interrupts/exceptions.

I guess that's orthogonal from store buffers, and again that doc doesn't have CLFLUSHOPT or CLWB (or CLZERO).

AMD's "AMD64 Architecture Programmer’s Manual Volume 1: Application Programming" (#24592) has a §3.9.2 "Forcing Memory Order" which describes the FENCE instructions:

Special instructions are provided for application software to force memory ordering in situations where such ordering is important. These instructions are:
...
Although they serve different purposes, other instructions can be used as read/write barriers when the order of memory accesses must be strictly enforced. These read/write barrier instructions force all prior reads and writes to complete before subsequent reads or writes are executed. Unlike the fence instructions listed above, these other instructions alter the software-visible state. This makes these instructions less general and more difficult to use as read/write barriers than the fence instructions, although their use may reduce the total number of instructions executed. The following instructions are usable as read/write barriers:

  • Serializing instructions—Serializing instructions force the processor to commit the serializing instruction and all previous instructions, then restart instruction fetching at the next instruction. ...

And the "Vol 2: System Programming" (#24593) enumerates those instructions in §7.6.4, "Serializing instructions:"

Serializing instructions can be used as a barrier between memory accesses to force strong ordering of memory operations. Care should be exercised in using serializing instructions because they modify processor state and may affect program flow. The instructions also force execution serialization, which can significantly degrade performance. When strongly-ordered memory accesses are required, but execution serialization is not, it is recommended that software use the memory-ordering instructions described on page 185.

(p. 185 describes the FENCE instructions.)

The following are serializing instructions:
...

  • Privileged Instructions
    • MOV CRn

AMD doesn't have any public documentation of CLFLUSHOPT that I can find, nor any documentation of how CLWB is ordered w.r.t. other instructions (it is only briefly mentioned as a non-invalidating CLFLUSH); maybe the above implies it.

alc added a comment.Oct 15 2019, 4:59 PM
In D22007#481175, @cem wrote:
In D22007#481143, @alc wrote:

Intel SDM states that interrupts and exceptions flush store buffers, but this is said in context of WB memory.

Which section?

Vol. 3A, Chapter 11 "MEMORY CACHE CONTROL" (and specifically for store buffers in §11.10 "STORE BUFFER").

The processor ensures that ... the contents of the store buffer are always drained to memory in the following situations:
• When an exception or interrupt is generated.

But the intro says this:

Write Combining (WC) ... If the WC buffer is partially filled, the writes may be delayed until the next occurrence of a serializing event; such as, an SFENCE or MFENCE instruction, CPUID execution, a read or write to uncached memory, an interrupt occurrence, or a LOCK instruction execution.

The list is oddly specific and does not include 'mov to control registers,' but the "such as" language and cpuid example suggest to me that setting cr3 may be sufficient (if, as Konstantin points out, we actually did so).

I did a search for the word "serializing" in Volume 3, and this is the first occurrence of the phrase "serializing event", so the author is trying to define its meaning here. And, the author's use of "such as" implies that these are not the only serializing events. Setting aside the mention of interrupts, I think that the author's objective in constructing the list was to give an example from each category of instruction that earlier sections, like Section 8.2.5, "Strengthening or Weakening the Memory-Ordering Model", define as affecting memory ordering.

In other words, I think that the intent here is to say that all serializing instructions are serializing events.

In regards to interrupts, I would have been surprised if interrupts were not serializing events, especially given that iret is a serializing instruction.

(Note that in Table 11-1 "Write Combining buffers" are distinct from ordinary "store buffers;" Core 2, e.g., has 8 WC entries and 20 store buffer entries.)

I would observe two things about this table: (1) some of the entries are wrong or out-of-date, e.g., the STLB size, and (2) there is no mention of "write combining buffers" for the recent micro-architectures.

Also... only stores seem to be reliably ordered for NT / WC operations on modern Intel: see long history of errata for store-load barriers on WC/NT operations here: https://stackoverflow.com/a/50279772 (on Skylake, both lock; add ... and mfence seem to be broken as barriers for store-load ordering of NT/WC stores!).

I only skimmed this. Are there any errata concerning serializing instructions (as opposed to locked instructions and fences) failing to provide memory ordering?

alc added a comment.Oct 15 2019, 5:15 PM
In D22007#481354, @cem wrote:
In D22007#481240, @kib wrote:

I remembered why I did not want to put the SFENCE instruction in pmap_activate_sw(). The descriptions of e.g. CLFLUSHOPT (as well as e.g. AMD CLZERO) explicitly state that SFENCE is required; they do not mention serializing instructions.

Vol 2A, p.3-144:

Executions of the CLFLUSHOPT instruction are ordered with respect to fence instructions and to locked read-modify-write instructions; they are also ordered with respect to the following accesses to the cache line being invalidated: older writes and older executions of CLFLUSH. They are not ordered with respect to writes, executions of CLFLUSH that access other cache lines, or executions of CLFLUSHOPT regardless of cache line; to enforce CLFLUSHOPT ordering with any write, CLFLUSH, or CLFLUSHOPT operation, software can insert an SFENCE instruction between CLFLUSHOPT and that operation.

The "can" language in that last sentence is unclear to me. It seems possible that some other serializing instruction might be adequate, although I agree it is unclear why those others are not spelled out in the list above.
I find the CLWB language a little more precise (p. 3-149):

CLWB instruction is ordered only by store-fencing operations. For example, software can use an SFENCE, MFENCE, XCHG, or LOCK-prefixed instructions to ensure that previous stores are included in the write-back.

Ok, are serializing events not store-fencing operations? Or is Intel just providing examples of performant ways to order these operations rather than an exhaustive list?

I think it is the latter. I think that Section 8.2.5 makes that point:

Note that the SFENCE, LFENCE, and MFENCE instructions provide a more efficient method of controlling memory
ordering than the CPUID instruction.

I believe that cpuid keeps appearing in snippets of text that we have discussed not because it is the only serializing instruction that orders memory but because it can be used more often than the others. It is not privileged, and it doesn't change the control flow.

kib added a comment.Oct 15 2019, 6:37 PM

BTW, I did try to find a reference in the SDM vol. 3 that would definitively confirm that interrupts and exceptions are serializing, and failed. I know for sure that sysenter is not serializing.

Also, the CLFLUSHOPT language about locked RMW operations syncing with the instruction is somewhat new to me; I did not remember it. It might be that Intel changed the definition in this regard. Intel definitely changed the definition of CLFLUSH, which was redefined as ordered the same as normal writes, and CLFLUSHOPT was invented.

alc added a comment.Oct 15 2019, 7:22 PM
In D22007#481438, @kib wrote:

BTW, I did try to find a reference in the SDM vol. 3 that would definitively confirm that interrupts and exceptions are serializing, and failed. I know for sure that sysenter is not serializing.

If an application performs a system call in the middle of, for example, a sequence of non-temporal stores, without having performed an sfence first, I would say that the application is broken. :-)

kib added a comment.Oct 15 2019, 7:56 PM
In D22007#481450, @alc wrote:
In D22007#481438, @kib wrote:

BTW, I did try to find a reference in the SDM vol. 3 that would definitively confirm that interrupts and exceptions are serializing, and failed. I know for sure that sysenter is not serializing.

If an application performs a system call in the middle of, for example, a sequence of non-temporal stores, without having performed an sfence first, I would say that the application is broken. :-)

Maybe, I do not object. My note about sysenter was a reaction to the attempt to enumerate serializing instructions, and to the incomplete attempt to reason about them vs. syncing points for CLFLUSH{OPT}. I actually needed to know this for MDS handling (see some recent Intel errata).

kib added a comment.Oct 16 2019, 7:17 AM

So, can we reach some conclusion here, please? I see three action items:

  1. Addition of SFENCE. I now tend to think that the SFENCE should be moved to the else part of 'oldpmap == pmap' in pmap_activate_sw(). Intel has changed its description several times, so I believe this is the safest way.
  2. From the discussion, I believe that the SFENCEs which brace CLFLUSH{OPT} on Intel could be replaced by a locked atomic, i.e. atomic_thread_fence_seq_cst().
  3. Additionally, I will try to submit a query about this through the FF/Intel technical contact.
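To make the second item concrete, here is a minimal user-space sketch of a flush loop braced by full fences instead of SFENCE. It is a model under assumptions rather than the patch itself: flush_line_model() only counts calls in place of executing CLFLUSHOPT, the 64-byte line size is assumed, and the C11 atomic_thread_fence(memory_order_seq_cst) stands in for the kernel's atomic_thread_fence_seq_cst().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_LINE_MODEL 64u	/* assumed cache line size */

static size_t lines_flushed_model;

/* Stands in for one CLFLUSHOPT; it only counts, it flushes nothing. */
static void
flush_line_model(uintptr_t va)
{
	(void)va;
	lines_flushed_model++;
}

static void
flush_range_model(uintptr_t sva, uintptr_t eva)
{
	assert(sva <= eva);
	/* Round the start down to a cache line boundary. */
	sva &= ~(uintptr_t)(CACHE_LINE_MODEL - 1);

	/* Order earlier stores before the flushes. */
	atomic_thread_fence(memory_order_seq_cst);
	for (; sva < eva; sva += CACHE_LINE_MODEL)
		flush_line_model(sva);
	/* Order the flushes before anything that follows. */
	atomic_thread_fence(memory_order_seq_cst);
}
```

A range that starts in the middle of a line is rounded down, so flushing [0x1010, 0x10c1) touches four lines in this model.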
cem added a comment.Oct 16 2019, 1:23 PM
  1. Agree
  2. To the extent we support CLWB/CLFLUSH (no OPT) on AMD, I think we can use locked instructions to fence on that platform as well.
  3. I am supportive of this but it isn’t a hard requirement.
kib updated this revision to Diff 63363.Oct 16 2019, 2:15 PM

Rework the patch as discussed above, replacing most of the sfence uses with a locked op.

I put an mfence in the oldpmap == pmap case of pmap_activate_sw() to handle both AMD and Intel requirements.

This revision now requires review to proceed.Oct 16 2019, 2:15 PM
cem accepted this revision.Oct 16 2019, 3:07 PM
cem added inline comments.
sys/amd64/amd64/pmap.c
3064–3066 ↗(On Diff #63363)

This is extra information in answer to my own question; I'm not requesting any change here.

The AMD APM vol 3, rev 3.28 is slightly nuanced (p.139).

mfence() sandwich as-used today is required IFF the CPU does not support CLFLUSHOPT.

If the CPU does support CLFLUSHOPT, CLFLUSH is ordered w.r.t. locked ops, fence instructions other than mfence; as well as same-cacheline {clflushopt, clflush, and writes}.

So, the new logic looks correct to me — on future AMD models that support CLFLUSHOPT and have stronger CLFLUSH semantics, we'll just use CLFLUSHOPT anyway due to our existing preference, and the faster locked primitive is adequate.

If there is some theoretical reason we might set useclflushopt=false on a platform with the cpuid bit set, then it might make sense to optimize the fencing on CLFLUSH. But I do not know any reason we would do that.
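The reading of the AMD APM given in this comment can be written down as a small decision function. This is a sketch of that reading only, with hypothetical names (plan_flush_amd_model() and the two enums are not kernel identifiers), and it does not cover Intel.

```c
#include <assert.h>
#include <stdbool.h>

enum flush_insn_model { INSN_CLFLUSH, INSN_CLFLUSHOPT };
enum fence_model { FENCE_LOCKED_OP, FENCE_MFENCE };

struct flush_plan_model {
	enum flush_insn_model insn;
	enum fence_model fence;
};

/*
 * AMD-only model: with CLFLUSHOPT available, prefer it and brace it with
 * a locked operation; without it, fall back to CLFLUSH with an MFENCE
 * sandwich.
 */
static struct flush_plan_model
plan_flush_amd_model(bool has_clflushopt)
{
	struct flush_plan_model plan;

	if (has_clflushopt) {
		plan.insn = INSN_CLFLUSHOPT;
		plan.fence = FENCE_LOCKED_OP;
	} else {
		plan.insn = INSN_CLFLUSH;
		plan.fence = FENCE_MFENCE;
	}
	/* CLFLUSH is never paired with the weaker fence in this model. */
	assert(plan.insn == INSN_CLFLUSHOPT || plan.fence == FENCE_MFENCE);
	return (plan);
}
```

The combination that the last paragraph calls theoretical, CLFLUSH on a CPU that advertises CLFLUSHOPT, is deliberately not representable here.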

3079–3080 ↗(On Diff #63363)

Again, not a request for any change.

I'm not sure the strong mfence is actually needed afterwards on AMD. The language (for both clflush and clflushopt in APM 3.28) is confusing to me:

Speculative loads initiated by the processor, or specified explicitly using cache-prefetch instructions, can be reordered around a CLFLUSH instruction. Such reordering can invalidate a speculatively prefetched cache line, unintentionally defeating the prefetch operation. The only way to avoid this situation is to use the MFENCE instruction after the CLFLUSH instruction to force strong-ordering of the CLFLUSH instruction with respect to subsequent memory operations.

(CLFLUSHOPT language is identical):

Speculative loads initiated by the processor, or specified explicitly using cache-prefetch instructions, can be reordered around a CLFLUSHOPT instruction. Such reordering can invalidate a speculatively prefetched cache line, unintentionally defeating the prefetch operation. The only way to avoid this situation is to use the MFENCE instruction after the CLFLUSHOPT instruction to force strong ordering of the CLFLUSHOPT instruction with respect to subsequent memory operations.

An invalidated prefetched cacheline doesn't sound like a correctness problem to me. On the other hand, the CLFLUSH section does explicitly say that non-CLFLUSHOPT models do not order CLFLUSH against LFENCE, SFENCE, or serializing instructions. So I'm not sure what cheaper store-store barrier would be safe. Maybe none.

3102 ↗(On Diff #63363)

AMD APM Vol 3 3.28 language for CLWB is:

The CLWB instruction is weakly ordered with respect to other instructions that operate on memory. … To create strict ordering of CLWB use a store-ordering instruction such as SFENCE.

The "such as" suggests SFENCE is not the only option, but I'm not sure on the semantics of "store-ordering instructions." It is the only use of the term in the document. Do locked instructions count as "store-ordering?"

(It's not relevant to CLWB, but sort of similar: the CLZERO language says something slightly different:)

CLZERO is weakly-ordered with respect to other instructions that operate on memory. Software should use an SFENCE or stronger to enforce memory ordering of CLZERO with respect to other store instructions.

3135–3136 ↗(On Diff #63363)

static inline variants that take a bool barrier could be added.
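One possible shape for that suggestion, as a user-space sketch: a single static inline worker takes the bool, and two thin wrappers pass a constant so the compiler can fold the branches away. All names are hypothetical, the cache-line write-back is modeled by a counter, and the C11 fence stands in for the kernel primitive.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_SIZE_MODEL 64u	/* assumed cache line size */

static size_t wb_lines_model;	/* modeled per-line write-backs */
static size_t wb_fences_model;	/* fences issued by the worker */

static inline void
wb_range_impl_model(uintptr_t sva, uintptr_t eva, bool barrier)
{
	assert(sva <= eva);
	if (barrier) {
		atomic_thread_fence(memory_order_seq_cst);
		wb_fences_model++;
	}
	for (; sva < eva; sva += LINE_SIZE_MODEL)
		wb_lines_model++;	/* stands in for one CLWB/CLFLUSHOPT */
	if (barrier) {
		atomic_thread_fence(memory_order_seq_cst);
		wb_fences_model++;
	}
}

/* Variant for callers that need the fences. */
static inline void
wb_range_fenced_model(uintptr_t sva, uintptr_t eva)
{
	wb_range_impl_model(sva, eva, true);
}

/* Variant for callers that already provide their own ordering. */
static inline void
wb_range_nofence_model(uintptr_t sva, uintptr_t eva)
{
	wb_range_impl_model(sva, eva, false);
}
```

Because barrier is a compile-time constant in each wrapper, the unfenced variant should cost nothing extra after inlining, which is the usual reason for this pattern.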

9408–9415 ↗(On Diff #63363)

Orthogonal to this revision, but I think this logic may be sort of incorrect and should more closely mirror the pmap_large_map_flush_range ifunc selection. (Condition first on features, then vendor if we must.) mfence is stronger than needed in at least some cases of AMD CPU. Or may be needed on one side but not the other.

This revision is now accepted and ready to land.Oct 16 2019, 3:07 PM
jhb accepted this revision.Oct 16 2019, 4:15 PM

My only request is to perhaps do the mfence in pmap_activate_sw as a separate commit from the sfence -> atomic changes if you weren't already planning to do so.

emaste added a subscriber: emaste.Oct 17 2019, 5:12 PM
scottph added a subscriber: scottph.Fri, Nov 1, 7:43 PM
In D22007#481175, @cem wrote:

Write Combining (WC) ... If the WC buffer is partially filled, the writes may be delayed until the next occurrence of a serializing event; such as, an SFENCE or MFENCE instruction, CPUID execution, a read or write to uncached memory, an interrupt occurrence, or a LOCK instruction execution.

The list is oddly specific and does not include 'mov to control registers,' but the "such as" language and cpuid example suggest to me that setting cr3 may be sufficient (if, as Konstantin points out, we actually did so).

In that list, CPUID execution is meant as a representative for execution of any serializing instruction. A future update to the SDM should clarify this.

sys/amd64/amd64/pmap.c
3019 ↗(On Diff #63363)

What is the ordering that we're establishing here? clflushopt is ordered with respect to earlier writes to the same cacheline.

3099 ↗(On Diff #63363)

isn't it the case that the writes we want to flush have either happened on the same processor, so CLWB is implicitly ordered with them, or they have happened on another processor and we have since migrated, and the ordering has been established by the thread lock?

8699 ↗(On Diff #63363)

won't the thread lock already establish the ordering that we want here?

kib added inline comments.Sun, Nov 3, 5:37 PM
sys/amd64/amd64/pmap.c
3019 ↗(On Diff #63363)

Right, only to the same cacheline. We want this op to follow normal TSO rules of x86.

3099 ↗(On Diff #63363)

Again, we want this flush to be TSO-consistent with older writes. I suspect this is especially important here, because the function was added for, and is used with, non-coherent hardware.

8699 ↗(On Diff #63363)

Not on AMD.

scottph added inline comments.Mon, Nov 4, 10:48 PM
sys/amd64/amd64/pmap.c
3099 ↗(On Diff #63363)

What I mean is: which earlier stores done by this logical processor and not in the range being flushed are important to be globally visible before cache flushing begins? The fence after makes sense to me (don't go on to tell somebody outside the coherence domain about data in memory until that data is actually in memory), but I don't yet see the objective of the fence before.

8699 ↗(On Diff #63363)

Here's the execution trace I'm considering; tell me if it's wrong or if there's something I'm missing:

  • we're in pmap_invalidate_cache_pages() or pmap_flush_cache_range(), looping over CLFLUSHOPTs or CLWBs
  • we get interrupted somehow and will be preempted.
    • it seems there are a few ways for that to happen, but all of them include thread_lock(curthread) which will execute LOCK CMPXCHG.
    • This means all earlier writes to the memory we want to cache-flush are globally ordered before the acquisition of the lock. (For AMD, APM vol 2, sec 7.4.2, table 7-3 shows loads and stores to all memory types will be globally ordered before the locked cmpxchg.)
    • Also in AMD's APM: "CLFLUSHOPT is ordered with respect to fence instructions and locked operations"
    • "To create strict ordering of CLWB use a store-ordering instruction such as SFENCE" (vol 2, sec 7.4.2 shows that locked cmpxchg is store-ordering).
  • the thread gets migrated to another cpu.
    • here again thread_lock() will LOCK CMPXCHG, and this lock will be globally ordered after the previous one (by causality; migration to another cpu can't happen "before" preemption), and so too after the stores we want to cache-flush.
  • we execute our next CLFLUSHOPT or CLWB
    • The right data we want to flush out must be visible, transitively through the locked cmpxchg operations, back to the original stores.
kib added inline comments.Fri, Nov 8, 6:37 PM
sys/amd64/amd64/pmap.c
3099 ↗(On Diff #63363)

The fence before the flushing ensures that the writes clearing the cache line are properly ordered, including the writes before the flush.

In fact, please look at Intel's software optimization manual rev. 042b section 8.4.7, esp. example 8.2.

8699 ↗(On Diff #63363)

Unfortunately, the AMD manual is self-contradictory. Please look at the description of the CLFLUSH instruction in vol. 3, specifically the paragraph explaining CLFLUSH ordering for CPUs which do not implement CLFLUSHOPT.

kib reopened this revision.Sun, Nov 10, 9:49 AM
kib updated this revision to Diff 64143.

The rest of the change; I now limit MFENCE to non-Intel vendors only.

kib edited the summary of this revision.Sun, Nov 10, 9:52 AM
scottph accepted this revision.Mon, Nov 11, 6:23 PM
scottph added inline comments.
sys/amd64/amd64/pmap.c
8699 ↗(On Diff #63363)

I see, you're looking at:

The CLFLUSH instruction may also take effect on a cache line
while stores from previous store instructions are still pending
in the store buffer. To ensure that such stores are included in
the cache line that is flushed, use an MFENCE instruction ahead
of the CLFLUSH instruction. Such stores would otherwise cause
the line to be re-cached and modified after the CLFLUSH
completed. The LFENCE, SFENCE, and serializing instructions are
not ordered with respect to CLFLUSH.

I think the case we're discussing probably still can't cause this CLFLUSH to miss those earlier stores, because of the migration. That is, even though this CLFLUSH will pass store-ordering instructions, those stores from the other processor's store buffer have to have been flushed so that the new processor is able to see that the thread is available for migration.

But here I don't think we can build a fully airtight case for that because this CLFLUSH is so weakly ordered. As specified, it's free to run back in time as far as the last mfence, which... who knows. So I doubt this could happen in practice, but as specified it can.

This revision is now accepted and ready to land.Mon, Nov 11, 6:23 PM