Page MenuHomeFreeBSD

amd64: allow parallel shootdown IPIs

Authored by kib on Jun 29 2020, 9:51 PM.



Stop using smp_ipi_mtx to protect global shootdown state, and move/multiply the global state into pcpu. Now each CPU can initiate shootdown IPI independently from other CPUs. Initiator enters critical section, then fills its local PCPU shootdown info (pc_smp_tlb_XXX), then clears scoreboard generation at location (cpu, my_cpuid) for each target cpu. After that IPI is sent to all targets which scan for zeroed scoreboard generation words. Upon finding such word the shootdown data is read from corresponding cpu' pcpu, and generation is set. Meantime initiator loops waiting for all zeroed generations in scoreboard to update.

Initiator does not disable interrupts, which should allow non-invalidation IPIs from deadlockin, it only needs to disable preemption to pin itself to the instance of the pcpu smp_tlb data.

Handlers loop until they do not see zeroed scoreboard generations. This, together with hardware keeping one pending IPI in LAPIC IRR should prevent lost shootdowns.


  1. The generation is set before the actual invalidation is performed in handler. It is safe because target CPU cannot return to userspace before handler finishes. In principle only NMI can preempt the handler, but NMI would see the kernel handler frame and not touch not-invalidated user page table.
  1. The code does touch LAPIC ICR without exclusion. I believe this is fine because we in fact do not send IPIs from interrupt handlers. More for !x2APIC mode where ICR access for write requires two registers write, we disable interrupts around it. If considered incorrect, I can add per-cpu spinlock around ipi_send().
  1. Scoreboard lines owned by given target CPU can be padded to the cache line, to reduce ping-pong. This is not done in the prototype.

Tested by: pho

Diff Detail

rS FreeBSD src repository - subversion
Automatic diff as part of commit; lint not applicable.
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

kib requested review of this revision.Jun 29 2020, 9:51 PM
kib edited the summary of this revision. (Show Details)
markj added inline comments.
649 ↗(On Diff #73886)

Can we assert that the old value is non-zero?

1017 ↗(On Diff #73886)


You have an anonymous enum for the op type. The type could be named and used for smp_tlb_op in this function, so you'll get a warning if an op type is missed.

1039 ↗(On Diff #73886)

Aren't "orig" and "req" referring to the same thing, the initiating CPU? The terminology should be consistent. Initiator/target seems best to me but I have no strong preference.

1047 ↗(On Diff #73886)

Why two fences?

1048 ↗(On Diff #73886)


This revision is now accepted and ready to land.Jul 6 2020, 6:10 PM
kib marked 5 inline comments as done.Jul 6 2020, 8:27 PM
kib added inline comments.
1047 ↗(On Diff #73886)

I added a comment explaining. It is formal fence, I need to prevent the write to scoreboard to become visible until we read all pc_smp_tlb values. It is formal (or rather, a compiler barrier) because x86 does not allow such reordering.

kib marked an inline comment as done.

use enum in switch
add comments noting why interrupts must be enabled in shootdown initiator, what for are both fences in shootdown handler, and some other minor details.

This revision now requires review to proceed.Jul 6 2020, 8:28 PM

Fix i386 and XEN/amd64 builds.

markj added inline comments.
643 ↗(On Diff #74155)

This can be KASSERT().

This revision is now accepted and ready to land.Jul 8 2020, 1:58 PM
kib marked an inline comment as done.


BTW I initially considered only extracting internal part of smp_targeted_tlb_shootdown into mp_machdep.c<amd64,i386>, i.e. the part between critical_enter/exit and mtx_lock_spin/mtx_unlock_spin, and did that again now after changing the block to KASSERT. But in both times I convinced myself that it is better to not split the function.

This revision now requires review to proceed.Jul 8 2020, 5:44 PM
kib added a subscriber: pho.
175 ↗(On Diff #74206)

"IPI handler for cache and TLB shootdowns"

206 ↗(On Diff #74206)

"... for cache and TLB invalidations."

541–543 ↗(On Diff #74206)

I would suggest a comment saying something along the lines of these variables are initialized at startup to reflect how each of the different kinds of invalidations should be performed on the current machine and environment.

546 ↗(On Diff #74206)

It would be good to incorporate text from the summary that explains the scoreboard's operation.

594 ↗(On Diff #74206)

"Used by the pmap to request cache or TLB invalidation on ..."

596 ↗(On Diff #74206)

It's not clear what "by vector" refers to.

"As an optimization, the curcpu_cb callback ..."

606 ↗(On Diff #74206)

"which" -> "that"

The use of "which" here implies that all IPIs are protected by spinlocks. In contrast, "that" says that you are talking about the subset of IPIs, which is what you intend. Here is a good rule of thumb:

643 ↗(On Diff #74206)

It would be good to incorporate some of the text from the summary that explains the "synchronization" here, e.g., blocking preemption but not interrupts, avoiding deadlock.

664 ↗(On Diff #74206)

It wouldn't hurt to remind people that the IPI itself acts as a fence between writing to the scoreboard above and reading from it below.

676 ↗(On Diff #74206)

Shouldn't this now be spelled IPI_INVLOP on amd64?

719–720 ↗(On Diff #74206)

I speculate that ordering the parameters to smp_targeted_tlb_shootdown() consistently with smp_masked_invlpg_range() with the "op" being last to the latter will eliminate some move instructions.

730 ↗(On Diff #74206)

This blank line is no longer required.

840 ↗(On Diff #74206)

smp_tlb_pmap is not used.

898 ↗(On Diff #74206)

smp_tlb_pmap is not used.

1057 ↗(On Diff #74206)

I was briefly confused by the use of the word "dual" here. Initially, I read it as relating to the fact that there are two acquire fences in this code. I would suggest that you instead say, "This acquire fence and its corresponding release fence in ..."

1072 ↗(On Diff #74206)

Again, I would use "corresponding" here rather than "dual".

1073 ↗(On Diff #74206)

"... in the IPI ..."

1074 ↗(On Diff #74206)

"... before the APIC ..."

1078 ↗(On Diff #74206)

It would be good to explicitly say that it is okay to ack the request before performing the invalidation.

482–486 ↗(On Diff #74206)

See earlier comments under amd64.

600 ↗(On Diff #74206)

This blank line is no longer required.

627–630 ↗(On Diff #74206)

Dead code?

647 ↗(On Diff #74206)


670 ↗(On Diff #74206)


kib marked 24 inline comments as done.

Handle Alan' notes.

Cosmetics: initialize pc_smp_tlb_gen of BSP to 1, same as for AP. It is not too important because gen number is incremented before use.

This revision was not accepted when it landed; it landed in state Needs Review.Jul 14 2020, 8:37 PM
This revision was automatically updated to reflect the committed changes.