
amd64 pmap: fix pcid invalidations
ClosedPublic

Authored by kib on Dec 12 2020, 4:58 PM.

Details

Summary

When r362031 moved the local TLB invalidation after the shootdown IPI send, it moved too much. In particular, in PCID mode the clearing of pm_gen must occur before the IPIs are sent, which is in fact what the comment before the seq_cst fence in the invalidation functions describes.

Fix it by extracting the pm_gen clearing into a new helper, pmap_invalidate_preipi(), which is executed before the call to smp_masked_invlpg().

The rest of the local invalidation callbacks are simplified as a result, and become very similar to the remote shootdown handlers (to be merged at some point in the future).

Reported and tested by: mjg

[List of reviewers is identical to that of D25188]

Diff Detail

Repository: rS FreeBSD src repository - subversion
Lint: Skipped
Unit: Tests Skipped
Build Status: Buildable 35400

Event Timeline

kib requested review of this revision. Dec 12 2020, 4:58 PM
cem added inline comments.
sys/amd64/amd64/mp_machdep.c
676–683

If I'm reading this correctly, we no longer need sched_pin because we are pinned on entry in pmap_invalidate_foo (-> smp_masked_invlfoo)? Is that an invariant we should check or document?

(Also: I'm not sure why the mp_ncpus <= 1 check isn't at the top-level. It seems like we can avoid checking the mask entirely if we are running on a UP system and that will be a bit cheaper than checking a bitset vector. If we are not running on a UP system, CPU_ISFULLSET doesn't save us anything compared to CPU_CLR + CPU_EMPTY bitset vector scan. Maybe I am missing something.)

sys/amd64/amd64/pmap.c
2787–2788

Do we need any specific ordering between zeroing local pm_gen and remote CPU pm_gen? Or do we only care that they are ordered relative to the subsequent IPIs?

(This is x86 MD code so maybe we can assume TSO without being explicit about it, but maybe we need a __compiler_membar() to prevent cc from reordering the stores?)

2874–2885

(Not new in this revision; feel free to ignore.)

Is the idea here that the compiler probably inlines pmap_invalidate_page_pcid_cb into these callers with invpcid_works1 optimized out? Or what is the point of IFUNCing them if they're just going to call into a shared function with a runtime branch?

This revision is now accepted and ready to land. Dec 12 2020, 6:49 PM
kib marked 3 inline comments as done. Dec 12 2020, 8:06 PM
kib added inline comments.
sys/amd64/amd64/mp_machdep.c
676–683

Assert for td_pinned added.

WRT checking mp_ncpus <= 1 at the very start, this would be an optimization for the wrong case, IMO. This code is amd64-only, and mp_ncpus is almost always > 1.

It might be that you are right and the if () and its then-branch can be removed outright. But that is a follow-up change. Feel free to propose it.

sys/amd64/amd64/pmap.c
2787–2788

There are no ordering requirements there. We care about ordering with context switches: if a switch activates our pmap, it must either see pm_gen == 0 or get the IPI afterward (context switches run with interrupts disabled). Basically, this is what the comment right above the seq_cst fence explains.

2874–2885

Not just probably, although I have not rechecked. When I initially wrote the code, I verified that the compiler inlined pmap_invalidate_page_pcid_cb() and eliminated the branches based on the static value of invpcid_works1. This was the reason for introducing the argument instead of using invpcid_works directly.

sys/amd64/amd64/mp_machdep.c
676–683

Hm, it seems like we only test ISFULLSET in order to subsequently check mp_ncpus (UP) and proceed to local_cb. I agree it seems unreasonable on amd64 to optimize for UP systems, but the existing code already does the UP check first. Also, I agree that this is a follow-up change, not part of this one.

Removing the first if() case and body should achieve the same goal: on a UP system, a valid mask should contain only PCPU_GET(cpuid), and CPU_EMPTY() will be true after it is cleared. So we avoid the ISFULLSET() check entirely.

sys/amd64/amd64/pmap.c
2787–2788

Great, thanks.

2874–2885

Thanks.

sys/amd64/amd64/mp_machdep.c
637–638

"... by the pmap ..."

638–639

"which" -> "that"

641–642

"... in a critical ..."

642

"... for the remote ..."

kib marked 7 inline comments as done.

Fix grammar in the comment.
Assert pinned in smp_targeted_tlb_shootdown().

This revision now requires review to proceed. Dec 12 2020, 9:33 PM
sys/amd64/amd64/pmap.c
2791

Consider eliminating the earlier "if" statement and instead testing "cpuid != i || pmap != PCPU_GET(curpmap)" here.

2830–2833

I'm confused by this sentence. Previously, this was outside a critical section, and so preemption was possible. Now, however, the CRITICAL_ASSERT below implies that we are in a critical section.

kib marked 4 inline comments as done. Dec 13 2020, 3:02 AM
kib added inline comments.
sys/amd64/amd64/pmap.c
2791

Instead of complicating the expression checked for each CPU, I reset cpuid to -1 so that curcpu is not skipped automatically. Then there is only one place that assigns zero to pm_gen; perhaps this is what you suggested.

kib marked an inline comment as done.

Move CRITICAL_ASSERT() earlier in callbacks; update callback comment about handling of nopti/ucr3_load_mask.
Leave only one place which clears pm_gen.

kib marked an inline comment as done. Dec 13 2020, 3:09 AM
sys/amd64/amd64/pmap.c
2911

What do you think of

  1. renaming pmap_invalidate_*_curcpu_cb() to pmap_invalidate_*_postipi_cb(), and
  2. renaming callees of pmap_invalidate_*_curcpu_cb() to replace _cb with _postipi

?

For example,

pmap_invalidate_range_curcpu_cb() -> pmap_invalidate_range_postipi_cb()

and

pmap_invalidate_range_pcid_invpcid_cb() -> pmap_invalidate_range_pcid_invpcid_postipi()
pmap_invalidate_range_pcid_noinvpcid_cb() -> pmap_invalidate_range_pcid_noinvpcid_postipi()
pmap_invalidate_range_nopcid_cb() -> pmap_invalidate_range_nopcid_postipi()
sys/amd64/amd64/pmap.c
2911

I do not mind renaming. But note that the post-IPI call site is rather arbitrary; it is done this way as an optimization and not due to functional requirements. On the other hand, the fact that the callbacks are called on the caller's CPU is quite important.

sys/amd64/amd64/pmap.c
2791

Yes, that works.

sys/amd64/amd64/pmap.c
2911

But note that the post-IPI call site is rather arbitrary; it is done this way as an optimization and not due to functional requirements.

The suggested rename just makes the code easier to follow in my opinion.

On the other hand, the fact that the callbacks are called on the caller's CPU is quite important.

Additional CRITICAL_ASSERT assertions might make this more clear.

sys/amd64/amd64/pmap.c
2911

Where do you want to see the CRITICAL_ASSERT()s? In each curcpu_cb()?

sys/amd64/amd64/pmap.c
2814–2817

Please consider the possibility of making pmap_invalidate_preipi responsible for performing the sched_pin() as its very first or only action.

2831

I think that it would be clearer to use the word "preemption" here.

2839

I think that "deferred until the return ..." would be clearer.

2911

Regarding the suggested renaming, I agree with Kostik. I would characterize the sched_unpin() as a post-IPI action, but not what the callbacks do.

kib marked 6 inline comments as done. Dec 13 2020, 9:53 PM
kib added inline comments.
sys/amd64/amd64/pmap.c
2911

Ok, but I do not see much use in adding pmap_invalidate_postipi() as a trivial wrapper around sched_unpin().

kib marked an inline comment as done.

More grammar fixes.

Move sched_pin() into pmap_invalidate_preipi(). Make smp_masked_invltlb() do sched_unpin().

sys/amd64/amd64/mp_machdep.c
752–753

I think it's better to call sched_unpin() and critical_exit() in the opposite order: if the thread owes preemption, it will switch away in the critical_exit() call, but it can only be scheduled on curcpu until sched_unpin() is called.

sys/amd64/amd64/pmap.c
2911

Let's drop the proposal then.

I have nothing more to add. Once you've dealt with Mark's latest comment, commit it. I don't see the need for me to look at this change again.

This revision is now accepted and ready to land. Dec 14 2020, 8:03 PM
kib marked 2 inline comments as done.

Unpin before critical exit.

This revision now requires review to proceed. Dec 14 2020, 9:24 PM
This revision is now accepted and ready to land. Dec 14 2020, 9:58 PM
bdrewery added inline comments.
sys/amd64/amd64/pmap.c
3008

@kib From my limited understanding, this change mostly affects user-mapped pages. But I do see that it adds a sched_pin(), *.pm_gen = 0, and atomic_thread_fence_seq_cst before the kernel-space invalidations too. Is there any significance there in terms of fixed functionality for kernel pages?

sys/amd64/amd64/pmap.c
3008

The kernel pmap is formally active on all CPUs and must never be deactivated. In other words, it is never the incoming or outgoing target of pmap_activate{,_sw}(). TLB consistency handling cannot be delayed for the kernel pmap.

If I understand your question right, then:

  • sched_pin() is absolutely required for the correctness of the IPI operation
  • clearing of pm_gen is not important for the correctness of the TLB invalidation operation; it is short-circuited in pmap_pcid_alloc(), see the first check for PMAP_PCID_KERN
  • the seq_cst fence in preipi is probably not important, but I have not analyzed it completely

Avoiding the zeroing of pm_gen for kernel_pmap is probably not worth it. Removing the seq_cst barrier when the invalidation is done for kernel_pmap might be a reasonable but very minor optimization, since the cost of the broadcast IPI makes this additional lock;addl unobservable, I believe, even if it can be removed.