Page MenuHomeFreeBSD

amd64: use INVLPGB for kernel pmap invalidations
Needs ReviewPublic

Authored by kib on May 13 2024, 11:34 PM.
Tags
None
Referenced Files
Unknown Object (File)
Wed, Jul 10, 6:08 PM
Unknown Object (File)
Sun, Jul 7, 6:12 AM
Unknown Object (File)
Mon, Jul 1, 12:14 PM
Unknown Object (File)
Sat, Jun 29, 12:18 PM
Unknown Object (File)
Sat, Jun 29, 11:53 AM
Unknown Object (File)
Sat, Jun 29, 10:58 AM
Unknown Object (File)
Thu, Jun 27, 8:54 PM
Unknown Object (File)
Jun 13 2024, 6:30 PM
Subscribers

Details

Reviewers
emaste
alc
markj
Summary

Newest AMD CPUs have the new broadcast invpg instruction INVLPGB. It performs invalidation locally and broadcasts the request for same invalidation to all logical CPUs. I tried to apply it to our pmap, but it is not convenient due to our use of the sliding PCID algorithm.

But I do think that one use of the instruction could be very profitable, esp. on large machines. We can do kernel pmap invalidations without IPIs. These invalidations always use PCID 0, and are always totally broadcast. This seems to be an ideal application.

I suspect that intense kmalloc/malloc(9) loads like ZFS might get significant speedup due to elimination of IPIs.

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

kib requested review of this revision.May 13 2024, 11:34 PM

Not tested at all, machine is not yet available. I want to discuss this before debugging.

sys/amd64/amd64/mp_machdep.c
729

I don't understand the subtraction here.

734

Should the second parameter logically be PMAP_PCID_KERN?

sys/amd64/include/cpufunc.h
540

The parameter names should reflect how they are actually used.

kib marked 2 inline comments as done.May 14 2024, 11:27 PM
kib added inline comments.
sys/amd64/amd64/mp_machdep.c
734

We request the global invalidation, and do not request PCID-specific. So it should not be PMAP_PCID_KERN even formally, I believe.

sys/amd64/include/cpufunc.h
540

The doc calls both rax and edx descriptors, unfortunately. And ecx is not a counter as such.

If you have suggestions I will apply it, for me the direct register names are easiest to match with the documentation.

kib marked an inline comment as done.

Fix cnt calculation for 4k step in invlrng.
Flip default for invlpgb_works to 1.

Correct cnt, one more time.

On amd64, translation invalidation routines don't specify whether intermediate TLB entries need to be invalidated, but on arm64 they do. Now that we have INVLPGB_FIN, I wonder if (in a separate patch certainly) it would be worth augmenting amd64's pmap_invalidate_page() etc. to specify whether intermediate entries need to be invalidated.

sys/amd64/amd64/mp_machdep.c
730

How can cnt == 0 be true? va is not required to be page-aligned?

733

This is the maximum count of additional pages; since we subtract 1 below, from my reading of AMD's spec we can technically use cnt = invlpgb_maxcnt + 1 here.

sys/amd64/amd64/pmap.c
555
sys/amd64/include/cpufunc.h
540

I see, maybe va, desc, and count. But yes, the reader will have to look at invlpgb documentation to fully understand what is going on.

kib marked 3 inline comments as done.May 17 2024, 4:27 PM
kib added inline comments.
sys/amd64/amd64/mp_machdep.c
730

That was left from the previous invalid formula.

733

I initialized maxcnt by the cpuid val + 1.

sys/amd64/amd64/pmap.c
555

Note the above two sysctl descr line styles.

kib marked 3 inline comments as done.

+1 for maxcnt
remove cnt == 0 check

sys/amd64/amd64/mp_machdep.c
739

Although it makes no functional difference, I would contend that NPTEPG is the correct constant here, as it represents the number of PTEs in a page table page referenced by a 2MB PDE.

745

I think multiplication by the same constant that we divided by above makes this easier to understand. I suspect that the compiler will implement this as a shift anyway.

kib marked 2 inline comments as done.

Use NPTEPG

markj added inline comments.
sys/amd64/amd64/mp_machdep.c
733

IMO this is a bit confusing. It would be cleaner to keep the global invlpgb_maxcnt equal to the "official" value, and handle the bias locally.

This revision is now accepted and ready to land.May 20 2024, 2:44 PM
kib marked an inline comment as done.

Use raw maxcnt value

This revision now requires review to proceed.May 20 2024, 4:54 PM
This revision is now accepted and ready to land.May 20 2024, 4:55 PM

Successfully booted an updated version from kib's branch

Jun 25 14:57:48 dut kernel: AMD Extended Feature Extensions ID EBX=0x79bef25f<CLZERO,IRPerf,XSaveErPtr,INVLPGB,RDPRU,BE,WBNOINVD,IBPB,INT_WBINVD,IBRS,STIBP,STIBP_ALWAYSON,PREFER_IBRS,SAMEMODE_IBRS,NOLMSLE,INVLPGBNEST,PPIN,SSBD,CPPC,PSFD,BTC_NO,IBPB_RET>

root@dut:~ # sysctl vm.pmap.invlpgb_works
vm.pmap.invlpgb_works: 1

Fixes from Ed' testing on real hw:

  1. Postpone calculating invlpgb_works until AMD ext features word is initialized
  2. Always truncate VA passed to INVLPGB, the instruction does not accept unaligned addresses
  3. Do not leak pin count
  4. Add comment explaining why TLBSYNC is done on same CPU
This revision now requires review to proceed.Tue, Jun 25, 5:08 PM
sys/amd64/amd64/mp_machdep.c
688

Would it make sense to assert that this is true? I see there's such an assertion in smp_targeted_tlb_shootdown_native, but maybe one here would be good for documentary purposes?

sys/amd64/amd64/mp_machdep.c
688

Yes, indeed. The best place to do that is in fact sched_unpin(), and it is somewhat surprising that there is no such assert already. [The code calls sched_unpin() right after the tlbsync()]

D45736

sys/amd64/amd64/mp_machdep.c
727

Suppose addr1 = 0x800, addr2 = 0x1800. Then I'd expect this routine to invalidate translations for 0x0000 and 0x1000, but we have total = 1.

kib marked 2 inline comments as done.Sun, Jun 30, 6:15 PM
kib added inline comments.
sys/amd64/amd64/mp_machdep.c
727

I suspect that ranged invalidation always get the aligned addresses, but fixed. Thanks.

kib marked an inline comment as done.

Fix loop count calculation for ranged invalidation when start/end are not page aligned.