Details

Reviewers

emaste
alc
markj

Commits

rG47656cc1ef1c: amd64: use INVLPGB for kernel pmap invalidations
rGbc4ffcadf268: amd64: add variables indicating INVLPGB works
rG111c7fc2fe21: amd64: add convenience wrappers for INVLPGB and TBLSYNC

Summary

Newest AMD CPUs have the new broadcast invpg instruction INVLPGB. It performs invalidation locally and broadcasts the request for same invalidation to all logical CPUs. I tried to apply it to our pmap, but it is not convenient due to our use of the sliding PCID algorithm.

But I do think that one use of the instruction could be very profitable, esp. on large machines. We can do kernel pmap invalidations without IPIs. These invalidations always use PCID 0, and are always totally broadcast. This seems to be an ideal application.

I suspect that intense kmalloc/malloc(9) loads like ZFS might get significant speedup due to elimination of IPIs.

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

kib created this revision.May 13 2024, 11:34 PM

Herald added a subscriber: imp. · View Herald TranscriptMay 13 2024, 11:34 PM

Not tested at all, machine is not yet available. I want to discuss this before debugging.

kib added a parent revision: D45188: AMD CPUs: update bits and data from CPUID 0x8000_0008.May 13 2024, 11:36 PM

markj added inline comments.May 14 2024, 4:34 PM

sys/amd64/amd64/mp_machdep.c
735 ↗	(On Diff #138510)	I don't understand the subtraction here.
740 ↗	(On Diff #138510)	Should the second parameter logically be PMAP_PCID_KERN?
sys/amd64/include/cpufunc.h
540	The parameter names should reflect how they are actually used.

kib marked 2 inline comments as done.May 14 2024, 11:27 PM

kib added inline comments.

sys/amd64/amd64/mp_machdep.c
740 ↗	(On Diff #138510)	We request the global invalidation, and do not request PCID-specific. So it should not be PMAP_PCID_KERN even formally, I believe.
sys/amd64/include/cpufunc.h
540	The doc calls both rax and edx descriptors, unfortunately. And ecx is not a counter as such. If you have suggestions I will apply it, for me the direct register names are easiest to match with the documentation.

Fix cnt calculation for 4k step in invlrng.
Flip default for invlpgb_works to 1.

Correct cnt, one more time.

On amd64, translation invalidation routines don't specify whether intermediate TLB entries need to be invalidated, but on arm64 they do. Now that we have INVLPGB_FIN, I wonder if (in a separate patch certainly) it would be worth augmenting amd64's pmap_invalidate_page() etc. to specify whether intermediate entries need to be invalidated.

sys/amd64/amd64/mp_machdep.c
736 ↗	(On Diff #138560)	How can `cnt == 0` be true? `va` is not required to be page-aligned?
739 ↗	(On Diff #138560)	This is the maximum count of additional pages; since we subtract 1 below, from my reading of AMD's spec we can technically use `cnt = invlpgb_maxcnt + 1` here.
sys/amd64/amd64/pmap.c
555 ↗	(On Diff #138560)
sys/amd64/include/cpufunc.h
540	I see, maybe `va`, `desc`, and `count`. But yes, the reader will have to look at invlpgb documentation to fully understand what is going on.

kib marked 3 inline comments as done.May 17 2024, 4:27 PM

kib added inline comments.

sys/amd64/amd64/mp_machdep.c
736 ↗	(On Diff #138560)	That was left from the previous invalid formula.
739 ↗	(On Diff #138560)	I initialized maxcnt by the cpuid val + 1.
sys/amd64/amd64/pmap.c
555 ↗	(On Diff #138560)	Note the above two sysctl descr line styles.

+1 for maxcnt
remove cnt == 0 check

alc added inline comments.May 17 2024, 7:50 PM

sys/amd64/amd64/mp_machdep.c
745 ↗	(On Diff #138660)	Although it makes no functional difference, I would contend that NPTEPG is the correct constant here, as it represents the number of PTEs in a page table page referenced by a 2MB PDE.
751 ↗	(On Diff #138660)	I think multiplication by the same constant that we divided by above makes this easier to understand. I suspect that the compiler will implement this as a shift anyway.

Use NPTEPG

markj accepted this revision.May 20 2024, 2:44 PM

markj added inline comments.

sys/amd64/amd64/mp_machdep.c
739 ↗	(On Diff #138560)	IMO this is a bit confusing. It would be cleaner to keep the global `invlpgb_maxcnt` equal to the "official" value, and handle the bias locally.

This revision is now accepted and ready to land.May 20 2024, 2:44 PM

Use raw maxcnt value

This revision now requires review to proceed.May 20 2024, 4:54 PM

markj accepted this revision.May 20 2024, 4:55 PM

This revision is now accepted and ready to land.May 20 2024, 4:55 PM

alc accepted this revision.May 20 2024, 6:14 PM

Successfully booted an updated version from kib's branch

Jun 25 14:57:48 dut kernel: AMD Extended Feature Extensions ID EBX=0x79bef25f<CLZERO,IRPerf,XSaveErPtr,INVLPGB,RDPRU,BE,WBNOINVD,IBPB,INT_WBINVD,IBRS,STIBP,STIBP_ALWAYSON,PREFER_IBRS,SAMEMODE_IBRS,NOLMSLE,INVLPGBNEST,PPIN,SSBD,CPPC,PSFD,BTC_NO,IBPB_RET>

root@dut:~ # sysctl vm.pmap.invlpgb_works
vm.pmap.invlpgb_works: 1

Fixes from Ed' testing on real hw:

Postpone calculating invlpgb_works until AMD ext features word is initialized
Always truncate VA passed to INVLPGB, the instruction does not accept unaligned addresses
Do not leak pin count
Add comment explaining why TLBSYNC is done on same CPU

This revision now requires review to proceed.Jun 25 2024, 5:08 PM

emaste added inline comments.Jun 25 2024, 6:44 PM

sys/amd64/amd64/mp_machdep.c
688 ↗	(On Diff #140230)	Would it make sense to assert that this is true? I see there's such an assertion in smp_targeted_tlb_shootdown_native, but maybe one here would be good for documentary purposes?

kib added inline comments.Jun 25 2024, 8:11 PM

sys/amd64/amd64/mp_machdep.c
688 ↗	(On Diff #140230)	Yes, indeed. The best place to do that is in fact sched_unpin(), and it is somewhat surprising that there is no such assert already. [The code calls sched_unpin() right after the tlbsync()] D45736

markj added inline comments.Jun 30 2024, 3:42 PM

sys/amd64/amd64/mp_machdep.c
727 ↗	(On Diff #140230)	Suppose `addr1 = 0x800`, `addr2 = 0x1800`. Then I'd expect this routine to invalidate translations for `0x0000 and 0x1000`, but we have `total = 1`.