amd64: fix INVLPGB range invalidation
Closed, Public

Authored by kevans on Fri, Apr 17, 1:24 AM.

Details

Reviewers

markj
alc
kib

Commits

rG1b8e5c02f5c0: amd64: fix INVLPGB range invalidation

Summary

AMD64 Architecture Programmer's Manual Volume 3 says the following:

ECX[15:0] contains a count of the number of sequential pages to
invalidate in addition to the original virtual address, starting from
the virtual address specified in rAX. A count of 0 invalidates a
single page. ECX[31]=0 indicates to increment the virtual address at
the 4K boundary. ECX[31]=1 indicates to increment the virtual address
at the 2M boundary. The maximum count supported is reported in
CPUID function 8000_0008h, EDX[15:0].

ECX[31] is the bit we call INVLPGB_2M_CNT; setting it tells the
instruction to increment the VA by 2M for each count.
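The ECX layout quoted above can be sketched as a small helper. This is only an illustration: the function name invlpgb_ecx and the EX_* macro names are made up here and are not the kernel's definitions; only the bit positions come from the manual text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative names, not taken from the FreeBSD sources. */
#define EX_INVLPGB_CNT_MASK	UINT32_C(0x0000ffff)	/* ECX[15:0] */
#define EX_INVLPGB_STRIDE_2M	(UINT32_C(1) << 31)	/* ECX[31] */

/*
 * Build the ECX operand: extra_pages is the number of pages to
 * invalidate in addition to the one at rAX, so 0 means a single page.
 * The caller is assumed to have already clamped extra_pages to the
 * maximum count reported by CPUID 8000_0008h, EDX[15:0].
 */
uint32_t
invlpgb_ecx(uint32_t extra_pages, bool stride_2m)
{
	assert(extra_pages <= EX_INVLPGB_CNT_MASK);
	return (extra_pages | (stride_2m ? EX_INVLPGB_STRIDE_2M : 0));
}
```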

This instruction invalidates the TLB entry or entries, regardless of
the page size (4 Kbytes, 2 Mbytes, 4 Mbytes, or 1 Gbyte). [...]

Combined with this, my interpretation of the current code is: if
<va> is aligned on a PDE boundary, we'll use INVLPGB_2M_CNT to try and
invalidate <cnt> PDEs with a single call, but that only works if <va> is
the start of at least <cnt> 2M pages. Otherwise, if <va> or any of the
subsequent PDEs isn't actually a superpage, then we would actually only
invalidate the *first* page within the PDE before skipping to the next
PDE, leaving the remainder of the 4K pages in between as they were.
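To make that interpretation concrete, here is a toy model of which addresses a single INVLPGB call would step through. It encodes the reading described above (the count is a stride over addresses, not a mapping size), which is an interpretation rather than confirmed hardware behavior. The function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EX_PAGE_4K	(UINT64_C(1) << 12)
#define EX_PAGE_2M	(UINT64_C(1) << 21)

/*
 * Record the addresses visited by one call: va, va + stride, ...,
 * for extra + 1 addresses in total.  Returns the number recorded.
 */
size_t
invlpgb_targets(uint64_t va, uint16_t extra, bool stride_2m,
    uint64_t *out, size_t outlen)
{
	uint64_t stride = stride_2m ? EX_PAGE_2M : EX_PAGE_4K;
	size_t n = (size_t)extra + 1;

	assert(n <= outlen);
	for (size_t i = 0; i < n; i++)
		out[i] = va + i * stride;
	return (n);
}

/*
 * If every mapping in a range of len bytes is a 4K page, each visited
 * address invalidates only one of them; the rest keep their entries.
 */
size_t
stale_4k_pages(uint64_t len, size_t ntargets)
{
	return ((size_t)(len / EX_PAGE_4K) - ntargets);
}
```

Under this model, a 2M-stride call covering two PDEs that are mapped with 4K pages visits 2 addresses and leaves the other 1022 4K translations untouched.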

The implication would seem to be that we would need to inspect the range
that we're trying to invalidate if we're planning on using
INVLPGB_2M_CNT at all, so this patch just simplifies it to a series of
4K invalidations. My gut feeling is that we likely still come out on
top vs. the TLB shootdown we're avoiding.
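As a rough sketch of what "a series of 4K invalidations" can look like, the helper below splits a page-aligned range into batches that each fit within the maximum count. It is not the committed change; plan_4k_batches, struct invlpgb_req, and the maxcnt parameter are hypothetical names, and maxcnt is assumed to be the value read from CPUID 8000_0008h, EDX[15:0].

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EX_PAGE_SHIFT_4K	12
#define EX_PAGE_MASK_4K		((UINT64_C(1) << EX_PAGE_SHIFT_4K) - 1)

struct invlpgb_req {
	uint64_t va;	/* first address of the batch (rAX) */
	uint16_t extra;	/* ECX[15:0]: pages beyond the first */
};

/*
 * Plan 4K-stride invalidations for [sva, eva).  Each batch covers at
 * most maxcnt + 1 pages because the count excludes the first page.
 * Returns the number of requests written to reqs[].
 */
size_t
plan_4k_batches(uint64_t sva, uint64_t eva, uint16_t maxcnt,
    struct invlpgb_req *reqs, size_t nreqs)
{
	size_t used = 0;
	uint64_t left;

	assert(sva < eva);
	assert((sva & EX_PAGE_MASK_4K) == 0 && (eva & EX_PAGE_MASK_4K) == 0);

	left = (eva - sva) >> EX_PAGE_SHIFT_4K;
	while (left > 0) {
		uint64_t batch = left;

		if (batch > (uint64_t)maxcnt + 1)
			batch = (uint64_t)maxcnt + 1;
		assert(used < nreqs);
		reqs[used].va = sva;
		reqs[used].extra = (uint16_t)(batch - 1);
		used++;
		sva += batch << EX_PAGE_SHIFT_4K;
		left -= batch;
	}
	return (used);
}
```

With a 4K stride every page in the range is visited regardless of whether it is backed by a 4K or a 2M mapping, at the cost of more instructions for large ranges.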

This seems to explain some issues we've seen lately with fdgrowtable()
and kqueue on recent Zen4/Zen5 EPYC hardware, where we'd experience
corruption that we can't explain.

PR: 293382

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

kevans created this revision. Fri, Apr 17, 1:24 AM

Herald added a subscriber: imp. Fri, Apr 17, 1:24 AM

kevans requested review of this revision. Fri, Apr 17, 1:24 AM

Harbormaster completed remote builds in B72273: Diff 175727. Fri, Apr 17, 1:24 AM

To be clear: I'm definitely not asserting that my interpretation of the language here is correct. In particular, I'm not confident that the second paragraph quoted means that it will only invalidate a 4K page if <va> isn't a superpage; I think the verbiage about 2M increment is clear, albeit a little verbose.

kib accepted this revision. Fri, Apr 17, 2:09 AM

kib added inline comments.

sys/amd64/amd64/mp_machdep.c
Line 743: I think a one-line comment saying that we always do page increments because '....' is due there.

This revision is now accepted and ready to land. Fri, Apr 17, 2:09 AM

kevans added inline comments. Fri, Apr 17, 2:27 AM

sys/amd64/amd64/mp_machdep.c
Line 743: My proposal, to keep it within 80 columns: /* 4K increments because these may not be superpages. */ Presumably that's a good enough hint to look at the commit message if you want more detail.

kib added inline comments. Fri, Apr 17, 2:32 AM

sys/amd64/amd64/mp_machdep.c
Line 743: Sounds good.

I plan to commit this within the next day or so. I think Linux's use of "stride" instead of size or shift in its name for the bit is very telling, and its general range invalidation function seems to specifically just use PTE stride as well.

markj accepted this revision. Mon, Apr 20, 3:00 PM

In D56458#1293651, @kevans wrote:

I'm going to plan to commit this within the next day or so. I think linux's use of "stride" instead of size or shift for their naming of the bit is very telling, and their general range invalidation function seems to specifically just use PTE stride as well.

Yes, the implementation of invlpgb_kernel_range_flush provides compelling evidence.

Closed by commit rG1b8e5c02f5c0: amd64: fix INVLPGB range invalidation (authored by kevans). Mon, Apr 20, 8:18 PM

This revision was automatically updated to reflect the committed changes.

kevans added a commit: rG1b8e5c02f5c0: amd64: fix INVLPGB range invalidation.