
pmap_invalidate_range: For very large ranges, flush the whole TLB
Closed, Public

Authored by cem on Nov 26 2015, 5:28 AM.

Details

Summary

Typical TLBs have 40-512 entries available. At some point, iterating
every single page in a requested invalidation range and issuing invlpg
on it is more expensive than flushing the TLB and allowing it to reload
on demand.

I've arbitrarily chosen 128 MB of KVA as the heuristic threshold at
which we flush the TLB rather than invalidating every single potential
page. Any 128 MB range requires 32,768 TLB invalidations (128 MB / 4 KB
pages). This roughly corresponds to the number of cache lines we're
willing to flush individually when flushing a range, before we dump the
entire cache.
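
The arithmetic in the summary can be sketched as a small userspace check. This is only an illustration, not the committed patch: the names `sketch_invlpg_count`, `sketch_should_flush_all`, and the `SKETCH_*` constants are made up for this example, and the only values taken from the summary are the 4 KB page size and the 128 MB threshold.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE	4096ULL			/* 4 KB base page */
#define SKETCH_FLUSH_THRESH	(128ULL * 1024 * 1024)	/* 128 MB, from the summary */

/*
 * Number of per-page invlpg operations a page-by-page walk of
 * [sva, eva) would issue.  sva is rounded down and eva rounded up
 * to page boundaries.
 */
static uint64_t
sketch_invlpg_count(uint64_t sva, uint64_t eva)
{
	uint64_t start, end;

	assert(eva >= sva);
	start = sva & ~(SKETCH_PAGE_SIZE - 1);
	end = (eva + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1);
	return ((end - start) / SKETCH_PAGE_SIZE);
}

/*
 * Hypothetical decision helper: true when the range is large enough
 * that a full TLB flush is assumed to be cheaper than per-page invlpg.
 */
static bool
sketch_should_flush_all(uint64_t sva, uint64_t eva)
{
	assert(eva >= sva);
	return (eva - sva >= SKETCH_FLUSH_THRESH);
}
```

With these definitions, a range of exactly 128 MB yields 32,768 per-page invalidations and crosses the threshold; anything smaller stays on the per-page path.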

Sponsored by: EMC / Isilon Storage Division

Test Plan

Before:

$ time kldload ntb_hw
...
kldload ntb_hw  0.00s user 72.89s system 150% cpu 48.347 total
kldload ntb_hw  0.00s user 57.29s system 144% cpu 39.518 total

After:

$ time kldload ntb_hw
...
kldload ntb_hw  0.00s user 12.87s system 99% cpu 12.890 total
kldload ntb_hw  0.00s user 12.79s system 99% cpu 12.793 total

Diff Detail

Repository
rS FreeBSD src repository - subversion

Event Timeline

cem retitled this revision to pmap_invalidate_range: For very large ranges, flush the whole TLB.
cem updated this object.
cem edited the test plan for this revision.
cem added reviewers: alc, kib, jhb, markj.
cem added a subscriber: benno.

Typical second-level TLB size is around 512-1024 entries, and at least on Haswell, Intel claims that each entry can hold either a 4K or a 2M PTE.

I think that a more involved heuristic would be more useful here. For the kernel pmap: if the range falls into the DMAP, assume 2M mappings; otherwise assume 4K mappings. For a user pmap, assume 4K. Then fall back to invalidate_all() if the range is covered by 512 or more PTEs.
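
A rough sketch of the heuristic proposed in this comment follows. It is an assumption-laden illustration rather than code from the revision: the `hyp_*` names are invented, and the direct-map bounds are placeholder constants standing in for whatever the kernel's DMAP layout actually provides.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define HYP_PAGE_4K	(4ULL * 1024)
#define HYP_PAGE_2M	(2ULL * 1024 * 1024)
#define HYP_PTE_LIMIT	512ULL	/* threshold suggested in the comment */

/* Placeholder direct-map bounds, assumed for illustration only. */
#define HYP_DMAP_MIN	0xfffff80000000000ULL
#define HYP_DMAP_MAX	0xfffffc0000000000ULL

static bool
hyp_in_dmap(uint64_t sva, uint64_t eva)
{
	return (sva >= HYP_DMAP_MIN && eva <= HYP_DMAP_MAX);
}

/*
 * Guess the mapping size from the pmap type and address range, then
 * report whether the range is covered by HYP_PTE_LIMIT or more PTEs,
 * in which case a full invalidation would be used instead.
 */
static bool
hyp_use_invalidate_all(bool is_kernel_pmap, uint64_t sva, uint64_t eva)
{
	uint64_t pgsz, nptes;

	assert(eva >= sva);
	pgsz = (is_kernel_pmap && hyp_in_dmap(sva, eva)) ?
	    HYP_PAGE_2M : HYP_PAGE_4K;
	nptes = (eva - sva + pgsz - 1) / pgsz;
	return (nptes >= HYP_PTE_LIMIT);
}
```

Under these assumptions a 2 MB user range reaches the limit (512 PTEs of 4K each), while a 2 MB range inside the direct map counts as a single 2M PTE and does not.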

In D4280#90232, @kib wrote:

Typical second-level TLB size is around 512-1024 entries, and at least on Haswell, Intel claims that each entry can hold either a 4K or a 2M PTE.

Yes, but...

I think that a more involved heuristic would be more useful here. For the kernel pmap: if the range falls into the DMAP, assume 2M mappings; otherwise assume 4K mappings. For a user pmap, assume 4K. Then fall back to invalidate_all() if the range is covered by 512 or more PTEs.

The actual mapping size doesn't matter, because pmap_invalidate_range issues invlpg for every 4k in the region, regardless of mapping size (and it must do so to be correct, I think).

I am happy to change it from 128 MB (32k PTEs) to 2 MB (512 PTEs), if that's what you'd suggest.

In D4280#90310, @cem wrote:
In D4280#90232, @kib wrote:

Typical second-level TLB size is around 512-1024 entries, and at least on Haswell, Intel claims that each entry can hold either a 4K or a 2M PTE.

Yes, but...

I think that a more involved heuristic would be more useful here. For the kernel pmap: if the range falls into the DMAP, assume 2M mappings; otherwise assume 4K mappings. For a user pmap, assume 4K. Then fall back to invalidate_all() if the range is covered by 512 or more PTEs.

The actual mapping size doesn't matter, because pmap_invalidate_range issues invlpg for every 4k in the region, regardless of mapping size (and it must do so to be correct, I think).

I am happy to change it from 128 MB (32k PTEs) to 2 MB (512 PTEs), if that's what you'd suggest.

pmap_update_pde_invalidate() is a good example of what Kostik is suggesting.

In D4280#90372, @alc wrote:

pmap_update_pde_invalidate() is a good example of what Kostik is suggesting.

I don't understand. In pde_invalidate(), we assume we are promoting or demoting a single, previously valid superpage (or collection of 4k pages). In pmap_invalidate_range, we don't know anything about the current page tables or TLB entries for the given range. We don't know what the mapping used to be, and we're called (at least by pmap_mapdev) after we initialize pagetables for the covered region.

cem edited edge metadata.

Switch heuristic to 512 PTEs; extend to all x86.

sys/amd64/amd64/pmap.c
1447 ↗(On Diff #10555)

This needs to be below the pmap_type_guest() check to avoid breaking bhyve.

sys/amd64/amd64/pmap.c
1447 ↗(On Diff #10555)

I don't think so. See 40 lines down, at the top of pmap_invalidate_all(). The same check is performed there.

sys/i386/i386/pmap.c
1030 ↗(On Diff #10555)

pmap_invalidate_all() on i386 has never flushed PG_G entries, so this wouldn't be correct. (Historically, amd64 was the same, until PCID support was introduced.)

sys/i386/i386/pmap.c
1030 ↗(On Diff #10555)

Is there an alternative routine that does flush PG_G entries? Or, i386 can just be dropped from the patch again.

sys/amd64/amd64/pmap.c
1447 ↗(On Diff #10555)

Ah, ok.

sys/i386/i386/pmap.c
1030 ↗(On Diff #10555)

It's also true for amd64 if PCID is not enabled or not supported (there's a tunable to disable it). invlpg is the only way to flush PG_G mappings (which kernel mappings have set) while PGE is set in that case. Alternatively, you can disable PGE in cr4, reload cr3, then re-enable PGE (you might have to re-flush cr3 a second time after enabling PGE in cr4 for it to take effect).
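
The behaviour described here can be modelled with a toy TLB in userspace. This is a simulation for illustration only, not kernel code: every `toy_*` name is invented, and the model encodes just three architectural rules: a cr3 reload keeps global entries while PGE is set, changing CR4.PGE drops every entry, and invlpg drops a matching entry whether or not it is global.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_TLB_SLOTS	8

struct toy_tlb_entry {
	uint64_t va;		/* page-aligned virtual address */
	bool valid;
	bool global;		/* models PG_G */
};

struct toy_cpu {
	bool pge;		/* models CR4.PGE */
	struct toy_tlb_entry tlb[TOY_TLB_SLOTS];
};

static void
toy_fill(struct toy_cpu *cpu, size_t slot, uint64_t va, bool global)
{
	assert(slot < TOY_TLB_SLOTS);
	cpu->tlb[slot].va = va & ~0xfffULL;
	cpu->tlb[slot].valid = true;
	cpu->tlb[slot].global = global;
}

/* cr3 reload: global entries survive while PGE is set. */
static void
toy_reload_cr3(struct toy_cpu *cpu)
{
	for (size_t i = 0; i < TOY_TLB_SLOTS; i++)
		if (!(cpu->pge && cpu->tlb[i].global))
			cpu->tlb[i].valid = false;
}

/* Changing CR4.PGE drops every entry, global or not. */
static void
toy_set_pge(struct toy_cpu *cpu, bool enable)
{
	if (cpu->pge != enable)
		for (size_t i = 0; i < TOY_TLB_SLOTS; i++)
			cpu->tlb[i].valid = false;
	cpu->pge = enable;
}

/* invlpg drops the entry for one page, including a global one. */
static void
toy_invlpg(struct toy_cpu *cpu, uint64_t va)
{
	for (size_t i = 0; i < TOY_TLB_SLOTS; i++)
		if (cpu->tlb[i].va == (va & ~0xfffULL))
			cpu->tlb[i].valid = false;
}

/* The sequence from the comment: clear PGE, reload cr3, set PGE. */
static void
toy_flush_global(struct toy_cpu *cpu)
{
	toy_set_pge(cpu, false);
	toy_reload_cr3(cpu);
	toy_set_pge(cpu, true);
}

static size_t
toy_count_valid(const struct toy_cpu *cpu)
{
	size_t n = 0;

	for (size_t i = 0; i < TOY_TLB_SLOTS; i++)
		n += cpu->tlb[i].valid ? 1 : 0;
	return (n);
}
```

In this model, a plain `toy_reload_cr3()` leaves the global entry behind, which is the concern raised for i386, whereas `toy_invlpg()` or `toy_flush_global()` removes it.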

sys/i386/i386/pmap.c
1030 ↗(On Diff #10555)

John, take another look at the code. In fact, we toggle PGE in cr4 when PCID isn't working or enabled.

sys/i386/i386/pmap.c
1030 ↗(On Diff #10555)

Humm, but only on amd64. i386 could still be "fixed", but it would need something akin to invltlb_globpcid() for the kernel_pmap case? (The "pcid" bit of the name seems a misnomer, perhaps it should just be invltlb_global() since it is also used for the !PCID case on amd64.)

Actually, on amd64 that is not completely true. We install the "plain" INVLTLB IPI handler that only flushes cr3 on a global shootdown, so only the initiating CPU will do the PGE dance; the other CPUs will not.

sys/i386/i386/pmap.c
1030 ↗(On Diff #10555)
In D4280#90427, @cem wrote:
In D4280#90372, @alc wrote:

pmap_update_pde_invalidate() is a good example of what Kostik is suggesting.

I don't understand. In pde_invalidate(), we assume we are promoting or demoting a single, previously valid superpage (or collection of 4k pages). In pmap_invalidate_range, we don't know anything about the current page tables or TLB entries for the given range. We don't know what the mapping used to be, and we're called (at least by pmap_mapdev) after we initialize pagetables for the covered region.

Kostik (and I) are talking about another call to pmap_invalidate_range() that might also operate on a very large address range: pmap_change_attr() on the DMAP region.

alc edited edge metadata.
In D4280#90232, @kib wrote:

Typical second-level TLB size is around 512-1024 entries, and at least on Haswell, Intel claims that each entry can hold either a 4K or a 2M PTE.

Broadwell's level 2 TLB is even larger. It's 1536 entries. So, a threshold of 512 entries is going to be too small on any of the newer processors. Conrad, I suggest that you go back to a much larger threshold, like you had in the original patch. Otherwise, I think that this patch is ready for committing. There are possible enhancements to how we do invalidations on the direct map, but I don't think that those changes will obviate the desire for this change to pmap_invalidate_range().

This revision is now accepted and ready to land. (Dec 5 2015, 8:59 PM)
In D4280#90310, @cem wrote:
In D4280#90232, @kib wrote:

Typical second-level TLB size is around 512-1024 entries, and at least on Haswell, Intel claims that each entry can hold either a 4K or a 2M PTE.

Yes, but...

I think that a more involved heuristic would be more useful here. For the kernel pmap: if the range falls into the DMAP, assume 2M mappings; otherwise assume 4K mappings. For a user pmap, assume 4K. Then fall back to invalidate_all() if the range is covered by 512 or more PTEs.

The actual mapping size doesn't matter, because pmap_invalidate_range issues invlpg for every 4k in the region, regardless of mapping size (and it must do so to be correct, I think).

See section 4.10.4.2 in Volume 3a of the Intel manuals. Specifically, the footnote:

"1. One execution of INVLPG is sufficient even for a page with size greater than 4 KBytes."

P.S. The reason that I backed away from the suggestion to use invlpg for the PG_PS mappings in the pmap_set_pg() patch is that the first 1 MB of physical memory is handled specially by the MMU because of the fixed MTRRs. Essentially, the PG_PS flag is ignored when filling the TLB on a mapping that includes the first 1 MB of physical memory; only 4KB mappings are stored in the TLB. Normally, we load the kernel above 2 or 4MB, depending on the use of PAE, but someone could reconfigure that.

In D4280#92690, @alc wrote:

P.S. The reason that I backed away from the suggestion to use invlpg for the PG_PS mappings in the pmap_set_pg() patch is that the first 1 MB of physical memory is handled specially by the MMU because of the fixed MTRRs. Essentially, the PG_PS flag is ignored when filling the TLB on a mapping that includes the first 1 MB of physical memory; only 4KB mappings are stored in the TLB. Normally, we load the kernel above 2 or 4MB, depending on the use of PAE, but someone could reconfigure that.

Is this documented anywhere? FWIW, it should not matter due to the specification note you mentioned elsewhere. Also, I did 4K page invalidations in D4346 (i.e., I incremented the va to which invlpg was applied by PAGE_SIZE, even for the pde-mapped pages).

In D4280#92683, @alc wrote:
In D4280#90232, @kib wrote:

Typical second-level TLB size is around 512-1024 entries, and at least on Haswell, Intel claims that each entry can hold either a 4K or a 2M PTE.

Broadwell's level 2 TLB is even larger. It's 1536 entries. So, a threshold of 512 entries is going to be too small on any of the newer processors. Conrad, I suggest that you go back to a much larger threshold, like you had in the original patch. Otherwise, I think that this patch is ready for committing. There are possible enhancements to how we do invalidations on the direct map, but I don't think that those changes will obviate the desire for this change to pmap_invalidate_range().

Ok.

In D4280#92690, @alc wrote:
In D4280#90310, @cem wrote:
In D4280#90232, @kib wrote:

I think that a more involved heuristic would be more useful here. For the kernel pmap: if the range falls into the DMAP, assume 2M mappings; otherwise assume 4K mappings. For a user pmap, assume 4K. Then fall back to invalidate_all() if the range is covered by 512 or more PTEs.

The actual mapping size doesn't matter, because pmap_invalidate_range issues invlpg for every 4k in the region, regardless of mapping size (and it must do so to be correct, I think).

See section 4.10.4.2 in Volume 3a of the Intel manuals. Specifically, the footnote:

"1. One execution of INVLPG is sufficient even for a page with size greater than 4 KBytes."

Sure, if we know any existing mapping in that range must have been the single PG_PS page. If we aren't sure, we have to invalidate all possible 4k mappings (which will of course invlpg the larger page as a side effect).
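
The distinction drawn here can be illustrated with a toy model, assuming a TLB that may hold either one 2M entry or several 4K entries for the same region. The `model_*` names are invented for this sketch and nothing below is taken from pmap.c.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_4K	(4ULL * 1024)
#define MODEL_2M	(2ULL * 1024 * 1024)
#define MODEL_SLOTS	16

struct model_entry {
	uint64_t base;		/* start of the translated region */
	uint64_t size;		/* MODEL_4K or MODEL_2M */
	bool valid;
};

struct model_tlb {
	struct model_entry e[MODEL_SLOTS];
};

/* invlpg: drop whichever entry translates va, whatever its page size. */
static void
model_invlpg(struct model_tlb *t, uint64_t va)
{
	for (size_t i = 0; i < MODEL_SLOTS; i++) {
		struct model_entry *e = &t->e[i];

		if (e->valid && va >= e->base && va - e->base < e->size)
			e->valid = false;
	}
}

/* Conservative walk: one invlpg per 4K step across [sva, eva). */
static void
model_invalidate_range_4k(struct model_tlb *t, uint64_t sva, uint64_t eva)
{
	assert(eva >= sva);
	for (uint64_t va = sva; va < eva; va += MODEL_4K)
		model_invlpg(t, va);
}

static size_t
model_count_valid(const struct model_tlb *t)
{
	size_t n = 0;

	for (size_t i = 0; i < MODEL_SLOTS; i++)
		n += t->e[i].valid ? 1 : 0;
	return (n);
}
```

In the model, one `model_invlpg()` anywhere inside a known 2M entry removes it, matching the quoted footnote. If the same 2 MB region is instead cached as separate 4K entries, a single invlpg leaves the others stale, and only the 4K walk clears them all.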

This revision was automatically updated to reflect the committed changes.
In D4280#92816, @kib wrote:
In D4280#92690, @alc wrote:

P.S. The reason that I backed away from the suggestion to use invlpg for the PG_PS mappings in the pmap_set_pg() patch is that the first 1 MB of physical memory is handled specially by the MMU because of the fixed MTRRs. Essentially, the PG_PS flag is ignored when filling the TLB on a mapping that includes the first 1 MB of physical memory; only 4KB mappings are stored in the TLB. Normally, we load the kernel above 2 or 4MB, depending on the use of PAE, but someone could reconfigure that.

Is this documented anywhere? FWIW, it should not matter due to the specification note you mentioned elsewhere. Also, I did 4K page invalidations in D4346 (i.e., I incremented the va to which invlpg was applied by PAGE_SIZE, even for the pde-mapped pages).

Section 11.11.9 of Volume 3A. It's because memory type information is kept in the TLB entry.

Yes, in principle, a single invlpg on a PG_PS mapping to the first 1 MB of physical memory should suffice. However, I don't think anyone has relied on this working because of errata in long-ago processors.

The more interesting scenario would be 1 GB page mappings in, I believe, Westmere, or whatever was the first Intel processor to "support" 1 GB pages. The TLB didn't actually support 1 GB page mappings, so a 1 GB page mapping in the page table got converted into a 2 MB TLB entry. I've never looked to see if there are any errata concerning the use of a single invlpg in this case.

In D4280#92951, @alc wrote:

Section 11.11.9 of Volume 3A. It's because memory type information is kept in the TLB entry.

The section is rather vague. I understand the intent: Intel requires that the PAT types not contradict the fixed MTRRs in the low 1M, and threatens undefined behaviour otherwise. OTOH, they claim that the situation is handled. And I believe that it must be handled; otherwise numerous BIOS bugs would be much more visible.

Yes, in principle, a single invlpg on a PG_PS mapping to the first 1 MB of physical memory should suffice. However, I don't think anyone has relied on this working because of errata in long-ago processors.

The more interesting scenario would be 1 GB page mappings in, I believe, Westmere, or whatever was the first Intel processor to "support" 1 GB pages. The TLB didn't actually support 1 GB page mappings, so a 1 GB page mapping in the page table got converted into a 2 MB TLB entry. I've never looked to see if there are any errata concerning the use of a single invlpg in this case.

I just checked the public errata list for the Xeon 5600 series. There is no mention of 1G pages at all.

In D4280#93168, @kib wrote:
In D4280#92951, @alc wrote:

Section 11.11.9 of Volume 3A. It's because memory type information is kept in the TLB entry.

The section is rather vague. I understand the intent: Intel requires that the PAT types not contradict the fixed MTRRs in the low 1M, and threatens undefined behaviour otherwise. OTOH, they claim that the situation is handled. And I believe that it must be handled; otherwise numerous BIOS bugs would be much more visible.

My belief is that AMD and Intel store the so-called "effective memory type", i.e., the memory type that results from combining the PAT and MTRR settings, in the TLB entry. And, so, a superpage mapping in the TLB would have the MTRR setting corresponding to the first access to physical memory through the superpage mapping.

Yes, in principle, a single invlpg on a PG_PS mapping to the first 1 MB of physical memory should suffice. However, I don't think anyone has relied on this working because of errata in long-ago processors.

The more interesting scenario would be 1 GB page mappings in, I believe, Westmere, or whatever was the first Intel processor to "support" 1 GB pages. The TLB didn't actually support 1 GB page mappings, so a 1 GB page mapping in the page table got converted into a 2 MB TLB entry. I've never looked to see if there are any errata concerning the use of a single invlpg in this case.

I just checked the public errata list for the Xeon 5600 series. There is no mention of 1G pages at all.

In D4280#92951, @alc wrote:

Yes, in principle, a single invlpg on a PG_PS mapping to the first 1 MB of physical memory should suffice. However, I don't think anyone has relied on this working because of errata in long-ago processors.

I found this erratum for the Core M/Celeron M CPUs. They are quite old but not ancient; apparently people still use machines with such processors.

W93.  INVLPG Operation for Large (2M/4M) Pages May Be Incomplete under
      Certain Conditions

Problem:  The INVLPG instruction may not completely invalidate Translation
          Look-aside Buffer (TLB) entries for large pages (2M/4M) when both
          of the following conditions exist:

          · Address range of the page being invalidated spans several Memory
            Type Range Registers (MTRRs) with different memory types
            specified

          · INVLPG operation is preceded by a Page Assist Event (Page Fault
            (#PF) or an access that results in either A or D bits being set
            in a Page Table Entry (PTE))

Implication:  Stale translations may remain valid in TLB after a PTE update
              resulting in unpredictable system behavior. Intel has not
              observed this erratum with any commercially available software.

Workaround:  Software should ensure that the memory type specified in the
             MTRRs is the same for the entire address range of the large page.