X86 pmap_qenter needs to always invalidate
Abandoned · Public

Authored by bz on Mar 20 2017, 8:12 PM.

Details

Reviewers

alc
kib

Summary

Running FreeBSD X86(_64) under gem5 we have seen

"vm_fault: fault on nofault entry, addr: ..."

panics regularly (and over time in different places).

Analysing what happened we found that with the out-of-order CPU
model we would kick off the page table walker and cache a zero pte
entry in the walker cache.
The page table walker had no insight into the updated pte value,
which at that time was still in the register file of the CPU: the
walk happened before the store from pmap_qenter that changes the pte
was committed and hence visible on the memory side of the CPU.
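
The ordering problem described above can be illustrated with a toy
model. This is only a sketch under stated assumptions: the names
(toy_cpu, toy_walk, and so on) are hypothetical and are neither gem5
nor FreeBSD code, and the model holds a single in-flight store. It
takes no position on which behaviour the architecture requires; it
only shows what a walker that reads memory alone would observe.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical toy model; not gem5 or FreeBSD code. */
typedef uint64_t toy_pte_t;

#define TOY_PTE_V	0x1ULL	/* valid (present) bit */

struct toy_cpu {
	toy_pte_t	mem_pte;	/* PTE as seen on the memory side */
	toy_pte_t	pending_pte;	/* value of a store not yet committed */
	bool		store_pending;	/* true while the store is in flight */
};

/* The core issues a PTE store; it is not yet visible in memory. */
static void
toy_issue_store(struct toy_cpu *cpu, toy_pte_t newpte)
{
	assert(!cpu->store_pending);	/* model: one store at a time */
	cpu->pending_pte = newpte;
	cpu->store_pending = true;
}

/* The store commits and becomes visible on the memory side. */
static void
toy_commit_store(struct toy_cpu *cpu)
{
	if (cpu->store_pending) {
		cpu->mem_pte = cpu->pending_pte;
		cpu->store_pending = false;
	}
}

/*
 * A page walk.  If sees_pending is false, the walker reads memory only
 * and misses an uncommitted store; if true, it also observes the
 * in-flight store.
 */
static toy_pte_t
toy_walk(const struct toy_cpu *cpu, bool sees_pending)
{
	if (sees_pending && cpu->store_pending)
		return (cpu->pending_pte);
	return (cpu->mem_pte);
}
```

In this model, a walk issued between toy_issue_store() and
toy_commit_store() with sees_pending set to false still returns the
old zero pte.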

The reason we do not seem to see this problem on (most) hardware
is that, according to [1], most CPUs seem to implement stronger
coherence guarantees than the specifications demand.

Intel's SDM [2] states (the end of 4.10.3.1 Caches for Paging Structures):

"The processor may create entries in paging-structure caches for
translations required for prefetches and for accesses that are a
result of speculative execution that would never actually occur
in the executed code path.
...
Because the processor may create the cache entries at the time of
translation and not update them following subsequent modifications
to the paging structures in memory, software should take care to
invalidate the cache entries appropriately when causing such modifications."

AMD's ArchPM Vol 2: System Programming [3] describes this case in
section 7.3.1, Special Coherency Considerations.

In order to not rely on non-guaranteed behaviour, remove the
optimization that invalidates the range only if any of the previous
ptes was a valid mapping. With this change, FreeBSD runs properly on
gem5 and likely also better on certain (AMD) CPUs.
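
As a rough illustration of the two policies, here is a minimal
userspace sketch. It is not the kernel code from this diff: all
names prefixed toy_ are hypothetical, the invalidation is replaced
by a counter, and the always_invalidate flag exists only to show
both behaviours side by side.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace model; toy_ names are not kernel APIs. */
typedef uint64_t toy_pte_t;

#define TOY_PG_V	0x001ULL	/* valid bit */
#define TOY_PG_RW	0x002ULL	/* writable bit */
#define TOY_PG_FRAME	(~0xfffULL)	/* physical frame mask */

static unsigned toy_invalidations;	/* counts range invalidations */

/* Stand-in for a TLB range invalidation; it only counts calls. */
static void
toy_invalidate_range(void)
{
	toy_invalidations++;
}

/*
 * Install count mappings into ptes[].  With always_invalidate false,
 * the range is invalidated only if some overwritten pte was valid (the
 * optimization).  With it true, the range is invalidated
 * unconditionally (the behaviour proposed in this review).
 */
static void
toy_qenter(toy_pte_t *ptes, const uint64_t *pa, size_t count,
    bool always_invalidate)
{
	toy_pte_t oldpte = 0;
	size_t i;

	for (i = 0; i < count; i++) {
		assert((pa[i] & ~TOY_PG_FRAME) == 0);	/* page aligned */
		oldpte |= ptes[i];	/* remember any old valid bit */
		ptes[i] = pa[i] | TOY_PG_RW | TOY_PG_V;
	}
	if (always_invalidate || (oldpte & TOY_PG_V) != 0)
		toy_invalidate_range();
}
```

With always_invalidate false, entering mappings over all-zero ptes
performs no invalidation at all; that is the case the proposed
change removes.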

[1] http://blog.stuffedcow.net/2015/08/pagewalk-coherence/#coherence
[2] https://software.intel.com/sites/default/files/managed/a4/60/325384-sdm-vol-3abcd.pdf
[3] http://support.amd.com/TechDocs/24593.pdf

Diff Detail

Repository

rS FreeBSD src repository - subversion

Lint

Lint Passed

Unit

No Test Coverage

Build Status

Buildable 8165
Build 8387: CI src build	Jenkins
Build 8386: arc lint + arc unit

Event Timeline

bz created this revision. Mar 20 2017, 8:12 PM

Herald added a subscriber: imp. Mar 20 2017, 8:12 PM

Add Robert to Cc:

Analysing what happened we found that with the out-of-order CPU model we would kick off the page table walker and cache a zero pte entry in the walker cache.

This is clearly a bug in the emulator. According to SDM 4.10.2.3, Details of TLB Use:

Because the TLBs cache entries only for linear addresses with translations, there can be a TLB entry for a page
number only if the P flag is 1 and the reserved bits are 0 in each of the paging-structure entries used to translate
that page number. In addition, the processor does not cache a translation for a page number unless the accessed
flag is 1 in each of the paging-structure entries used during translation; before caching a translation, the processor
sets any of these accessed flags that is not already 1.

In other words, zero PTEs (which are invalid) are not allowed to be cached by the architecture and do not need an invalidation.
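
The quoted TLB rule can be written down as a small predicate. This is only a sketch: toy_may_cache_translation and rsvd_mask are hypothetical names, the real reserved-bit mask depends on the paging mode and level, and the accessed-flag update mentioned in the quote is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_P	0x1ULL	/* P (present) flag, bit 0 */

/*
 * Return true if a translation may be cached under the quoted rule:
 * every paging-structure entry used for the walk must have P = 1 and
 * no reserved bits set.  rsvd_mask is a hypothetical, caller-supplied
 * mask of reserved bits.
 */
static bool
toy_may_cache_translation(const uint64_t *entries, size_t nlevels,
    uint64_t rsvd_mask)
{
	size_t i;

	assert(nlevels > 0);
	for (i = 0; i < nlevels; i++) {
		if ((entries[i] & TOY_P) == 0)
			return (false);	/* not present: nothing to cache */
		if ((entries[i] & rsvd_mask) != 0)
			return (false);	/* reserved bit set */
	}
	return (true);
}
```

A walk that ends in a zero entry fails the P check, so under this rule no translation is cached for it.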

@kib ok, this may be a secondary problem; change my sentence to "Analysing what happened we found that with the out-of-order CPU model we would kick off the page table walker and find the 0 pte entry." The problem remains that the store has not been committed yet and a speculative walk will only see the old (zero) pte.

Harbormaster completed remote builds in B8165: Diff 26467. Mar 20 2017, 8:51 PM

In D10067#208210, @bz wrote:

we would kick off the page table walker and find the 0 pte entry." The problem remains that the store has not been committed yet and a speculative walk will only see the old (zero) pte.

Does this happen on the same CPU that did the pte_store()? If yes, this is again an emulator bug: the page walks must be coherent; in particular, they must be able to see the content of the store buffers on the local processor (AKA store forwarding).

If it is another CPU that sees a zero pte after pmap_qenter(), then it is legitimate machine behavior, but it just means that there is a race and the code would behave the same as if pmap_qenter() had not yet executed the pte_store() at all. There must be external facilities (like locks) which ensure that other threads do not access the mappings until our thread has finished setting them up. Can you provide the backtraces for the pmap_qenter() thread and the raced thread, if the issue is caused by a race?

This is on a single CPU. Can you please give me a reference for "the page walks must be coherent", as my understanding from the cited pages and the referenced blog post is that they need not be.

In D10067#208217, @bz wrote:

This is on a single CPU. Can you please give me a reference for "the page walks must be coherent", as my understanding from the cited pages and the referenced blog post is that they need not be.

I am not sure which blog post you mean; could you please provide the URL? If you mean the situation explained e.g. in SDM 11.7 IMPLICIT CACHING, then it is not applicable because the previous pte entry is invalid.

As for the coherence of the page walks, the generic rule is that CPU accesses, unless documented to the contrary, must obey the caching policy specified for the given memory address. Store buffers are only visible in specific situations explicitly mentioned in the specification, all of which involve more than one CPU.

See reference [1]

In D10067#208488, @bz wrote:

See reference [1]

This is 11.7 IMPLICIT CACHING.

Can we close this?

Close for now; still need to track down but ETIME currently.

Revision Contents
Changeset List

Path                Size

sys/
  amd64/
    pmap.c          12 lines
  i386/
    pmap.c          12 lines

Diff 26467


sys/amd64/amd64/pmap.c