
Avoid reclaiming the PV entry for a VA when updating the VA's PTE.
Abandoned
Public

Authored by markj on Jun 19 2018, 7:44 PM.

Details

Reviewers
alc
kib
Summary

When pmap_enter() calls get_pv_entry(), a PV chunk belonging to the current
pmap might be reclaimed. If pmap_enter() is updating an existing PTE,
there is nothing preventing reclamation of the PV chunk corresponding to
that PTE, a situation that pmap_enter() doesn't handle. In this case,
because pmap_enter() releases its extra wiring of the PTE's page table
page before calling get_pv_entry(), we can get a use-after-free of the
page table page if the PV chunk reclamation removes all PTEs from that
page.

We would seem to have a similar problem for PDEs, but reclaim_pv_chunk()
currently does not reclaim PV entries for superpage mappings. However,
the comment above reclaim_pv_chunk() gives a different reason for this
choice.

Fix the problem by explicitly passing a "skip VA" to reclaim_pv_chunk().
A PV entry for this VA in locked_pmap does not get reclaimed, avoiding
the problem. As noted above, this is currently not strictly needed for
superpage mappings, but do it anyway for consistency and robustness.
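
For illustration only, the check inside reclaim_pv_chunk() might look roughly like the fragment below, with the function growing an extra argument (called skip_va here purely as a placeholder) naming the VA that pmap_enter() is operating on:

```c
/* Sketch only, inside reclaim_pv_chunk()'s scan of each PV chunk. */
va = pv->pv_va;
if (pmap == locked_pmap && va == skip_va) {
	/*
	 * pmap_enter() is in the middle of updating this mapping and
	 * still relies on its page table page; leave it alone.
	 */
	continue;
}
/* Otherwise, remove the 4KB mapping and free the PV entry as before. */
```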

Test Plan

Peter found the problem and is in the process of validating this patch.

Diff Detail

Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 17453
Build 17284: arc lint + arc unit

Event Timeline

markj edited the test plan for this revision.
markj added reviewers: alc, kib.
This revision is now accepted and ready to land. Jun 19 2018, 8:48 PM

I have two comments.

  1. The change to reserve_pv_entries() is unnecessary. That function is only called during demotion. At that point, the only PV entry will be for the superpage mapping, and reclaim_pv_chunk() already skips superpage mappings.
  2. In regards to pmap_enter(), we should aim to kill two birds with one stone. Recall the copy-on-write mapping bug that Kostik worked around in vm_fault(). I say worked around because the root cause is here in pmap_enter(). When the physical page mapped at va is changing, pmap_enter() should destroy the old mapping before creating the new one. Once pmap_enter() is restructured in this way, you can simply recycle the old mapping's PV entry.
In D15910#337582, @alc wrote:

I have two comments.

  1. The change to reserve_pv_entries() is unnecessary. That function is only called during demotion. At that point, the only PV entry will be for the superpage mapping, and reclaim_pv_chunk() already skips superpage mappings.

Indeed, I noted this in the review description. The comment above reclaim_pv_chunk() suggests that it skips superpage mappings only to avoid worsening an ongoing PV entry shortage, but in fact this is required for correctness of the code. To me it seemed fragile to leave reserve_pv_entries() as it was since the handling of reclaim_pv_chunk() wrt superpage mappings might change in the future (e.g., by searching for and reclaiming all 4KB page mappings when a superpage mapping is discovered) and introduce this subtle bug.

  1. In regards to pmap_enter(), we should aim to kill two birds with one stone. Recall the copy-on-write mapping bug that Kostik worked around in vm_fault(). I say worked around because the root cause is here in pmap_enter(). When the physical page mapped at va is changing, pmap_enter() should destroy the old mapping before creating the new one. Once pmap_enter() is restructured in this way, you can simply recycle the old mapping's PV entry.

Ok, I will explore that approach.

In D15910#337582, @alc wrote:

I have two comments.

  1. The change to reserve_pv_entries() is unnecessary. That function is only called during demotion. At that point, the only PV entry will be for the superpage mapping, and reclaim_pv_chunk() already skips superpage mappings.

Indeed, I noted this in the review description. The comment above reclaim_pv_chunk() suggests that it skips superpage mappings only to avoid worsening an ongoing PV entry shortage, but in fact this is required for correctness of the code. To me it seemed fragile to leave reserve_pv_entries() as it was since the handling of reclaim_pv_chunk() wrt superpage mappings might change in the future (e.g., by searching for and reclaiming all 4KB page mappings when a superpage mapping is discovered) and introduce this subtle bug.

I can't foresee the behavior of reclaim_pv_chunk() changing with respect to superpage mappings. I'm perfectly happy to see the comment changed to say that the behavior is mandatory. In a hypothetical future where we supported 1GB pages on x86 or 64KB pages on arm, I might want to add a flag to reclaim_pv_chunk() indicating the maximum page size on which reclaims are permitted.

  1. In regards to pmap_enter(), we should aim to kill two birds with one stone. Recall the copy-on-write mapping bug that Kostik worked around in vm_fault(). I say worked around because the root cause is here in pmap_enter(). When the physical page mapped at va is changing, pmap_enter() should destroy the old mapping before creating the new one. Once pmap_enter() is restructured in this way, you can simply recycle the old mapping's PV entry.

Ok, I will explore that approach.

Thanks.

In D15910#337582, @alc wrote:
  1. In regards to pmap_enter(), we should aim to kill two birds with one stone. Recall the copy-on-write mapping bug that Kostik worked around in vm_fault(). I say worked around because the root cause is here in pmap_enter(). When the physical page mapped at va is changing, pmap_enter() should destroy the old mapping before creating the new one. Once pmap_enter() is restructured in this way, you can simply recycle the old mapping's PV entry.

Can you elaborate more, please ? What do you mean by destroying the old mapping ? In particular, do you mean installing the pte with clear PG_V into the changing PTE ?

In D15910#337632, @kib wrote:
In D15910#337582, @alc wrote:
  1. In regards to pmap_enter(), we should aim to kill two birds with one stone. Recall the copy-on-write mapping bug that Kostik worked around in vm_fault(). I say worked around because the root cause is here in pmap_enter(). When the physical page mapped at va is changing, pmap_enter() should destroy the old mapping before creating the new one. Once pmap_enter() is restructured in this way, you can simply recycle the old mapping's PV entry.

Can you elaborate more, please ? What do you mean by destroying the old mapping ? In particular, do you mean installing the pte with clear PG_V into the changing PTE ?

Yes, I mean destroying the PTE, including TLB shootdown. In effect, briefly the PTE will be 0. Then, installing the new PTE is just a pte_store(). All of the stuff that we currently perform under "if ((origpte & PG_V) != 0) {" will already have been performed when we destroyed the PTE.
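
Roughly, the restructured path might look like the following sketch; it only outlines the idea (names follow the existing amd64 pmap_enter() code, but this is not a concrete patch):

```c
/*
 * Sketch: if va is already mapped but to a different physical page,
 * destroy the old mapping first.  The PTE is briefly zero, and
 * installing the new mapping is then just a store.
 */
if ((origpte & PG_V) != 0 && (origpte & PG_FRAME) != pa) {
	origpte = pte_load_clear(pte);
	pmap_invalidate_page(pmap, va);		/* TLB shootdown */
	/* ... accessed/dirty bookkeeping for the old page ... */
	/* ... recycle the old mapping's PV entry for the new page ... */
}
pte_store(pte, newpte);
```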

In D15910#338067, @alc wrote:
In D15910#337632, @kib wrote:
In D15910#337582, @alc wrote:
  1. In regards to pmap_enter(), we should aim to kill two birds with one stone. Recall the copy-on-write mapping bug that Kostik worked around in vm_fault(). I say worked around because the root cause is here in pmap_enter(). When the physical page mapped at va is changing, pmap_enter() should destroy the old mapping before creating the new one. Once pmap_enter() is restructured in this way, you can simply recycle the old mapping's PV entry.

Can you elaborate more, please ? What do you mean by destroying the old mapping ? In particular, do you mean installing the pte with clear PG_V into the changing PTE ?

Yes, I mean destroying the PTE, including TLB shootdown. In effect, briefly the PTE will be 0. Then, installing the new PTE is just a pte_store(). All of the stuff that we currently perform under "if ((origpte & PG_V) != 0) {" will already have been performed when we destroyed the PTE.

Isn't this invalid if both the old and new mappings are wired? Or should it not be possible

Isn't this invalid if both the old and new mappings are wired? Or should it not be possible

... for that situation to arise when the mappings are to different physical pages?

In D15910#338067, @alc wrote:
In D15910#337632, @kib wrote:
In D15910#337582, @alc wrote:
  1. In regards to pmap_enter(), we should aim to kill two birds with one stone. Recall the copy-on-write mapping bug that Kostik worked around in vm_fault(). I say worked around because the root cause is here in pmap_enter(). When the physical page mapped at va is changing, pmap_enter() should destroy the old mapping before creating the new one. Once pmap_enter() is restructured in this way, you can simply recycle the old mapping's PV entry.

Can you elaborate more, please ? What do you mean by destroying the old mapping ? In particular, do you mean installing the pte with clear PG_V into the changing PTE ?

Yes, I mean destroying the PTE, including TLB shootdown. In effect, briefly the PTE will be 0. Then, installing the new PTE is just a pte_store(). All of the stuff that we currently perform under "if ((origpte & PG_V) != 0) {" will already have been performed when we destroyed the PTE.

Then another thread might observe the zero PTE and get a page fault when accessing a valid mapping? I do not see why this is impossible.

Isn't this invalid if both the old and new mappings are wired? Or should it not be possible

... for that situation to arise when the mappings are to different physical pages?

To be clear, I'm assuming that the problem that Kostik is worried about is having "spurious" faults on mlock()ed memory. However, when we wire mappings (and the underlying physical pages) during mlock(), we simulate the COW faults. So, in the normal case of mlock(), I assert that there is not a problem. I speculate that setting a breakpoint in the code of an mlockall()ed application might be the one scenario where a temporarily zeroed PTE could occur. Do I need to argue that a spurious page fault in that scenario isn't really of concern. :-)

I would also like to point out that the work-around in vm_fault() introduces a second page fault to enable write access after the initial COW fault, even in a single-threaded application where there was never an issue. Handling the COW problem in pmap_enter() eliminates that extra page fault. On the other hand, it opens up the possibility of a spurious page fault in multithreaded applications on the temporarily zeroed PTE. However, that would only occur in the rare scenarios described in the original bug report. So, I think that this is overall the right solution from a performance standpoint. We're making the common case cheaper, i.e., only one page fault, and limiting the extra page fault to the rare case.

And clearly, we should wait until all of the pmap implementations have been checked before undoing the workaround from vm_fault(). For example, arm{,64} with its break-before-make requirement may not have the bug.

That said, I'll mention another possible pmap-level approach, and that is to "push down" the PTE change inside of pmap_invalidate_page(). However, we would have to change the shootdown code to be more like a rendezvous, where the initiator waits for all of the other cores to "idle", then performs the PTE update on the initiator, flushes the TLB entry everywhere, and releases the other cores, not returning on the initiator until it is known that the flushes have been performed everywhere. This code would also need to return the old PTE value to the initiator. However, I haven't advocated this approach because it increases the delay in the shootdown code across the system.
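
Purely to illustrate the ordering, a very rough sketch of that alternative follows; the structure and helper names are hypothetical, and the real code would also have to respect pm_active/PCIDs rather than blindly executing invlpg on every CPU:

```c
/* Hypothetical sketch of a rendezvous-style PTE update. */
struct pte_update {
	pt_entry_t	*ptep;		/* PTE being replaced */
	pt_entry_t	newpte;		/* value to install */
	pt_entry_t	oldpte;		/* returned to the initiator */
	vm_offset_t	va;
	u_int		initiator;	/* cpuid of the initiating CPU */
};

static void
pte_update_action(void *arg)
{
	struct pte_update *pu = arg;

	/* All CPUs have rendezvoused; only the initiator touches the PTE. */
	if (PCPU_GET(cpuid) == pu->initiator)
		pu->oldpte = pte_load_store(pu->ptep, pu->newpte);
}

static void
pte_update_teardown(void *arg)
{
	struct pte_update *pu = arg;

	/* Every CPU flushes the stale translation before resuming. */
	invlpg(pu->va);
}

/*
 * Initiator, after filling in a struct pte_update pu:
 *	smp_rendezvous(NULL, pte_update_action, pte_update_teardown, &pu);
 * The barrier between the action and teardown stages ensures the PTE
 * update happens before any CPU flushes and resumes.
 */
```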

In D15910#338113, @alc wrote:

Isn't this invalid if both the old and new mappings are wired? Or should it not be possible

... for that situation to arise when the mappings are to different physical pages?

To be clear, I'm assuming that the problem that Kostik is worried about is having "spurious" faults on mlock()ed memory. However, when we wire mappings (and the underlying physical pages) during mlock(), we simulate the COW faults. So, in the normal case of mlock(), I assert that there is not a problem. I speculate that setting a breakpoint in the code of an mlockall()ed application might be the one scenario where a temporarily zeroed PTE could occur. Do I need to argue that a spurious page fault in that scenario isn't really of concern. :-)

And for completeness, in the case of a fork of an mlockall()ed application, we preemptively copy all writeable data. (During the fork, other threads are paused from executing, right?)

In D15910#338114, @alc wrote:
In D15910#338113, @alc wrote:

Isn't this invalid if both the old and new mappings are wired? Or should it not be possible

... for that situation to arise when the mappings are to different physical pages?

To be clear, I'm assuming that the problem that Kostik is worried about is having "spurious" faults on mlock()ed memory. However, when we wire mappings (and the underlying physical pages) during mlock(), we simulate the COW faults. So, in the normal case of mlock(), I assert that there is not a problem. I speculate that setting a breakpoint in the code of an mlockall()ed application might be the one scenario where a temporarily zeroed PTE could occur. Do I need to argue that a spurious page fault in that scenario isn't really of concern. :-)

And for completeness, in the case of a fork of an mlockall()ed application, we preemptively copy all writeable data. (During the fork, other threads are paused from executing, right?)

No, the threads are not stopped.

This was superseded by a different change, D16005.