
minidump: reduce the number of direct accesses to page tables
Closed, Public

Authored by mhorne on Sep 16 2021, 6:39 PM.

Details

Summary

During a live dump, we may race with updates to the kernel page tables.
This is generally okay; we accept that the state of the system while
dumping may be somewhat inconsistent with its state when the dump was
invoked. However, when walking the kernel page tables, it is important
that we load each PDE/PTE only once while operating on it. Otherwise, the
relevant PTE might change underneath us, for example after we check the
valid bit but before we read the physical address.
This makes it safe to walk the page tables without a lock, since we will
at worst read outdated but consistent information.

Similarly, don't read kernel_vm_end more than once, on the off chance
that pmap_growkernel() is called between the two page table walks.
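
For a concrete picture of the pattern described above, here is a minimal
sketch (not the actual patch) of a leaf walk that reads kernel_vm_end once and
takes a single snapshot of each PTE. The types and macros (pt_entry_t, PG_V,
PG_FRAME, NPTEPG, ptoa) follow amd64 conventions; dump_data_page() is a
hypothetical helper standing in for the real dump routine.

    /*
     * Sketch only: every decision about an entry is made against a single
     * snapshot of that entry, so a concurrent update can at worst make the
     * snapshot stale, never inconsistent.
     */
    static void
    walk_pt_page(pt_entry_t *pt, vm_offset_t sva)
    {
            vm_offset_t kva_end;
            pt_entry_t pte;
            vm_paddr_t pa;
            int i;

            /* Read kernel_vm_end once; pmap_growkernel() may run concurrently. */
            kva_end = kernel_vm_end;

            for (i = 0; i < NPTEPG && sva + ptoa(i) < kva_end; i++) {
                    pte = pt[i];            /* one load per entry */
                    if ((pte & PG_V) == 0)
                            continue;
                    pa = pte & PG_FRAME;    /* valid bit and pa from the same read */
                    dump_data_page(pa);
            }
    }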

Test Plan

To be tinderboxed. I still need to handle powerpc and mips (maybe).

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 42792
Build 39680: arc lint + arc unit

Event Timeline

sys/amd64/amd64/minidump_machdep.c
199–200

There is no guarantee in general that the compiled code will load PTEs only once. atomic_load_* is needed, I believe.

sys/amd64/amd64/minidump_machdep.c
199–200

Right, I should have realized this. Will update shortly.
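
For illustration only, on amd64 (where pt_entry_t is a uint64_t) the fix might
look like the sketch below, assuming a PTE pointer ptep; atomic_load_64() is
the relaxed 64-bit load from FreeBSD's atomic(9) family, and the surrounding
names are only assumed for the example.

    pt_entry_t pte;
    vm_paddr_t pa;

    /*
     * Even "pte = *ptep;" does not guarantee a single access, since the
     * compiler may legally re-read non-volatile memory.  atomic_load_64()
     * forces exactly one load, and all checks use that snapshot.
     */
    pte = atomic_load_64(ptep);
    if ((pte & PG_V) != 0)
            pa = pte & PG_FRAME;    /* consistent with the valid-bit check */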

I suspect this is too unsafe. There are situations where we do unmap something from KVA, and just trying to follow a freed PTE might result in, e.g., attempting to access physical memory outside the DMAP range, generating a non-canonical address, or whatever else the specific architecture does not tolerate.

In D31990#721752, @kib wrote:

I suspect this is too unsafe. There are situations where we do unmap something from KVA, and just trying to follow a freed PTE might result in, e.g., attempting to access physical memory outside the DMAP range, generating a non-canonical address, or whatever else the specific architecture does not tolerate.

Isn't this making the same assumptions that sysctl_kmaps() already does?

In D31990#721752, @kib wrote:

I suspect this is too unsafe. There are situations where we do unmap something from KVA, and just trying to follow a freed PTE might result in, e.g., attempting to access physical memory outside the DMAP range, generating a non-canonical address, or whatever else the specific architecture does not tolerate.

Isn't this making the same assumptions that sysctl_kmaps() already does?

I think not quite. For instance, on amd64 I am almost sure that we never free a kernel page directory page, but under the right conditions we might free a page table page. Now, sysctl_kmaps() would at worst iterate over the PTEs in a freed and reused page table page, dumping nonsense. In contrast, the dump would try to access the data pages pointed to by those stale non-PTE entries.

In D31990#721777, @kib wrote:

For instance, on amd64 I am almost sure that we never free a kernel page directory page, but under the right conditions we might free a page table page.

Where might it happen? amd64 pmap_demote_pde_locked() assumes otherwise:

    /*
     * If the page table page is missing and the mapping
     * is for a kernel address, the mapping must belong to
     * the direct map.  Page table pages are preallocated
     * for every other part of the kernel address space,
     * so the direct map region is the only part of the
     * kernel address space that must be handled here.
     */
    KASSERT(!in_kernel || (va >= DMAP_MIN_ADDRESS &&
        va < DMAP_MAX_ADDRESS),
        ("pmap_demote_pde: No saved mpte for va %#lx", va));

Similarly, pmap_enter() with va > VM_MAXUSER_ADDRESS panics if it fails to find a valid PDE for the fault address.

Now, sysctl_kmaps() would at worst iterate over the PTEs in a freed and reused page table page, dumping nonsense. In contrast, the dump would try to access the data pages pointed to by those stale non-PTE entries.

This still could be an undesirable info leak.

In D31990#721777, @kib wrote:

For instance, on amd64 I am almost sure that we never free a kernel page directory page, but under the right conditions we might free a page table page.

Where might it happen?

For instance, pmap_remove() over managed mappings would free page table pages, and we do have some managed mappings, like pipe buffers.

I believe that code that walks the amd64 page tables on a live kernel should check that computed physical addresses belong to one of the dump_avail[] segments, and also that they fit into the DMAP, before trying to interpret an address as either a next-level paging structure or a data page.

In D31990#722178, @kib wrote:
In D31990#721777, @kib wrote:

For instance, on amd64 I am almost sure that we never free a kernel page directory page, but under the right conditions we might free a page table page.

Where might it happen?

For instance, pmap_remove() over managed mappings would free page table pages, and we do have some managed mappings, like pipe buffers.

I still do not see how. For PTEs, pmap_remove_pte() does not free the page table page directly but calls pmap_unuse_pt(), which is a nop for kernel mappings.

I believe that code that walks the amd64 page tables on a live kernel should check that computed physical addresses belong to one of the dump_avail[] segments, and also that they fit into the DMAP, before trying to interpret an address as either a next-level paging structure or a data page.

I think this is reasonable.
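
For illustration, the kind of check being agreed on here might look like the
sketch below. dmap_covers() is a hypothetical stand-in for the per-architecture
DMAP bounds test, and dump_avail[] is assumed to hold (start, end) physical
address pairs terminated by a zero entry.

    /*
     * Sketch only: treat a physical address taken from a possibly stale
     * PTE as usable only if it lies inside a dump_avail[] segment and is
     * covered by the direct map.
     */
    static bool
    dump_pa_is_sane(vm_paddr_t pa)
    {
            int i;

            if (!dmap_covers(pa))           /* hypothetical DMAP bounds check */
                    return (false);
            for (i = 0; dump_avail[i + 1] != 0; i += 2) {
                    if (pa >= dump_avail[i] && pa < dump_avail[i + 1])
                            return (true);
            }
            return (false);
    }

As the update below notes, the committed change applies this sanity check only
to the physical addresses of the lowest-level PTEs, relying on the observation
above that kernel page directory pages are never freed.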

Handle review comments:

  • Make loads atomic
  • Sanity check the physical addresses of lowest-level PTEs before attempting to dump their contents

Thank you both for the discussion. I've studied the riscv and arm64 pmaps and it looks like the same assumptions hold - we will never free any PDE belonging to KVA - so I've applied the same checks on these platforms. From what I could see, arm and i386 should be safe as we only zero their PTEs.

For powerpc I will probably just disallow live dumps for the time being.

This revision is now accepted and ready to land. Nov 12 2021, 11:15 PM
sys/i386/i386/minidump_machdep_base.c
283–284

Should it be pte_load(&pd[va >> PDRSHIFT])? Same above.

This revision now requires review to proceed. Nov 15 2021, 9:05 PM
mhorne added inline comments.
sys/i386/i386/minidump_machdep_base.c
283–284

Yes, thank you. I am running a tinderbox for these patches now, and it also tripped on this.

This revision is now accepted and ready to land. Nov 15 2021, 9:08 PM
This revision was automatically updated to reflect the committed changes.
mhorne marked an inline comment as done.