Page MenuHomeFreeBSD

amd64: ensure that curproc->p_vmspace pmap always matches PCPU curpmap.
ClosedPublic

Authored by kib on Aug 7 2018, 11:37 PM.

Details

Summary

When performing context switch on a machine without PCID, if current %cr3 equals to the new pmap %cr3, which is typical for kernel_pmap vs. kernel process, I overlooked to update PCPU curpmap value.

Reported by: eadler, ambrisko
Discussed with: kevans
Tested by: ambrisko

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

This revision is now accepted and ready to land.Aug 8 2018, 12:54 AM

Isn't the root cause of this problem that process 0 shares its page table with the kernel pmap, but it doesn't use the kernel pmap as its pmap, i.e., the process 0 vm_map doesn't use the kernel pmap as its pmap? Remember every vm_map has a pointer to a pmap, and this pointer needn't point to the pmap storage within the same vmspace.

Index: amd64/amd64/pmap.c
===================================================================
--- amd64/amd64/pmap.c  (revision 337342)
+++ amd64/amd64/pmap.c  (working copy)
@@ -2655,6 +2655,7 @@ pmap_pinit0(pmap_t pmap)
                        __pcpu[i].pc_ucr3 = PMAP_NO_CR3;
                }
        }
+       printf("pmap: %p / kernel_pmap: %p\n", pmap, kernel_pmap);
        PCPU_SET(curpmap, kernel_pmap);
        pmap_activate(curthread);
        CPU_FILL(&kernel_pmap->pm_active);

and this prints:

pmap: 0xffffffff820010b8 / kernel_pmap: 0xffffffff821394f8
This comment was removed by kevans.
In D16618#353173, @alc wrote:

Isn't the root cause of this problem that process 0 shares its page table with the kernel pmap, but it doesn't use the kernel pmap as its pmap, i.e., the process 0 vm_map doesn't use the kernel pmap as its pmap? Remember every vm_map has a pointer to a pmap, and this pointer needn't point to the pmap storage within the same vmspace.

Index: amd64/amd64/pmap.c
===================================================================
--- amd64/amd64/pmap.c  (revision 337342)
+++ amd64/amd64/pmap.c  (working copy)
@@ -2655,6 +2655,7 @@ pmap_pinit0(pmap_t pmap)
                        __pcpu[i].pc_ucr3 = PMAP_NO_CR3;
                }
        }
+       printf("pmap: %p / kernel_pmap: %p\n", pmap, kernel_pmap);
        PCPU_SET(curpmap, kernel_pmap);
        pmap_activate(curthread);
        CPU_FILL(&kernel_pmap->pm_active);

and this prints:

pmap: 0xffffffff820010b8 / kernel_pmap: 0xffffffff821394f8

Yes, the problem is that, I tried to mention it by stating 'kernel pmap vs. kernel process'. But I do not see what else does it imply.

I considered changing kernel_pmap to point to the vmspace0 pmap on amd64, but decided to not do it. For instance, I can imagine that we might want to use the kernel page table for kernel processes other than proc0.

More, I think that it is mostly pointless to try to enforce the invariant (pmap1 != pmap2) => (pmap1->pm_cr3 != pmap2->pm_cr3). It is siimple enough and not costly to handle the missed case of switching pmaps which use the same page tables.

In D16618#353258, @kib wrote:
In D16618#353173, @alc wrote:

Isn't the root cause of this problem that process 0 shares its page table with the kernel pmap, but it doesn't use the kernel pmap as its pmap, i.e., the process 0 vm_map doesn't use the kernel pmap as its pmap? Remember every vm_map has a pointer to a pmap, and this pointer needn't point to the pmap storage within the same vmspace.

Index: amd64/amd64/pmap.c
===================================================================
--- amd64/amd64/pmap.c  (revision 337342)
+++ amd64/amd64/pmap.c  (working copy)
@@ -2655,6 +2655,7 @@ pmap_pinit0(pmap_t pmap)
                        __pcpu[i].pc_ucr3 = PMAP_NO_CR3;
                }
        }
+       printf("pmap: %p / kernel_pmap: %p\n", pmap, kernel_pmap);
        PCPU_SET(curpmap, kernel_pmap);
        pmap_activate(curthread);
        CPU_FILL(&kernel_pmap->pm_active);

and this prints:

pmap: 0xffffffff820010b8 / kernel_pmap: 0xffffffff821394f8

Yes, the problem is that, I tried to mention it by stating 'kernel pmap vs. kernel process'. But I do not see what else does it imply.

I considered changing kernel_pmap to point to the vmspace0 pmap on amd64, but decided to not do it. For instance, I can imagine that we might want to use the kernel page table for kernel processes other than proc0.

I agree that we should not make kernel_pmap point to the vmspace0 pmap. At the same time, I don't think that we should ever set curpmap to kernel_pmap. Instead, we should initialize it to the vmspace0 pmap, since that pmap is actually bound to a thread. Then, this problem goes away.

Also, consider this snippet from pmap_pinit0()

PCPU_SET(curpmap, kernel_pmap);
pmap_activate(curthread);
CPU_FILL(&kernel_pmap->pm_active);

The CPU_FILL() is performed to undo the clearing of the boot CPU's bit by pmap_activate(), maintaining the invariant that the kernel pmap is always active everywhere. Now, consider this snippet from init_secondary().

pc->pc_curpmap = kernel_pmap;
pc->pc_pcid_gen = 1;
pc->pc_pcid_next = PMAP_PCID_KERN + 1;

The first context switch by a secondary CPU results in the CPU's bit being permanently cleared from the kernel pmap.

The following patch demonstrates this fact.

Index: amd64/amd64/pmap.c
===================================================================
--- amd64/amd64/pmap.c  (revision 337447)
+++ amd64/amd64/pmap.c  (working copy)
@@ -3058,6 +3058,16 @@ kvm_free(SYSCTL_HANDLER_ARGS)
 SYSCTL_PROC(_vm, OID_AUTO, kvm_free, CTLTYPE_LONG|CTLFLAG_RD, 
     0, 0, kvm_free, "LU", "Amount of KVM free");
 
+static int
+count(SYSCTL_HANDLER_ARGS)
+{
+       unsigned long kfree = CPU_COUNT(&kernel_pmap->pm_active);
+
+       return sysctl_handle_long(oidp, &kfree, 0, req);
+}
+SYSCTL_PROC(_debug, OID_AUTO, count, CTLTYPE_LONG|CTLFLAG_RD, 
+    0, 0, count, "LU", "");
+
 /*
  * grow the number of kernel page table entries, if needed
  */

And, running this sysctl on a 6-core machine yields:

# sysctl debug.count
debug.count: 250

(You might have expected this to be 251, not 250, but without your patch, we keep running the entirety of pmap_activate(), not updating curpmap, until we switch to a non-kernel thread, and so the boot CPU's bit gets cleared from the kernel pmap despite the earlier CPU_FILL().)

So, if we actually want to maintain the invariant that the kernel pmap has a filled active mask, then this seems like another argument for initializing curpmap to vmspace0's pmap.

More, I think that it is mostly pointless to try to enforce the invariant (pmap1 != pmap2) => (pmap1->pm_cr3 != pmap2->pm_cr3). It is siimple enough and not costly to handle the missed case of switching pmaps which use the same page tables.

Is there any other case besides the kernel and vmspace0 where the pmap's cr3's are the same? This is, in part, why we have a distinct pmap_pinit function for vmspace0, to set up the shared page table.

In D16618#354773, @alc wrote:.

I agree that we should not make kernel_pmap point to the vmspace0 pmap. At the same time, I don't think that we should ever set curpmap to kernel_pmap. Instead, we should initialize it to the vmspace0 pmap, since that pmap is actually bound to a thread. Then, this problem goes away.

My first attempt to fix the reported problem looked like this:

diff --git a/sys/amd64/amd64/pmap.c b/sys/amd64/amd64/pmap.c                    
index 572b2197453..f84f56b98e2 100644                                           
--- a/sys/amd64/amd64/pmap.c                                                    
+++ b/sys/amd64/amd64/pmap.c                                                    
@@ -2655,7 +2655,7 @@ pmap_pinit0(pmap_t pmap)                                  
                        __pcpu[i].pc_ucr3 = PMAP_NO_CR3;                        
                }
        }
-       PCPU_SET(curpmap, kernel_pmap);                                         
+       PCPU_SET(curpmap, pmap);                                                
        pmap_activate(curthread);                                               
        CPU_FILL(&kernel_pmap->pm_active);                                      
 }

It did not worked because of the cr3 comparison issue.

So after your note, I just re-add that snippet to the current patch.

Is there any other case besides the kernel and vmspace0 where the pmap's cr3's are the same? This is, in part, why we have a distinct pmap_pinit function for vmspace0, to set up the shared page table.

Might be there is no such situation anywhere else on amd64. But I am only chasing an inconsistency in the pmap_activate_sw() which must set curpmap in all modes PCID/non-PCID.

In D16618#354774, @kib wrote:
In D16618#354773, @alc wrote:.

I agree that we should not make kernel_pmap point to the vmspace0 pmap. At the same time, I don't think that we should ever set curpmap to kernel_pmap. Instead, we should initialize it to the vmspace0 pmap, since that pmap is actually bound to a thread. Then, this problem goes away.

My first attempt to fix the reported problem looked like this:

diff --git a/sys/amd64/amd64/pmap.c b/sys/amd64/amd64/pmap.c                    
index 572b2197453..f84f56b98e2 100644                                           
--- a/sys/amd64/amd64/pmap.c                                                    
+++ b/sys/amd64/amd64/pmap.c                                                    
@@ -2655,7 +2655,7 @@ pmap_pinit0(pmap_t pmap)                                  
                        __pcpu[i].pc_ucr3 = PMAP_NO_CR3;                        
                }
        }
-       PCPU_SET(curpmap, kernel_pmap);                                         
+       PCPU_SET(curpmap, pmap);                                                
        pmap_activate(curthread);                                               
        CPU_FILL(&kernel_pmap->pm_active);                                      
 }

It did not worked because of the cr3 comparison issue.

This is insufficient because init_secondary() is still initializing curpmap on non-boot processors to kernel_pmap. Please give the following a try.

Index: amd64/amd64/mp_machdep.c
===================================================================
--- amd64/amd64/mp_machdep.c    (revision 337447)
+++ amd64/amd64/mp_machdep.c    (working copy)
@@ -57,6 +57,7 @@ __FBSDID("$FreeBSD$");
 #include <vm/vm.h>
 #include <vm/vm_param.h>
 #include <vm/pmap.h>
+#include <vm/vm_map.h>
 #include <vm/vm_kern.h>
 #include <vm/vm_extern.h>
 
@@ -293,7 +294,8 @@ init_secondary(void)
        pc->pc_gs32p = &gdt[NGDT * cpu + GUGS32_SEL];
        pc->pc_ldt = (struct system_segment_descriptor *)&gdt[NGDT * cpu +
            GUSERLDT_SEL];
-       pc->pc_curpmap = kernel_pmap;
+       CPU_SET_ATOMIC(cpu, &vmspace0.vm_pmap.pm_active);
+       pc->pc_curpmap = &vmspace0.vm_pmap;
        pc->pc_pcid_gen = 1;
        pc->pc_pcid_next = PMAP_PCID_KERN + 1;
        common_tss[cpu].tss_rsp0 = 0;
Index: amd64/amd64/pmap.c
===================================================================
--- amd64/amd64/pmap.c  (revision 337447)
+++ amd64/amd64/pmap.c  (working copy)
@@ -2634,6 +2634,7 @@ pmap_unuse_pt(pmap_t pmap, vm_offset_t va, pd_entr
 void
 pmap_pinit0(pmap_t pmap)
 {
+       u_int cpuid;
        int i;
 
        PMAP_LOCK_INIT(pmap);
@@ -2644,6 +2645,8 @@ pmap_pinit0(pmap_t pmap)
        pmap->pm_ucr3 = PMAP_NO_CR3;
        pmap->pm_root.rt_root = 0;
        CPU_ZERO(&pmap->pm_active);
+       cpuid = PCPU_GET(cpuid);
+       CPU_SET(cpuid, &pmap->pm_active);
        TAILQ_INIT(&pmap->pm_pvchunk);
        bzero(&pmap->pm_stats, sizeof pmap->pm_stats);
        pmap->pm_flags = pmap_flags;
@@ -2655,9 +2658,7 @@ pmap_pinit0(pmap_t pmap)
                        __pcpu[i].pc_ucr3 = PMAP_NO_CR3;
                }
        }
-       PCPU_SET(curpmap, kernel_pmap);
-       pmap_activate(curthread);
-       CPU_FILL(&kernel_pmap->pm_active);
+       PCPU_SET(curpmap, pmap);
 }
 
 void
@@ -7536,7 +7537,8 @@ pmap_activate_sw(struct thread *td)
                        PCPU_SET(kcr3, pmap->pm_cr3);
                        PCPU_SET(ucr3, pmap->pm_ucr3);
                }
-       }
+       } else
+               KASSERT(pmap == oldpmap, ("shared cr3"));
        if (pmap->pm_ucr3 != PMAP_NO_CR3) {
                rsp0 = ((vm_offset_t)PCPU_PTR(pti_stack) +
                    PC_PTI_STACK_SZ * sizeof(uint64_t)) & ~0xful;
In D16618#354775, @alc wrote:
In D16618#354774, @kib wrote:
}
  • PCPU_SET(curpmap, kernel_pmap);

+ PCPU_SET(curpmap, pmap);

pmap_activate(curthread);                                               
CPU_FILL(&kernel_pmap->pm_active);

}

It did not worked because of the cr3 comparison issue.

This is insufficient because init_secondary() is still initializing curpmap on non-boot processors to kernel_pmap. Please give the following a try.

Yes. There is even more dumb reason why it does not work, because pmap == curthread->pmap when pmap_activate() is called, and the first check in pmap_activate_sw() does nothing if old and new pmaps are same.

Index: amd64/amd64/mp_machdep.c
===================================================================
--- amd64/amd64/mp_machdep.c    (revision 337447)
+++ amd64/amd64/mp_machdep.c    (working copy)
@@ -293,7 +294,8 @@ init_secondary(void)
        pc->pc_gs32p = &gdt[NGDT * cpu + GUGS32_SEL];
        pc->pc_ldt = (struct system_segment_descriptor *)&gdt[NGDT * cpu +
            GUSERLDT_SEL];
-       pc->pc_curpmap = kernel_pmap;
+       CPU_SET_ATOMIC(cpu, &vmspace0.vm_pmap.pm_active);

This cannot work this early because proc0 is initialized much later than APs are fired: SI_SUB_CPU vs. SI_SUB_INTRINSIC. See my updated patch.

@@ -7536,7 +7537,8 @@ pmap_activate_sw(struct thread *td)

        PCPU_SET(kcr3, pmap->pm_cr3);
        PCPU_SET(ucr3, pmap->pm_ucr3);
}
  • }

+ } else
+ KASSERT(pmap == oldpmap, ("shared cr3"));

We check that pmap != oldpmap, so this can be just panic.

I still want this case to be consistent and just set curpmap instead.

Activate vmspace0.vm_pmap instead of kernel_pmap on boot, using new helper pmap_activate_boot().

This revision now requires review to proceed.Aug 12 2018, 10:53 PM
sys/amd64/amd64/pmap.c
2659 ↗(On Diff #46608)

This line is now redundant.

kib marked an inline comment as done.

Remove re-fill of kernel_pmap->pm_active, pmap_bootstrap() already did it and patch should make it constant.

In D16618#354785, @kib wrote:
In D16618#354775, @alc wrote:

@@ -7536,7 +7537,8 @@ pmap_activate_sw(struct thread *td)

        PCPU_SET(kcr3, pmap->pm_cr3);
        PCPU_SET(ucr3, pmap->pm_ucr3);
}
  • }

+ } else
+ KASSERT(pmap == oldpmap, ("shared cr3"));

We check that pmap != oldpmap, so this can be just panic.

I still want this case to be consistent and just set curpmap instead.

Or, the previous block

else if (cr3 != pmap->pm_cr3)

should become just else if because there is no point in checking cr3. We expect that cr3's are always different, and if they happen to be the same, we would only do some extra work for !PCID case.

Stop comparing cr3 on !PCID switch.

As a follow up, I would suggest porting this change to i386, and moving the pmap_activate_boot() call into init_secondary_tail().

sys/amd64/amd64/pmap.c
7564 ↗(On Diff #46617)

The reasons not to set curpmap to the kernel pmap aren't obvious, so I would suggest a KASSERT:

KASSERT(pmap != kernel_pmap, ("curpmap can't be kermel_pmap"));
This revision is now accepted and ready to land.Aug 14 2018, 3:55 PM

There is a nearby typo in pmap_activate_sw():

* happends after curpmap is updated.  Protect other
This revision now requires review to proceed.Aug 14 2018, 4:18 PM
This revision is now accepted and ready to land.Aug 14 2018, 4:21 PM
This revision was automatically updated to reflect the committed changes.