Page MenuHomeFreeBSD

amd64 pmap: convert to counter(9), add PV and pagetable page counts

Authored by jah on Feb 25 2021, 5:37 PM.



This change converts most of the counters in the amd64 pmap from
either handrolled per-CPU counters or global atomics to scalable
counter(9) counters.

The bulk of these counters remain guarded by PV_STATS, as it seems
unlikely that they will be useful outside of very specific debugging
scenarios. However, this change does add two new counters that
that are available without PV_STATS. pt_page_count and pv_page_count
track the number of active physical-to-virtual list pages and page
table pages, respectively. These will be useful in evaluating
the memory footprint of pmap structures under various workloads,
which will help to guide future changes in this area.

Diff Detail

rG FreeBSD src repository
Automatic diff as part of commit; lint not applicable.
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

jah requested review of this revision.Feb 25 2021, 5:37 PM

Removing the handrolled pc_pm_save_cnt seemed like a useful bit of cleanup. But I realize this also replaces an incq instruction on the pcpu region with a (larger) addq instruction on zpcpu; the PCID save count is also updated pretty frequently since it's managed on the context switch path. I'm fine reverting this part of the change if anyone objects to it.


It's trivial to patch macros to use incq instead of addq automatically, I did not do it because it did not seem worth it from looking Agner Fog's instruction tables.

The real problem with counter(9) is that even counters which are known to be there at compilation time require fetching something to find the address.

There should be a facility to sort them out by storing everything in struct pcpu.


I suspect that I was the only user for this counter, and I would be fine with removing it altogether. Last time it was useful was when I split pmap_activate_sw() into ifuncs, to more easily see that about right ifunc was selected.

But I also do not object against keeping it around.


TLB flushes


Does it make sense to convert these to counters, even under PV_STATS? The event is relatively rare, and when it happens, we already lost a race.

I am not opposing, but curious if there some visible reason.


Promotions/demotions are quire rare.


I'm fine with removing it if you are. On a related note, it looks like this is the only remaining in-tree use of PCPU_INC; would it be worth removing PCPU_INC entirely in a separate change? Non-x86 architectures just implement it as PCPU_ADD(..., 1) anyway.


Not really; I wasn't sure how frequently these might be updated under load, and I didn't see a downside to making them counter(9) since they would still be guarded by PV_STATS anyway.


That's a little bit surprising to me. I'm not surprised that promotions are rare, but I would've expected that 2M->4K demotions might happen frequently under certain workloads, such as heavy forking of processes with large CoW data segments.


This is on my relatively small 32G w/s, stable/12, uptime 70 days, load is poudriere + buildworld/buildkernel + emacs/browsers:

vm.pmap.pde.promotions: 52283
vm.pmap.pde.p_failures: 16395
vm.pmap.pde.mappings: 78953
vm.pmap.pde.demotions: 20526

On my workstation, also with 32GB of RAM and with 1 day of uptime:

vm.pmap.pde.promotions: 87908
vm.pmap.pde.p_failures: 75207
vm.pmap.pde.mappings: 40585
vm.pmap.pde.demotions: 29985

Once physical memory is fragmented enough these counters are effectively frozen.

For now I suspect that converting these to counter(9) will not benefit anything but I see no problem with it. If we gain some mechanisms for online defragmentation and speculative promotions, then these counters might be updated more frequently.


Rather than annotating each vm_page_alloc() call, I would suggest adding pmap_alloc_ptp() and pmap_free_ptp() which wrap vm_page_alloc() and vm_page_unwire_noq()/vm_page_free(), respectively.


Yeah, that's a good idea.


Need to only do this if m != NULL. I noticed that we don't pass VM_ALLOC_WIRED here, is that intentional?
I can't imagine a pagetable page ever not being wired, but the large map seems to be a static map that's initialized early, so I'm wondering if lack of VM_ALLOC_WIRED is some artifact of that.


There are different meaning for 'wired' there. First, page table pages are unmanaged, so pagedaemon never processes them and m->ref_count is free for other uses. Unmanaged property is what you probably mean by saying that page table page cannot be not wired.

Second, we systematically use the ref_count of the page table page to track number of non-zero pte's in it, and most of the pmap code frees the page table page when ref_count goes to zero. In principle, this is true for the large map, with the caveat that we never free PDP pages because corresponding PML4 entries needs to be stable, they are copied into each pmap' pml4 page. [This can be improved for LA57, but kernel pretends that it is in LA48 regardless of the userspace page tables]

So VM_ALLOC_WIRED is not used for large map page table handling code only because it is more convenient to manually increment ref_count where appropriate.

  • Remove PCID save count, factor out PT page allocation/freeing

Can you split out pmap_alloc_pt_page() change from the other parts.

I think that the other parts would be fine as is. For pmap_alloc_pt_page(), it seems useful to modify it further. If you pass pmap to alloc_pt_page(), you can do two things: 1. move related resident count updates there 2. use separate counters for all user pmaps vs kernel pmap.


Stray new line.

  • Revert "Remove PCID save count, factor out PT page allocation/freeing"
  • Fix NULL check in pmap_large_map_getptp_unlocked()

I probably should have been more explicit, I proposed to drop the pt_page_count at all. But I hope that you will follow up with pmap_pt_page_alloc() patch so this does not matter much.

This revision is now accepted and ready to land.Mar 3 2021, 5:51 AM
In D28923#650048, @kib wrote:

I probably should have been more explicit, I proposed to drop the pt_page_count at all. But I hope that you will follow up with pmap_pt_page_alloc() patch so this does not matter much.

It won't matter; I'll shortly post a review that brings back an expanded version of pmap_alloc_pt_page(). Doing things this way just made it easier for me to manage the branches.