
Map constant zero page on read faults which touch non-existing anon page.
Abandoned · Public

Authored by kib on Apr 13 2018, 11:27 AM.

Details

Reviewers
jeff
markj
alc
Summary

Only x86 pmaps are handled in the initial version for discussion.

Diff Detail

Lint: Skipped
Unit Tests: Skipped

Event Timeline

sys/vm/vm_fault.c
773

"nameless" and "anonymous" are synonyms, so this seems a bit redundant.

Should we also verify fs.object->handle == NULL?

775

"a transient"

785

If a direct map is available, a nice diagnostic check would be to verify that the page's contents really are 0.
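
For illustration, a minimal sketch of the suggested diagnostic, assuming the amd64 direct map (PHYS_TO_DMAP) and the zero_page name used in the patch further below; this is not part of the proposed diff:

#if defined(INVARIANTS) && defined(PHYS_TO_DMAP)
	/* Verify that the shared zero page really contains only zeroes. */
	uint64_t *zp;
	u_int i;

	zp = (uint64_t *)PHYS_TO_DMAP(VM_PAGE_TO_PHYS(zero_page));
	for (i = 0; i < PAGE_SIZE / sizeof(*zp); i++)
		KASSERT(zp[i] == 0,
		    ("zero_page has non-zero data at offset %u", i));
#endif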

kib marked 2 inline comments as done. Apr 27 2018, 3:40 PM
kib added inline comments.
sys/vm/vm_fault.c
773

A named anonymous object is an anon object that has an external identity; typically its handle is non-NULL, but this is not required.

I believe that checking the OBJ_NOSPLIT flag is enough, and is even better than checking the handle, since in principle objects with NULL handles can still be nosplit. True anonymous memory, which is the only kind that can be optimized this way, is splitable.

Mark's notes:

  • Edit the comment.
  • Verify that zero_page is not dirtied (see the sketch below).
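
A sketch of the second item, assuming the zero_page name from the patch and the standard vm_page dirty/valid fields:

	KASSERT(zero_page->dirty == 0 && zero_page->valid == VM_PAGE_BITS_ALL,
	    ("zero_page %p is dirty or not fully valid", zero_page));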
sys/vm/vm_fault.c
773

I agree with Mark's comment about the usage of "nameless" and "anonymous" here. I can't recall us using a construction like this elsewhere in the comments. I would call it either an "anonymous default object" or an "unnamed default object".

When we are talking about objects, I would not equate "default" with "anonymous". For example, OBJT_SWAP can also be anonymous. Or, consider uipc_shm.c, which creates a "named default object". "default" really tells us that the object has no associated secondary storage (or other "backing", like a device), not whether it is named.

Update comment: s/anon/default/

sys/vm/vm_fault.c
767–771

Couldn't this be handled earlier in vm_fault_soft_fast() when "m == NULL"?

Move the optimization to vm_fault_soft_fast().
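
A rough sketch of how the short-circuit could look in vm_fault_soft_fast() when no page is resident; this is not the actual diff — the eligibility test and zero_page follow the discussion in this review, and the surrounding names (fs, vaddr, fault_type, rv) are assumed to match that function:

	m = vm_page_lookup(fs->first_object, fs->first_pindex);
	if (m == NULL) {
		/*
		 * Sketch: read fault on a plain, splitable anonymous
		 * object with no resident page; map the shared zero
		 * page read-only instead of failing to the slow path.
		 */
		if ((fault_type & VM_PROT_WRITE) == 0 &&
		    fs->first_object->type == OBJT_DEFAULT &&
		    (fs->first_object->flags & OBJ_NOSPLIT) == 0) {
			rv = pmap_enter(fs->map->pmap, vaddr, zero_page,
			    VM_PROT_READ, VM_PROT_READ | PMAP_ENTER_NOSLEEP,
			    0);
			if (rv == KERN_SUCCESS)
				return (KERN_SUCCESS);
		}
		return (KERN_FAILURE);
	}

Because the mapping is read-only, a later write to the region still faults and allocates a real page through the normal path.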

Has anyone actually measured how often this optimization gets triggered? I'm just curious.

sys/vm/vm_fault.c
308

Given that the object is read locked, not write locked, does this helper function actually correctly unlock the object?

In D15055#320924, @alc wrote:

Has anyone actually measured how often this optimization gets triggered? I'm just curious.

Even a plain multiuser boot triggers this code several times; I was sloppy with the testing of the last version.

The optimization was requested by Jeff for a very specific benchmark, since Linux does the same trick and apparently FreeBSD loses a lot by not doing it. See also the related D14917.
I think actual numbers will be provided when Jeff returns.

sys/vm/vm_fault.c
773

Ok.

Correct exit after the zero_page installation.

In D15055#320927, @kib wrote:

Even a plain multiuser boot triggers this code several times; I was sloppy with the testing of the last version.

In hindsight, the question that I should have asked is "How often does pmap_remove() encounter the zero page in the page table?" pmap_remove_pages() won't encounter the zero page because it's not mapped as a managed mapping. For "normal", i.e., writeable, virtual memory, I fear that this change is a pessimization. Without this change, on first touch, regardless of whether the access is a write, we will allocate a physical page and map it for write access. And so, this change would only increase the number of page faults. Moreover, in a multithreaded program, those page faults are going to have to perform a TLB shootdown, because we're changing the physical page being mapped. The cost of these additional page faults would have to be outweighed by the savings in the cases where pmap_remove() encountered a mapping to the zero page.

That said, I can see a variant of this change being an optimization for a more restricted set of cases, e.g., a read-only mapping of a file.

The optimization was requested by Jeff for a very specific benchmark, since Linux does the same trick and apparently FreeBSD loses a lot by not doing it. See also the related D14917.
I think actual numbers will be provided when Jeff returns.

In D15055#320936, @alc wrote:

In hindsight, the question that I should have asked is "How often does pmap_remove() encounter the zero page in the page table?"

I applied the following patch:

diff --git a/sys/amd64/amd64/pmap.c b/sys/amd64/amd64/pmap.c
index 86ed13ecb91..5085f5d2607 100644
--- a/sys/amd64/amd64/pmap.c
+++ b/sys/amd64/amd64/pmap.c
@@ -4037,6 +4037,10 @@ pmap_remove_pde(pmap_t pmap, pd_entry_t *pdq, vm_offset_t sva,
 	return (pmap_unuse_pt(pmap, sva, *pmap_pdpe(pmap, sva), free));
 }
 
+static int xxx, yyy;
+SYSCTL_INT(_vm_pmap, OID_AUTO, xxx, CTLFLAG_RD, &xxx, 0, "");
+SYSCTL_INT(_vm_pmap, OID_AUTO, yyy, CTLFLAG_RD, &yyy, 0, "");
+
 /*
  * pmap_remove_pte: do the things to unmap a page in a process
  */
@@ -4056,6 +4060,7 @@ pmap_remove_pte(pmap_t pmap, pt_entry_t *ptq, vm_offset_t va,
 	oldpte = pte_load_clear(ptq);
 	if (oldpte & PG_W)
 		pmap->pm_stats.wired_count -= 1;
+if ((oldpte & PG_FRAME) == VM_PAGE_TO_PHYS(zero_page)) atomic_add_int(&xxx, 1);
 	pmap_resident_count_dec(pmap, 1);
 	if (oldpte & PG_MANAGED) {
 		m = PHYS_TO_VM_PAGE(oldpte & PG_FRAME);
@@ -4880,6 +4885,7 @@ pmap_enter(pmap_t pmap, vm_offset_t va, vm_page_t m, vm_prot_t prot,
 		origpte = pte_load_store(pte, newpte);
 		opa = origpte & PG_FRAME;
 		if (opa != pa) {
+if (opa == VM_PAGE_TO_PHYS(zero_page)) atomic_add_int(&yyy, 1);
 			if ((origpte & PG_MANAGED) != 0) {
 				om = PHYS_TO_VM_PAGE(opa);
 				if ((origpte & (PG_M | PG_RW)) == (PG_M |

and a multiuser boot shows the following numbers:

vm.pmap.xxx: 396
vm.pmap.yyy: 3423

Can you try a "buildworld" with the counters in place?

In D15055#320951, @alc wrote:

Can you try a "buildworld" with the counters in place?

Before the build:

vm.pmap.yyy: 4074
vm.pmap.xxx: 442

After the build:

vm.pmap.yyy: 9118675
vm.pmap.xxx: 526736

In D15055#320936, @alc wrote:

That said, I can see a variant of this change being an optimization for a more restricted set of cases, e.g., a read-only mapping of a file.

In D15055#320927, @kib wrote:

The optimization was requested by Jeff for a very specific benchmark, since Linux does the same trick and apparently FreeBSD loses a lot by not doing it. See also the related D14917.
I think actual numbers will be provided when Jeff returns.

I can see about finding a specific benchmark. It's actually more of a memory optimization than a performance optimization. Apparently there are many programs that rely on allocating a large anonymous region to manage a tree or hash and then sparsely populating it. They are using the very large vm space rather than manually handling discontiguous memory regions.

I agree that the extra faults are undesirable. Could we pre-populate unbacked anonymous memory on mmap to avoid the first fault? Filling a range of PTEs with a zero page should be very cheap. Are there perhaps other interesting heuristics we could use to limit the scope? Perhaps we should consider some macro-benchmarks to see how much it slows down a multi-threaded program that isn't making good use of the feature? Any buildworld times yet?

In D15055#320998, @jeff wrote:

I agree that the extra faults are undesirable. Could we pre-populate unbacked anonymous memory on mmap to avoid the first fault? Filling a range of PTEs with a zero page should be very cheap. Are there perhaps other interesting heuristics we could use to limit the scope? Perhaps we should consider some macro-benchmarks to see how much it slows down a multi-threaded program that isn't making good use of the feature? Any buildworld times yet?

I do not think that pre-populating the anon map on mmap(2) is feasible. For large sparse mappings we would still wire a significant amount of memory just for the page tables, and spend a lot of time filling them.

I demonstrated the behaviour of the patch under a buildworld load, but did not measure the time. I think what happens there is reads of implicitly initialized variables in the bss section before they are written to.

It might be reasonable to pre-populate zero PTEs ahead of and behind the current fault, limited to the current page table page, i.e. stopping at the first valid page, the mapping limit, or a superpage boundary. I doubt that it would be of much help for e.g. buildworld, but it should help more for the specific benchmarks.
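
A hypothetical sketch of that clustering, written against the sys/vm/vm_fault.c context; the helper name, the NBPDR-sized window, and the use of pmap_extract()/pmap_enter() are illustrative assumptions, and zero_page is the page introduced by the patch above:

/*
 * Hypothetical: after mapping zero_page at vaddr, also map it into the
 * neighbouring invalid PTEs, stopping at the first valid mapping and
 * staying within the map entry and the current 2 MB page-table page.
 */
static void
zero_page_prefill(pmap_t pmap, vm_map_entry_t entry, vm_offset_t vaddr)
{
	vm_offset_t end, start, va;

	start = MAX(rounddown2(vaddr, NBPDR), entry->start);
	end = MIN(roundup2(vaddr + PAGE_SIZE, NBPDR), entry->end);

	/* Forward from the faulting page. */
	for (va = vaddr + PAGE_SIZE; va < end; va += PAGE_SIZE) {
		if (pmap_extract(pmap, va) != 0)
			break;
		(void)pmap_enter(pmap, va, zero_page, VM_PROT_READ,
		    VM_PROT_READ | PMAP_ENTER_NOSLEEP, 0);
	}
	/* Backward, likewise. */
	for (va = vaddr; va > start;) {
		va -= PAGE_SIZE;
		if (pmap_extract(pmap, va) != 0)
			break;
		(void)pmap_enter(pmap, va, zero_page, VM_PROT_READ,
		    VM_PROT_READ | PMAP_ENTER_NOSLEEP, 0);
	}
}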

In D15055#320998, @jeff wrote:

I can see about finding a specific benchmark. It's actually more of a memory optimization than a performance optimization. Apparently there are many programs that rely on allocating a large anonymous region to manage a tree or hash and then sparsely populating it. They are using the very large vm space rather than manually handling discontiguous memory regions.

See http://yarchive.net/comp/linux/ZERO_PAGE.html and https://lwn.net/Articles/515526/ for some background. Kostik, could I talk you into downloading the NAS parallel benchmarks from https://www.nas.nasa.gov/publications/npb.html and seeing what the "xxx" and "yyy" counts are for FT? It will require a Fortran compiler.

To be clear, running the serial version of the benchmark would suffice.

In D15055#321321, @alc wrote:

To be clear, running the serial version of the benchmark would suffice.

Is there any tuning that needs to be done?

vm.pmap.yyy: 68
vm.pmap.xxx: 4
# LD_LIBRARY_PATH=. ./ft.C.x 


 NAS Parallel Benchmarks (NPB3.3-SER) - FT Benchmark

 Size                :  512x 512x 512
 Iterations          :             20

 T =    1     Checksum =    5.195078707457D+02    5.149019699238D+02
 T =    2     Checksum =    5.155422171134D+02    5.127578201997D+02
 T =    3     Checksum =    5.144678022222D+02    5.122251847514D+02
 T =    4     Checksum =    5.140150594328D+02    5.121090289018D+02
 T =    5     Checksum =    5.137550426810D+02    5.121143685824D+02
 T =    6     Checksum =    5.135811056728D+02    5.121496764568D+02
 T =    7     Checksum =    5.134569343165D+02    5.121870921893D+02
 T =    8     Checksum =    5.133651975661D+02    5.122193250322D+02
 T =    9     Checksum =    5.132955192805D+02    5.122454735794D+02
 T =   10     Checksum =    5.132410471738D+02    5.122663649603D+02
 T =   11     Checksum =    5.131971141679D+02    5.122830879827D+02
 T =   12     Checksum =    5.131605205716D+02    5.122965869718D+02
 T =   13     Checksum =    5.131290734194D+02    5.123075927445D+02
 T =   14     Checksum =    5.131012720314D+02    5.123166486553D+02
 T =   15     Checksum =    5.130760908195D+02    5.123241541685D+02
 T =   16     Checksum =    5.130528295923D+02    5.123304037599D+02
 T =   17     Checksum =    5.130310107773D+02    5.123356167976D+02
 T =   18     Checksum =    5.130103090133D+02    5.123399592211D+02
 T =   19     Checksum =    5.129905029333D+02    5.123435588985D+02
 T =   20     Checksum =    5.129714421109D+02    5.123465164008D+02
 Verification test for FT successful


 FT Benchmark Completed.
 Class           =                        C
 Size            =            512x 512x 512
 Iterations      =                       20
 Time in seconds =                   382.44
 Mop/s total     =                  1036.47
 Operation type  =           floating point
 Verification    =               SUCCESSFUL
 Version         =                    3.3.1
 Compile date    =              30 Apr 2018

 Compile options:
    F77          = gfortran
    FLINK        = $(F77)
    F_LIB        = (none)
    F_INC        = (none)
    FFLAGS       = -O -mcmodel=large
    FLINKFLAGS   = -O -Wl,-rpath,/usr/local/opt/gcc-7.3.0/lib
    RAND         = randi8


 Please send all errors/feedbacks to:

 NPB Development Team
 npb@nas.nasa.gov

vm.pmap.yyy: 89
vm.pmap.xxx: 6
In D15055#321332, @kib wrote:

Is there any tuning that needs to be done?

No. This application is allocating at least 1 GB of memory, and the "xxx" and "yyy" counts are miniscule. Clearly, the first touch to virtually all of the data is a write access.

However, the second link that I cited talked about the class A problem size. Can you please try running that instead?

In D15055#322979, @alc wrote:

No. This application is allocating at least 1 GB of memory, and the "xxx" and "yyy" counts are miniscule. Clearly, the first touch to virtually all of the data is a write access.

However, the second link that I cited talked about the class A problem size. Can you please try running that instead?

In fact I started with ft.A.x when I did the testing, but it was even less interesting than ft.C.x: the counters' increments were about 1 or 2. This is why I changed to class C and also asked about tuning.

I can re-test but I do not see the point.

In D15055#323469, @kib wrote:

In fact I started with ft.A.x when I did the testing, but it was even less interesting than ft.C.x: the counters' increments were about 1 or 2. This is why I changed to class C and also asked about tuning.

I can re-test but I do not see the point.

I agree.

In D15055#324256, @alc wrote:

I agree.

So the consensus is this doesn't actually significantly optimize anything we can find? It certainly isn't worth extra faults in that case.

In D15055#324434, @jeff wrote:

So the consensus is this doesn't actually significantly optimize anything we can find? It certainly isn't worth extra faults in that case.

I am not sure about the consensus. You do have a test where this change helps, right?

From the links that Alan posted, it seems that Linux also had resistance to getting a similar patch accepted, but ultimately they merged it. Maybe we should add a knob to enable zero-page mapping. Aside from the required changes to non-x86 pmaps, it is rather easy to handle.

In D15055#324440, @kib wrote:

I am not sure about the consensus. You do have a test where this change helps, right?

From the links that Alan posted, it seems that Linux also had resistance to getting a similar patch accepted, but ultimately they merged it. Maybe we should add a knob to enable zero-page mapping. Aside from the required changes to non-x86 pmaps, it is rather easy to handle.

In regards to the 4KB zero page, the situation was actually the reverse. People were proposing the removal of the zero-page "optimization", arguing that it was not an effective optimization. Essentially, they had done similar measurements to ours. But, Linux has had this "optimization" from the beginning, before it had multiprocessor support (and the need for TLB shootdown). Linus resisted the proposal citing unspecified Fortran programs as the reason for keeping it.

A 2MB zero page was added to Linux based on claims that it reduced physical memory usage in one particular NAS benchmark (written in Fortran). However, Kostik tried this same program on FreeBSD with this patch and found no evidence that there was any physical memory savings.

I'll add that one of my graduate students has been doing tests with the NAS benchmarks, specifically, the variants for shared-memory multiprocessors using OpenMP. To their credit, the NAS benchmarks parallelize the initialization phase as well as the core computation. (Also, the parallelization of the initialization typically aligns well with the parallelization of the core computation, so a first-touch NUMA policy outperforms interleave.) So, any usage of the zero-page "optimization" where the page is written to and replaced by a physical page is going to require a system-wide TLB shootdown because the initialization is being done in parallel across the entire machine.

Clearly, one can write a program that uses less physical memory with the zero-page "optimization", but I don't think that we're going to come across a non-trivial number of real-world programs where that is the case. And, more often we're going to see cases where implementing the zero page is going to add run-time overhead. I would say that any knob enabling it should be per-process, perhaps an madvise option or mmap flag.

In D15055#364225, @alc wrote:

Clearly, one can write a program that uses less physical memory with the zero-page "optimization", but I don't think that we're going to come across a non-trivial number of real-world programs where that is the case. And, more often we're going to see cases where implementing the zero page is going to add run-time overhead. I would say that any knob enabling it should be per-process, perhaps an madvise option or mmap flag.

I do not think that it makes sense to add a knob that requires the application to explicitly enable it, both because I do not believe that any app would bother with a FreeBSD-specific optimization, and because it adds delicate code to handle a never-used mode of operation.

In D15055#364249, @kib wrote:

I do not think that it makes sense to add a knob that requires the application to explicitly enable it, both because I do not believe that any app would bother with a FreeBSD-specific optimization, and because it adds delicate code to handle a never-used mode of operation.

Then, I would just abandon it.