This change shows that 75% of the iterations performed within vm_fault_prefault() are pointless and could be eliminated by a two-line change to how we handle copy-on-write faults.
Details
Here are some results for a parallel buildworld.
# /tmp/worldmark_j7_HEAD_counters
Sat Sep 29 18:56:23 CDT 2018
debug.counters.negmap: 18
debug.counters.nonneg: 35826
debug.counters.negati: 114882
vm.pmap.pde.promotions: 152
vm.pmap.pde.p_failures: 19
vm.pmap.pde.mappings: 3
vm.pmap.pde.demotions: 21
vm.reserv.reclaimed: 0
vm.reserv.partpopq:
DOMAIN    LEVEL     SIZE  NUMBER
     0,      -1, 211548K,    114
vm.reserv.fullpop: 102
vm.reserv.freed: 2335
vm.reserv.broken: 0
18546.859u 1062.556s 57:26.81 568.9% 52927+3525k 52421+80938io 15365pf+0w

Sat Sep 29 19:53:50 CDT 2018
debug.counters.negmap: 90
debug.counters.nonneg: 6555245
debug.counters.negati: 19885568
vm.pmap.pde.promotions: 63267
vm.pmap.pde.p_failures: 9182
vm.pmap.pde.mappings: 186495
vm.pmap.pde.demotions: 4827
vm.reserv.reclaimed: 0
vm.reserv.partpopq:
DOMAIN    LEVEL     SIZE  NUMBER
     0,      -1, 250360K,    163
vm.reserv.fullpop: 114
vm.reserv.freed: 1241731
vm.reserv.broken: 0
18542.221u 1067.328s 57:14.07 571.0% 52915+3525k 5267+80880io 0pf+0w

Sat Sep 29 20:51:04 CDT 2018
debug.counters.negmap: 90
debug.counters.nonneg: 12969217
debug.counters.negati: 39538213
vm.pmap.pde.promotions: 126439
vm.pmap.pde.p_failures: 18223
vm.pmap.pde.mappings: 373031
vm.pmap.pde.demotions: 9730
vm.reserv.reclaimed: 0
vm.reserv.partpopq:
DOMAIN    LEVEL     SIZE  NUMBER
     0,      -1, 250308K,    162
vm.reserv.fullpop: 115
vm.reserv.freed: 2477894
vm.reserv.broken: 0
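The diagnostic patch itself is not attached to this review, so the snippet below is only a rough sketch, under assumptions, of how the debug.counters sysctls in the output above could be wired up; the counter names come from the output, but their meanings are inferred rather than documented here. Reading negati as prefault iterations performed for soft copy-on-write faults and nonneg as iterations for all other faults, the last sample gives 39538213 / (39538213 + 12969217) ≈ 75%, which is where the figure in the summary comes from.

#include <sys/param.h>
#include <sys/kernel.h>
#include <sys/sysctl.h>

/*
 * Sketch only: a "debug.counters" node backed by plain unsigned longs.
 * The real diagnostic patch is not shown in this review; it may well use
 * counter(9) instead, and the descriptions below are inferred guesses.
 */
static SYSCTL_NODE(_debug, OID_AUTO, counters, CTLFLAG_RD, 0,
    "vm_fault_prefault() diagnostic counters");

static u_long prefault_negmap, prefault_nonneg, prefault_negati;
SYSCTL_ULONG(_debug_counters, OID_AUTO, negmap, CTLFLAG_RD,
    &prefault_negmap, 0, "mappings created while prefaulting soft COW faults");
SYSCTL_ULONG(_debug_counters, OID_AUTO, nonneg, CTLFLAG_RD,
    &prefault_nonneg, 0, "prefault iterations for all other faults");
SYSCTL_ULONG(_debug_counters, OID_AUTO, negati, CTLFLAG_RD,
    &prefault_negati, 0, "prefault iterations for soft COW faults");

/*
 * In the diagnostic patch the counters would be bumped from vm_fault() /
 * vm_fault_prefault(), e.g. atomic_add_long(&prefault_negati, 1).
 */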
Event Timeline
vm/vm_fault.c:1184–1186

The committed change would just be this one snippet with -1 changed to 1.
vm/vm_fault.c:1184–1186

To be clear, the point of the faultcount == 0 test is to distinguish between faults that are simultaneously hard faults and copy-on-write faults, and those that are simply copy-on-write faults. In the former case, we still want to prefault the surrounding virtual pages that aren't being written to. I would add a comment explaining this as part of the committed change.
vm/vm_fault.c:1295

I am confused. Doesn't setting faultcount to -1 change the ahead/behind parameters passed to vm_fault_prefault()? I.e., your current patch is only aimed at counting hard copy-on-write faults, not at measuring timings with the unneeded prefaults eliminated, right?
vm/vm_fault.c:1295

No, changing zero to minus one has no effect on the ahead and behind values passed to vm_fault_prefault(); they are going to be PFFOR and PFBAK in either case. Notice that I'm only changing faultcount when its value was still zero at the point where we performed the copy. If the copy-on-write fault was also a hard fault, then faultcount would have already been updated to a non-zero value.

Essentially, these results argue that soft copy-on-write faults shouldn't call vm_fault_prefault(). Faults of this type currently account for 75% of the work done in vm_fault_prefault(), yet they create almost no mappings; the mappings were already created by prefaulting of resident pages at execve time or by earlier hard faults.
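For readers following along, here is a paraphrased sketch of the prefault decision near the end of vm_fault() that this comment refers to; it is not the verbatim source, and argument details may differ by revision. Any non-positive faultcount falls back to the fixed PFBAK/PFFOR window, so 0 and -1 behave identically here, while faultcount == 1, the value the proposed change assigns to soft copy-on-write faults, suppresses the prefault call entirely.

	/*
	 * Paraphrased sketch of the prefault decision in vm_fault().
	 * With faultcount <= 0 the defaults PFBAK/PFFOR are used, so 0
	 * and -1 are interchangeable here; only faultcount == 1 skips
	 * vm_fault_prefault() altogether.
	 */
	if (faultcount != 1 && (fault_flags & VM_FAULT_WIRE) == 0 &&
	    wired == 0)
		vm_fault_prefault(&fs, vaddr,
		    faultcount > 0 ? behind : PFBAK,
		    faultcount > 0 ? ahead : PFFOR);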
Could you guys try running with this change for a while so that we can see if other workloads actually create more useful mappings on soft, copy-on-write faults?
Peter, could you please test this patch? It is actually a diagnostic patch, not a fix, so I'm interested in seeing the output of "sysctl debug.counters" at the end of your tests.
After ~20 minutes of browsing and doing some work in a few terminals (but no builds), I see:
debug.counters.negmap: 6688
debug.counters.nonneg: 7201693
debug.counters.negati: 1053344
After running a "make index" in the ports tree (which spawns many long pipelines), I see:
debug.counters.negmap: 6985
debug.counters.nonneg: 10848239
debug.counters.negati: 68424347
I ran tests on two different hosts for 11 hours and got:
Mon Oct 1 17:17:03 CEST 2018
debug.counters.negmap: 11038668
debug.counters.nonneg: 3034528887
debug.counters.negati: 10814460156
vm.pmap.pde.promotions: 4172630
vm.pmap.pde.p_failures: 5155243
vm.pmap.pde.mappings: 15384
vm.pmap.pde.demotions: 40830
vm.reserv.reclaimed: 66557
vm.reserv.partpopq:
DOMAIN    LEVEL     SIZE  NUMBER
     0,      -1, 101648K,     50
     1,      -1,  93696K,     46
vm.reserv.fullpop: 198
vm.reserv.freed: 287519860
vm.reserv.broken: 66559
[pho@flix1a ~]$

Mon Oct 1 17:17:55 CEST 2018
debug.counters.negmap: 58054
debug.counters.nonneg: 742065807
debug.counters.negati: 1039443536
vm.pmap.pde.promotions: 2761682
vm.pmap.pde.p_failures: 16059695
vm.pmap.pde.mappings: 911081
vm.pmap.pde.demotions: 35005
vm.reserv.reclaimed: 48387
vm.reserv.partpopq:
DOMAIN    LEVEL     SIZE  NUMBER
     0,      -1,  24204K,     14
vm.reserv.fullpop: 2289
vm.reserv.freed: 29833419
vm.reserv.broken: 48388
[pho@mercat1 ~]$
I *really* wish that vm_fault() knew whether the copy-on-write fault was caused by a protection violation or an invalid mapping. I think that a better heuristic for determining whether to call vm_fault_prefault() would be based on that knowledge. Specifically, I would call vm_fault_prefault() when the mapping was invalid.
Currently, I restrict calls to vm_fault_prefault() to hard faults, but hard faults are only a subset of the copy-on-write faults that occur on previously invalid mappings. For example, in the case of Mark's web browsing, I suspect that one Firefox process takes a hard copy-on-write fault on some memory-mapped file, and then another Firefox process takes a soft copy-on-write fault on that same page. I think that we should be calling vm_fault_prefault() on such soft copy-on-write faults, because the neighboring mappings are most likely invalid as well.
vm/vm_fault.c:1184–1185

To be clear, here is what I intend to commit.

Index: vm/vm_fault.c
===================================================================
--- vm/vm_fault.c	(revision 339445)
+++ vm/vm_fault.c	(working copy)
@@ -1181,7 +1181,17 @@ readrest:
 			 */
 			vm_object_pip_wakeup(fs.object);
 			VM_OBJECT_WUNLOCK(fs.object);
+
+			/*
+			 * We only try to prefault read-only mappings to the
+			 * neighboring pages when this copy-on-write fault is
+			 * a hard fault.  In other cases, trying to prefault
+			 * is typically wasted effort.
+			 */
+			if (faultcount == 0)
+				faultcount = 1;
+
 			/*
 			 * Only use the new page below...
 			 */
 			fs.object = fs.first_object;
We can try to approximately calculate this information, but it is not reliable even on x86. For instance, we get a nonsensical error code for spurious page faults. Still, translating the bits from the x86 exception error code into some additional information for vm_fault() is easy.
It might be that vm_fault() should be prepared not to get any hint about the fault cause at all.
Yes, it might be interesting to pass along PGEX_P to vm_fault(). Can you please elaborate on what error codes you've seen from spurious page faults?
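As a sketch of the idea being discussed, passing the PGEX_P bit (protection violation vs. not-present) from the x86 page-fault error code down to vm_fault() could look roughly like the fragment below. VM_FAULT_SOFT_PROT and cow_fault_hint() are invented names, and the flag value is chosen only for illustration; none of this is an existing FreeBSD interface.

#include <sys/types.h>
#include <machine/frame.h>	/* struct trapframe */
#include <machine/trap.h>	/* PGEX_P */

/* Hypothetical fault_flags bit: the faulting mapping was already valid. */
#define	VM_FAULT_SOFT_PROT	0x4

/*
 * Sketch: derive a hint from the x86 error code, to be ORed into the
 * fault_flags that trap_pfault() passes to vm_fault().  With such a hint,
 * vm_fault() could prefault only when the mapping was previously invalid.
 */
static int
cow_fault_hint(struct trapframe *frame)
{

	return ((frame->tf_err & PGEX_P) != 0 ? VM_FAULT_SOFT_PROT : 0);
}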
Unfortunately, this patch destabilized the arm kernel.
The bad thing is that my Tegra does not panic, it just hangs hard, and I can't even enter the JTAG debugger [1].
Moreover, this issue is very rare; I can finish a full buildkernel in most cases. The only known test case is a full build of qt5-webengine (~16k C++ files, 10+ hours of compilation time).
I'm pretty sure that this change provoked some latent bug in the arm pmap code, but due to the nature of this bug, I really need some pointer on where to start.
Can you please give me some hints on what exactly this change does from the pmap point of view?
[1] On Tegra, the same hard hang can be caused, for example, by an access to an unimplemented register in device MMIO, or by an attempt to access a device which is in reset or has a gated clock.
Simply put, I have no idea where to start debugging this issue.
Thanks,
Michal
The patch causes somewhat fewer calls to pmap_enter_quick(), and even that is only relevant to user maps.
Since it reduces the exposure of pages to userspace, I doubt that your description of the hang causes is telling.
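To make the pmap-level effect concrete, here is a heavily simplified, hypothetical sketch of the prefault loop; prefault_sketch() is an invented name, and object locking, the backing-object chain walk, and several checks of the real vm_fault_prefault() are omitted. The point is that the loop only offers resident, fully valid neighboring pages to pmap_enter_quick(), so the change can only make the pmap see fewer pmap_enter_quick() calls on user pmaps.

#include <sys/param.h>
#include <vm/vm.h>
#include <vm/vm_param.h>
#include <vm/pmap.h>
#include <vm/vm_map.h>
#include <vm/vm_object.h>
#include <vm/vm_page.h>

/* Hypothetical, simplified stand-in for the real vm_fault_prefault(). */
static void
prefault_sketch(pmap_t pmap, vm_map_entry_t entry, vm_offset_t addra,
    int backward, int forward)
{
	vm_offset_t addr, starta, enda;
	vm_page_t m;

	starta = addra - ptoa(backward);
	if (starta < entry->start)
		starta = entry->start;
	enda = addra + ptoa(forward);
	if (enda > entry->end)
		enda = entry->end;

	for (addr = starta; addr < enda; addr += PAGE_SIZE) {
		if (addr == addra)
			continue;	/* the page that actually faulted */
		m = vm_page_lookup(entry->object.vm_object,
		    OFF_TO_IDX(entry->offset + (addr - entry->start)));
		/* Only resident, fully valid pages are offered to the pmap. */
		if (m != NULL && m->valid == VM_PAGE_BITS_ALL)
			pmap_enter_quick(pmap, addr, m, entry->protection);
	}
}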
Are you sure that this is a consequence of the patch and not some other effect, either environmental or toolchain-related? E.g., it might change the kernel text layout and trigger something.
Only the kernel was changed during bisecting, nothing else, and all kernels were cross-compiled by the same toolchain. The kernel before r339819 works, r339819 hangs, and fresh current with r339819 reverted also works. Also, during bisecting, any kernel after r339819 hangs.
So I think that r339819 is the culprit.
Please don't take the list of hang causes as complete; the JTAG debugger can halt the CPU only on an instruction boundary (or on exception entry). It's unclear whether there is any reason other than a bus hang why the CPU cannot finish an instruction.
But I understand that this is strange; that's why I'm asking for help :)