It's technically possible for read faults and copy-on-write faults
taken on pages that are shared-busy and completely valid to be handled
without waiting for the page to be unbusied. Add some counters to
track these cases in order to evaluate whether the optimization to
the vm_fault() path would be worth the effort.
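For concreteness, a rough sketch of how such counters might be wired up (the vm.stats.fault node, the counter name, and the check below are illustrative guesses matching the sysctl output quoted in the comments, not the actual diff):

```
#include <sys/param.h>
#include <sys/counter.h>
#include <sys/kernel.h>
#include <sys/sysctl.h>

#include <vm/vm.h>
#include <vm/vm_page.h>

SYSCTL_DECL(_vm_stats);

/* Assumed node backing the vm.stats.fault.* sysctls. */
static SYSCTL_NODE(_vm_stats, OID_AUTO, fault,
    CTLFLAG_RD | CTLFLAG_MPSAFE, 0, "VM fault statistics");

static COUNTER_U64_DEFINE_EARLY(vmpfw_shared_read);
SYSCTL_COUNTER_U64(_vm_stats_fault, OID_AUTO, vmpfw_shared_read, CTLFLAG_RD,
    &vmpfw_shared_read,
    "Read faults that waited on a shared-busy, fully valid page");

/*
 * Illustrative counting hook for the spot in vm_fault() where the
 * fault would otherwise go to sleep on a busy page.
 */
static void
vm_fault_count_busy_wait(vm_page_t m)
{
	if (vm_page_sbusied(m) && vm_page_all_valid(m))
		counter_u64_add(vmpfw_shared_read, 1);
}
```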
Event Timeline
I have no comments about the usefulness of adding these counters. I do have to note that the patch negatively affects scalability: it adds loads from the page, i.e., from something that gets modified by other threads. If these stats are added, it would have to happen inside vm_page_busy_sleep().
However, if this concerns the pmap discussion, I have to point you at https://reviews.freebsd.org/D26011 . While there is no consensus on which way to go with that patch, it offers a way to gather all these statistics with dtrace while running the load, and without the scalability hit noted above.
Regardless of what happens here, I think evaluating vmpfw will have to wait until someone sorts out the object concurrency branch, as that is the primary bottleneck masking at least some of the load that would otherwise show up here.
I kind of wanted to avoid stuffing these into the guts of vm_page_busy_sleep(), since I suspect we wouldn't care about them at all in the non-vm_fault() case, and in fact they could be misleading there. Would another option be to conditionally compile the counters only under INVARIANTS (or DIAGNOSTIC)?
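Something like the following, as a rough sketch (the macro name is invented for illustration):

```
/*
 * Hypothetical: count only on kernels built with INVARIANTS, so that
 * production kernels take no extra loads on the fault path.
 */
#ifdef INVARIANTS
#define	VM_FAULT_STAT_INC(c)	counter_u64_add((c), 1)
#else
#define	VM_FAULT_STAT_INC(c)	do { } while (0)
#endif
```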
Or it might be better to just keep this patch around without committing it. Results from a buildworld run:
vm.stats.fault.vmpfw_exclusive: 2048
vm.stats.fault.vmpfw_shared_cow: 2
vm.stats.fault.vmpfw_shared_read: 0
This is a small machine (a 2-vCPU guest), without Jeff's concurrency improvements, but the numbers are not encouraging so far.
It is very likely that at least some of the busying, shared or exclusive, can be avoided, and some of the exclusive busying can probably be downgraded to shared.
Even if some form of waiting has to be performed, the code could do adaptive spinning instead of going off CPU.
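Roughly the following shape, as a hypothetical sketch (the function name, SPIN_LIMIT, and the fallback are illustrative; vm_page_busy_sleep() also has object-lock requirements not shown here):

```
/*
 * Spin briefly in the hope that the busy owner unbusies the page
 * while still on CPU; only once the spin budget is exhausted does
 * the caller fall back to sleeping.
 */
#define	SPIN_LIMIT	1000	/* made-up bound */

static bool
vm_page_xbusy_adaptive(vm_page_t m)
{
	int i;

	for (i = 0; i < SPIN_LIMIT; i++) {
		if (vm_page_tryxbusy(m) != 0)
			return (true);	/* acquired without going off CPU */
		cpu_spinwait();		/* pause hint between attempts */
	}
	return (false);	/* caller sleeps, e.g. via vm_page_busy_sleep() */
}
```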
Regardless of the above, some of the existing fcmpset loops can probably be replaced with fetchadd.
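To illustrate the difference (a generic sketch against the page's ref_count, not a specific loop from the tree):

```
u_int old, new;

/* fcmpset loop: every contended failure costs a reload and a retry. */
old = atomic_load_int(&m->ref_count);
do {
	new = old + 1;
} while (atomic_fcmpset_int(&m->ref_count, &old, new) == 0);

/*
 * fetchadd: a single atomic that always succeeds.  Only valid where
 * the update is unconditional, i.e. nothing is checked against the
 * old value before storing the new one.
 */
atomic_fetchadd_int(&m->ref_count, 1);
```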
All in all, I would be very surprised if there were no significant wins to be had here at 100-ish thread scale.
All of this has to wait for the major bottleneck to be sorted out though.
The 2050 total cases for the whole buildworld cast doubt on the claim that the "vmpfw" wait state dominates e.g. a buildworld load. It might indeed; the problem is that the configuration is two-CPU, and that it runs under a VM (VMs tend to introduce scheduling slew, which nicely exposes races not seen on real hardware, but is not useful for benchmarking). You need to find some load where the busy wait in vm_fault() is the problem.
The claim was that the vm object dominates off-CPU time and limits the concurrent traffic leading to vmpfw.
Here is a fresh head doing -j 104 buildkernel on tmpfs: https://people.freebsd.org/~mjg/fg/flix1-buildkernel-offcpu.svg
The vm object is at 74.24% and vmpfw at 8.30%; the latter will go up significantly once the former gets sorted out.