It's technically possible for read faults and copy-on-write faults
taken on pages that are shared-busy and completely valid to be handled
without waiting for the page to be unbusied. Add some counters to
track these cases in order to evaluate whether the optimization to
the vm_fault() path would be worth the effort.
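For concreteness, a rough sketch of how such counters might be wired up (the vm.stats.fault node, the counter name, and the check below are illustrative guesses matching the sysctl output quoted in the comments, not the actual diff):

```
#include <sys/param.h>
#include <sys/counter.h>
#include <sys/kernel.h>
#include <sys/sysctl.h>

#include <vm/vm.h>
#include <vm/vm_page.h>

SYSCTL_DECL(_vm_stats);

/* Assumed node backing the vm.stats.fault.* sysctls. */
static SYSCTL_NODE(_vm_stats, OID_AUTO, fault,
    CTLFLAG_RD | CTLFLAG_MPSAFE, 0, "VM fault statistics");

static COUNTER_U64_DEFINE_EARLY(vmpfw_shared_read);
SYSCTL_COUNTER_U64(_vm_stats_fault, OID_AUTO, vmpfw_shared_read, CTLFLAG_RD,
    &vmpfw_shared_read,
    "Read faults that waited on a shared-busy, fully valid page");

/*
 * Illustrative counting hook for the spot in vm_fault() where the
 * fault would otherwise go to sleep on a busy page.
 */
static void
vm_fault_count_busy_wait(vm_page_t m)
{
	if (vm_page_sbusied(m) && vm_page_all_valid(m))
		counter_u64_add(vmpfw_shared_read, 1);
}
```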
Event Timeline
I have no comments about the usefulness of adding these counters. I do have to note that the patch negatively affects scalability: it adds loads from the page, i.e., from something that gets modified by other threads. If these stats are added, it would have to happen inside vm_page_busy_sleep().
However, if this concerns the pmap discussion, I have to point you at https://reviews.freebsd.org/D26011 . While there is no consensus on which way to go with that patch, it offers a way to gather all these statistics with dtrace while running the load, and without the scalability hit noted above.
Regardless of what happens here, I think evaluating vmpfw will have to wait until someone sorts out the object concurrency branch, as that is the primary bottleneck masking at least some of the load that would otherwise show up here.
I kind of wanted to avoid stuffing these into the guts of vm_page_busy_sleep(), since I suspect we wouldn't care about them at all in the non-vm_fault() case, and in fact they could be misleading there. Would another option be to conditionally compile the counters only under INVARIANTS (or DIAGNOSTIC)?
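Something like the following, as a rough sketch (the macro name is invented for illustration):

```
/*
 * Hypothetical: count only on kernels built with INVARIANTS, so that
 * production kernels take no extra loads on the fault path.
 */
#ifdef INVARIANTS
#define	VM_FAULT_STAT_INC(c)	counter_u64_add((c), 1)
#else
#define	VM_FAULT_STAT_INC(c)	do { } while (0)
#endif
```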
Or it might be better to just keep this patch around without committing it. Results from a buildworld run:
vm.stats.fault.vmpfw_exclusive: 2048
vm.stats.fault.vmpfw_shared_cow: 2
vm.stats.fault.vmpfw_shared_read: 0
This is a small machine (a 2-vCPU guest), without Jeff's concurrency improvements, but the numbers are not encouraging so far.
It is very likely that at least some of the busying, shared or exclusive, can be avoided, and some of the exclusive busying can probably be downgraded to shared.
Even if some form of waiting has to be performed, the code could do adaptive spinning instead of going off CPU.
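Roughly the following shape, as a hypothetical sketch (the function name, SPIN_LIMIT, and the fallback are illustrative; vm_page_busy_sleep() also has object-lock requirements not shown here):

```
/*
 * Spin briefly in the hope that the busy owner unbusies the page
 * while still on CPU; only once the spin budget is exhausted does
 * the caller fall back to sleeping.
 */
#define	SPIN_LIMIT	1000	/* made-up bound */

static bool
vm_page_xbusy_adaptive(vm_page_t m)
{
	int i;

	for (i = 0; i < SPIN_LIMIT; i++) {
		if (vm_page_tryxbusy(m) != 0)
			return (true);	/* acquired without going off CPU */
		cpu_spinwait();		/* pause hint between attempts */
	}
	return (false);	/* caller sleeps, e.g. via vm_page_busy_sleep() */
}
```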
Regardless of the above, some of the existing fcmpset loops can probably be replaced with fetchadd.
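To illustrate the difference (a generic sketch against the page's ref_count, not a specific loop from the tree):

```
u_int old, new;

/* fcmpset loop: every contended failure costs a reload and a retry. */
old = atomic_load_int(&m->ref_count);
do {
	new = old + 1;
} while (atomic_fcmpset_int(&m->ref_count, &old, new) == 0);

/*
 * fetchadd: a single atomic that always succeeds.  Only valid where
 * the update is unconditional, i.e. nothing is checked against the
 * old value before storing the new one.
 */
atomic_fetchadd_int(&m->ref_count, 1);
```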
All in all, I would be very surprised if there were no significant wins to be had here at 100-ish thread scale.
All of this has to wait for the major bottleneck to be sorted out though.
The 2050 total cases for the whole buildworld cast doubt on the claim that the "vmpfw" wait state dominates e.g. a buildworld load. It might indeed; the problem is that the configuration is two-CPU, and that it runs under a VM (VMs tend to introduce scheduling slew, which nicely exposes races not seen on real hardware, but is not useful for benchmarking). You need to find some load where the busy wait in vm_fault() is the problem.
The claim was that the vm object dominates off-CPU time and limits the concurrent traffic leading to vmpfw.
Here is a fresh head doing -j 104 buildkernel on tmpfs: https://people.freebsd.org/~mjg/fg/flix1-buildkernel-offcpu.svg
The vm object is at 74.24% and vmpfw at 8.30%; the latter will go up significantly once the former gets sorted out.