amd64 pmap: per-domain pv chunk list
ClosedPublic

Authored by mjg on Oct 11 2019, 12:37 AM.
Details

Summary

This has the side effect of reducing contention on the vm object lock (since pmap locks are taken while the vm object lock is held). It is also preparation for allowing unused chunks to hang around on LRU lists to reduce creation/destruction. I did not try to make the lists per-CPU because of the current reclamation scheme. I'm not fond of the way the physical addr is found. Basic ideas for how to solve it are:

  • encode the domain in lower bits of the pointer
  • reduce the number of chunks by 1
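The first idea above, encoding the domain in the lower bits of the pointer, could look roughly like the sketch below. It is a user-space illustration and not code from the patch: DOMAIN_BITS and the tag_* helpers are hypothetical names, and it assumes the pointer is aligned strictly enough that the low bits are always zero.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical: number of low pointer bits reserved for the domain id. */
#define DOMAIN_BITS 3u
#define DOMAIN_MASK (((uintptr_t)1 << DOMAIN_BITS) - 1)

/* Pack a domain id into the low bits of a suitably aligned pointer. */
static uintptr_t
tag_encode(void *ptr, unsigned domain)
{
	uintptr_t raw = (uintptr_t)ptr;

	assert((raw & DOMAIN_MASK) == 0);	/* pointer must be aligned */
	assert(domain <= DOMAIN_MASK);		/* domain must fit in the tag */
	return (raw | domain);
}

/* Recover the original pointer by masking off the tag bits. */
static void *
tag_pointer(uintptr_t tagged)
{
	return ((void *)(tagged & ~DOMAIN_MASK));
}

/* Recover the domain id stored in the tag bits. */
static unsigned
tag_domain(uintptr_t tagged)
{
	return ((unsigned)(tagged & DOMAIN_MASK));
}
```

The trade-off is that every user of the pointer has to mask it before dereferencing, in exchange for not having to look the domain up from the address.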

The numbers below are from 90 minutes of poudriere -j 104 with some other local changes + markj's atomic queue state patch (read: no vm page contention):

| head | per-domain pv chunk |
| --- | --- |
| 1940930217 (rw:vm object) | 1477941752 (sx:vm map (user)) |
| 1767248576 (sx:proctree) | 1455213049 (rw:vm object) |
| 1457914859 (sx:vm map (user)) | 1431832545 (sx:proctree) |
| 1027485418 (sleep mutex:VM reserv domain) | 829714675 (rw:pmap pv list) |
| 916225793 (rw:pmap pv list) | 827757829 (sleep mutex:VM reserv domain) |
| 650061753 (sleep mutex:ncvn) | 549093363 (sleep mutex:vm active pagequeue) |
| 579930729 (sleep mutex:vm active pagequeue) | 543907775 (sleep mutex:ncvn) |
| 500588125 (sleep mutex:pfs_vncache) | 529903985 (sleep mutex:process lock) |
| 483413129 (sleep mutex:pmap pv chunk list) | 510587707 (sleep mutex:pfs_vncache) |
| 470146522 (sleep mutex:process lock) | 416087302 (sleep mutex:vm page free queue) |
| 438331739 (sleep mutex:vm page free queue) | 372820786 (lockmgr:tmpfs) |
| 432400293 (lockmgr:tmpfs) | 341279654 (sleep mutex:struct mount vlist mtx) |
| 374447164 (lockmgr:zfs) | 309354907 (lockmgr:zfs) |
| 324149303 (sleep mutex:struct mount vlist mtx) | 257888425 (spin mutex:sleepq chain) |
| 250128575 (spin mutex:sleepq chain) | 244137628 (sleep mutex:vnode interlock) |
| 228014749 (sleep mutex:vnode interlock) | 231878487 (sleep mutex:vnode_free_list) |
| 221631639 (sleep mutex:vnode_free_list) | 181800839 (sleep mutex:pmap pv chunk list) |

Event Timeline

mjg added reviewers: alc, kib, markj, jeff.
mjg edited the summary of this revision.
sys/amd64/amd64/pmap.c
435–436

You can simplify this to:

return (_vm_phys_domain(DMAP_TO_PHYS((vm_offset_t)pc)));
  • simplify pc_to_domain
  • when initializing lists go for PMAP_MEMDOM instead of looping
  • fix the comment about initializing pmap pv chunk list
sys/amd64/amd64/pmap.c
453

Hijacking this review for a second: the condition here (and in related code below) should not be VM_NRESERVLEVEL > 0, because we use the pv locks regardless of whether superpage reservations are enabled. In other words, someone who disables reservations but has a large number of processors still deserves the additional locks. Disabling reservations only affects the use of the struct md_page; specifically, we'll never insert anything into the pv_list. So conditioning this code on NUMA would be better.

sys/amd64/amd64/pmap.c
453

I mostly agree and can change that, no problem (it's basically some churn). The question is whether it can wait until after this one settles. If not, I will create a separate review and then rebase this patch.

In the meantime, do you have further comments about the change as proposed?

sys/amd64/amd64/pmap.c
4393

So for reclamation you are only visiting domains which are allowed by the curthread policy? IMO that is wrong: if we are at the stage where reclaim_pv_chunk is called, we must by all means get a fresh pv chunk.

sys/amd64/amd64/pmap.c
4393

I'm fine either way. But in this case, do we want to walk "our" domains first and only then iterate over the rest? Or just walk everything as it is, starting from 0?

sys/amd64/amd64/pmap.c
4393

How about this: a simple rotor is added; threads fetchadd into it and walk the list indicated by the new count % vm_ndomains. That way they spread themselves out when multiple CPUs get here at once. But this also increases the likelihood of getting a page from the "wrong" domain, which may not be a problem given the circumstances.
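A minimal user-space sketch of the rotor idea, using C11 atomics in place of the kernel's fetchadd primitive. The names reclaim_rotor and rotor_next_domain are hypothetical, and ndomains stands in for vm_ndomains; this is not the code that was committed.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical shared rotor; every caller bumps it exactly once. */
static atomic_uint reclaim_rotor;

/*
 * Pick the domain whose list this caller walks.  atomic_fetch_add
 * returns the previous value, so add one to obtain the "new count"
 * mentioned in the comment above.
 */
static unsigned
rotor_next_domain(unsigned ndomains)
{
	unsigned count;

	assert(ndomains > 0);
	count = atomic_fetch_add(&reclaim_rotor, 1) + 1;
	return (count % ndomains);
}
```

Callers arriving back to back get consecutive domains, which is what spreads them across the lists; nothing here ties the chosen domain to the caller's own locality.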

sys/amd64/amd64/pmap.c
4393

You can start from the domain of the page, instead of from the counter % ndomains.
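As a rough illustration of that suggestion, the walk can begin at a given domain and wrap around so that every domain is still visited once. The helper domain_walk_order is a hypothetical name, and how the page's domain is obtained is left out; start is simply passed in.

```c
#include <assert.h>

/*
 * Fill order[] with all domain ids, beginning at start and wrapping
 * around, so that each domain appears exactly once.
 */
static void
domain_walk_order(unsigned start, unsigned ndomains, unsigned *order)
{
	unsigned i;

	assert(ndomains > 0 && start < ndomains);
	for (i = 0; i < ndomains; i++)
		order[i] = (start + i) % ndomains;
}
```

With four domains and a page in domain 2, the lists would be tried in the order 2, 3, 0, 1.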

I am interested to know how the lock profile changes if you perform the following commands (in order) before starting the poudriere:

$ cc -v
$ dd if=`which cc` of=/dev/null 
sys/amd64/amd64/pmap.c
453

I've only glanced at this change. I should be able to carefully review it within the next 36 hours.

This would be a NOP in this workload. I see I did not mention it in this review, but the setup is as follows: there are n worker jails, each with a tmpfs-based world. Each jail comes with a cpuset binding it to only one domain, and tmpfs itself is populated while bound to said domain.

I presume you are after reduced fragmentation so that more superpages can be used, in particular for huge (and frequently used) binaries like the compiler.

In a much less involved test where the machine is freshly booted and I just mount tmpfs and buildkernel in it, everything is fine until I unmount and start from scratch. From that point there is a significant performance drop stemming from constant pv relocking in pmap while the (highly contended) vm object lock is held. The following toy patch takes care of the problem for me, but it's only to prove that it's the fragmentation which creates the problem. I don't know what the real fix would look like.

diff --git a/sys/fs/tmpfs/tmpfs_vnops.c b/sys/fs/tmpfs/tmpfs_vnops.c
index 4bbf3485909f..a234aefae0fa 100644
--- a/sys/fs/tmpfs/tmpfs_vnops.c
+++ b/sys/fs/tmpfs/tmpfs_vnops.c
@@ -302,6 +302,7 @@ tmpfs_open(struct vop_open_args *v)
 		KASSERT(vp->v_type != VREG || (node->tn_reg.tn_aobj->flags &
 		    OBJ_DEAD) == 0, ("dead object"));
 		vnode_create_vobject(vp, node->tn_size, v->a_td);
+		vm_object_color(vp->v_object, 0);
 	}
 
 	MPASS(VOP_ISLOCKED(vp));
  • reclaim chunks from all domains
This revision is now accepted and ready to land.
Oct 23 2019, 3:12 PM
This revision was automatically updated to reflect the committed changes.