
amd64 pmap: per-domain pv chunk list
ClosedPublic

Authored by mjg on Oct 11 2019, 12:37 AM.
Details

Summary

This has the side effect of reducing contention on the vm object lock, since pmap locks are taken while the vm object lock is held. It is also preparation for allowing unused chunks to hang around on LRU lists to reduce creation/destruction. I did not try to make the lists per-CPU because of the current reclamation scheme. I'm not fond of the way the physical address is found. Basic ideas for solving it:

  • encode the domain in lower bits of the pointer
  • reduce the number of chunks by 1
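
The first idea could look something like this userspace sketch (names and the 2-bit domain field are assumptions, not the patch as committed): since pv chunks are page-aligned, the low bits of the chunk pointer are always zero and can carry the domain id, avoiding the DMAP-based physical address lookup.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical sketch: pv chunks are page-aligned, so the low pointer
 * bits are free to encode the domain. This mock supports 4 domains.
 */
#define PC_DOMAIN_MASK	0x3UL

static inline uintptr_t
pc_encode(void *pc, int domain)
{
	/* assumes pc is at least 4-byte aligned and domain < 4 */
	return ((uintptr_t)pc | ((uintptr_t)domain & PC_DOMAIN_MASK));
}

static inline void *
pc_ptr(uintptr_t tagged)
{
	/* strip the domain bits to recover the real pointer */
	return ((void *)(tagged & ~PC_DOMAIN_MASK));
}

static inline int
pc_domain(uintptr_t tagged)
{
	return ((int)(tagged & PC_DOMAIN_MASK));
}
```

The trade-off is that every dereference of a tagged pointer must go through the masking accessor, which is easy to get wrong in a file as large as pmap.c.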

This is 90 minutes of poudriere -j 104 with some other local changes plus markj's atomic queue state patch (read: no vm page contention):

head                                            per-domain pv chunk
1940930217 (rw:vm object)                       1477941752 (sx:vm map (user))
1767248576 (sx:proctree)                        1455213049 (rw:vm object)
1457914859 (sx:vm map (user))                   1431832545 (sx:proctree)
1027485418 (sleep mutex:VM reserv domain)       829714675 (rw:pmap pv list)
916225793 (rw:pmap pv list)                     827757829 (sleep mutex:VM reserv domain)
650061753 (sleep mutex:ncvn)                    549093363 (sleep mutex:vm active pagequeue)
579930729 (sleep mutex:vm active pagequeue)     543907775 (sleep mutex:ncvn)
500588125 (sleep mutex:pfs_vncache)             529903985 (sleep mutex:process lock)
483413129 (sleep mutex:pmap pv chunk list)      510587707 (sleep mutex:pfs_vncache)
470146522 (sleep mutex:process lock)            416087302 (sleep mutex:vm page free queue)
438331739 (sleep mutex:vm page free queue)      372820786 (lockmgr:tmpfs)
432400293 (lockmgr:tmpfs)                       341279654 (sleep mutex:struct mount vlist mtx)
374447164 (lockmgr:zfs)                         309354907 (lockmgr:zfs)
324149303 (sleep mutex:struct mount vlist mtx)  257888425 (spin mutex:sleepq chain)
250128575 (spin mutex:sleepq chain)             244137628 (sleep mutex:vnode interlock)
228014749 (sleep mutex:vnode interlock)         231878487 (sleep mutex:vnode_free_list)
221631639 (sleep mutex:vnode_free_list)         181800839 (sleep mutex:pmap pv chunk list)


Event Timeline

mjg added reviewers: alc, kib, markj, jeff.
mjg edited the summary of this revision. (Show Details)
sys/amd64/amd64/pmap.c
435–436

You can simplify this to:

return (_vm_phys_domain(DMAP_TO_PHYS((vm_offset_t)pc)));
  • simplify pc_to_domain
  • when initializing lists go for PMAP_MEMDOM instead of looping
  • fix the comment about initializing pmap pv chunk list
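
The "go for PMAP_MEMDOM" bullet above might look roughly like this userspace sketch (mock types; PMAP_MEMDOM, the structure layout, and the function name are assumptions): the per-domain list array is sized by the compile-time PMAP_MEMDOM bound and every slot is initialized up front.

```c
#include <stddef.h>

#define PMAP_MEMDOM	8	/* sketch: compile-time max domains */

struct pv_chunk;		/* opaque in this mock */

/*
 * One chunk list (head plus an "initialized" flag standing in for the
 * real lock) per memory domain, statically sized to PMAP_MEMDOM.
 */
struct pv_chunks_list {
	struct pv_chunk	*first;
	int		 initialized;
} pv_chunks[PMAP_MEMDOM];

static void
pv_chunks_init(void)
{
	int i;

	/*
	 * Initialize all PMAP_MEMDOM slots up front, even on systems
	 * with fewer active domains, so no runtime domain count is
	 * needed here.
	 */
	for (i = 0; i < PMAP_MEMDOM; i++) {
		pv_chunks[i].first = NULL;
		pv_chunks[i].initialized = 1;
	}
}
```

Sizing by the compile-time constant wastes a few list heads on small systems but keeps initialization independent of vm_ndomains.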
sys/amd64/amd64/pmap.c
453

Hijacking this review for a second: the condition here (and in related code below) should not be VM_NRESERVLEVEL > 0, because we use the pv locks regardless of whether superpage reservations are enabled. In other words, someone who disables reservations but has a large number of processors still deserves the additional locks. Disabling reservations only affects the use of struct md_page; specifically, we'll never insert anything into the pv_list. So, conditioning this code on NUMA would be better.

sys/amd64/amd64/pmap.c
453

I mostly agree and can change that no problem (it's basically some churn). The question is if it can wait until after this one settles. If not, I will create a separate review and then rebase this patch.

In the meantime, do you have further comments about the change as proposed?

sys/amd64/amd64/pmap.c
4393

So you are only visiting, for reclamation, the domains which are allowed by the curthread policy? IMO that is wrong: if we are at the stage where reclaim_pv_chunk is called, we must get to a fresh pv chunk by all means.

sys/amd64/amd64/pmap.c
4393

I'm fine either way. But in that case, do we want to walk "our" domains first and only then iterate over the rest? Or just walk everything as it is, starting from 0?

sys/amd64/amd64/pmap.c
4393

How about this: a simple rotor is added; threads fetchadd into it and walk the list indicated by the new count % vm_ndomains. That way they spread themselves out when multiple CPUs get here. But this also increases the likelihood of getting a page from the "wrong" domain, which may not be a problem given the circumstances.
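
The rotor proposal might be sketched like this (userspace mock; the names and the callback are hypothetical): each thread entering reclamation fetchadds a global counter, starts at count % vm_ndomains, and still wraps around so every domain is eventually visited.

```c
#include <stdatomic.h>

/* Mock domain count; in the kernel this is set at boot. */
static int vm_ndomains = 4;

/* Hypothetical rotor spreading concurrent reclaimers across domains. */
static _Atomic unsigned int pv_reclaim_rotor;

static int
reclaim_start_domain(void)
{
	/* fetchadd returns the previous value */
	return ((int)(atomic_fetch_add(&pv_reclaim_rotor, 1) %
	    (unsigned int)vm_ndomains));
}

/*
 * Walk all domains starting from `start`, wrapping around, so a fresh
 * pv chunk is found even if it lives in a "wrong" domain.  Returns the
 * domain that yielded a chunk, or -1 if none did.
 */
static int
reclaim_walk(int start, int (*try_domain)(int))
{
	int dom, i;

	for (i = 0; i < vm_ndomains; i++) {
		dom = (start + i) % vm_ndomains;
		if (try_domain(dom))
			return (dom);
	}
	return (-1);
}
```

The same walk also accommodates the suggestion below: pass the domain of the page as `start` instead of the rotor value.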

sys/amd64/amd64/pmap.c
4393

You can start from the domain of the page, instead of from the counter % ndomains.

I am interested to know how the lock profile changes if you perform the following commands (in order) before starting the poudriere:

$ cc -v
$ dd if=`which cc` of=/dev/null 
sys/amd64/amd64/pmap.c
453

I've only glanced at this change. I should be able to carefully review it within the next 36 hours.

This would be a NOP in this workload. I see I did not mention it in this review, but the setup is as follows: there are n worker jails, each with a tmpfs-based world. Each jail comes with a cpuset binding it to a single domain, and tmpfs itself is populated while bound to said domain.

I presume you are after reduced fragmentation so that more superpages can be used, in particular for huge (and frequently used) binaries like the compiler.

In a much less involved test, where the machine is freshly booted and I just mount tmpfs and buildkernel in it, everything is fine until I unmount and start from scratch. From that point on there is a significant performance drop stemming from constant pv relocking in pmap while the (highly contended) vm object lock is held. The following toy patch takes care of the problem for me, but it is only meant to prove that it's the fragmentation which creates the problem. I don't know what the real fix would look like.

diff --git a/sys/fs/tmpfs/tmpfs_vnops.c b/sys/fs/tmpfs/tmpfs_vnops.c
index 4bbf3485909f..a234aefae0fa 100644
--- a/sys/fs/tmpfs/tmpfs_vnops.c
+++ b/sys/fs/tmpfs/tmpfs_vnops.c
@@ -302,6 +302,7 @@ tmpfs_open(struct vop_open_args *v)
 		KASSERT(vp->v_type != VREG || (node->tn_reg.tn_aobj->flags &
 		    OBJ_DEAD) == 0, ("dead object"));
 		vnode_create_vobject(vp, node->tn_size, v->a_td);
+		vm_object_color(vp->v_object, 0);
 	}
 
 	MPASS(VOP_ISLOCKED(vp));
  • reclaim chunks from all domains
This revision is now accepted and ready to land. Oct 23 2019, 3:12 PM
This revision was automatically updated to reflect the committed changes.