Page MenuHomeFreeBSD

amd64 pmap: per-domain pv chunk list

Authored by mjg on Oct 11 2019, 12:37 AM.



This has a side effect of reducing contention of vm object (since pmap locks are held with vm object lock). This is also preparation for allowing unused chunks to hang around on LRU lists to reduce creation/destruction. I did not try to make them per-cpu as it is due to the current reclamation scheme. I'm not fond of the way the physical addr is found. Basic ideas how to solve it are:

  • encode the domain in lower bits of the pointer
  • reduce the number of chunks by 1

This is 90 minutes of poudriere -j 104 with some other local changes + markj's atomic queue state patch (read: no vm page contention)

headper-domain pv chunk
1940930217 (rw:vm object)1477941752 (sx:vm map (user))
1767248576 (sx:proctree)1455213049 (rw:vm object)
1457914859 (sx:vm map (user))1431832545 (sx:proctree)
1027485418 (sleep mutex:VM reserv domain)829714675 (rw:pmap pv list)
916225793 (rw:pmap pv list)827757829 (sleep mutex:VM reserv domain)
650061753 (sleep mutex:ncvn)549093363 (sleep mutex:vm active pagequeue)
579930729 (sleep mutex:vm active pagequeue)543907775 (sleep mutex:ncvn)
500588125 (sleep mutex:pfs_vncache)529903985 (sleep mutex:process lock)
483413129 (sleep mutex:pmap pv chunk list)510587707 (sleep mutex:pfs_vncache)
470146522 (sleep mutex:process lock)416087302 (sleep mutex:vm page free queue)
438331739 (sleep mutex:vm page free queue)372820786 (lockmgr:tmpfs)
432400293 (lockmgr:tmpfs)341279654 (sleep mutex:struct mount vlist mtx)
374447164 (lockmgr:zfs)309354907 (lockmgr:zfs)
324149303 (sleep mutex:struct mount vlist mtx)257888425 (spin mutex:sleepq chain)
250128575 (spin mutex:sleepq chain)244137628 (sleep mutex:vnode interlock)
228014749 (sleep mutex:vnode interlock)231878487 (sleep mutex:vnode_free_list)
221631639 (sleep mutex:vnode_free_list)181800839 (sleep mutex:pmap pv chunk list)

Diff Detail

rS FreeBSD src repository
Automatic diff as part of commit; lint not applicable.
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

mjg created this revision.Oct 11 2019, 12:37 AM
mjg edited the summary of this revision. (Show Details)Oct 11 2019, 12:38 AM
mjg added reviewers: alc, kib, markj, jeff.
mjg edited the summary of this revision. (Show Details)Oct 11 2019, 12:40 AM
mjg edited the summary of this revision. (Show Details)
alc added inline comments.Oct 11 2019, 2:20 PM
430–431 ↗(On Diff #63139)

You can simplify this to:

return (_vm_phys_domain(DMAP_TO_PHYS((vm_offset_t)pc));
mjg updated this revision to Diff 63152.Oct 11 2019, 3:13 PM
  • simplify pc_to_domain
  • when initializing lists go for PMAP_MEMDOM instead of looping
  • fix the comment about initializing pmap pv chunk list
alc added inline comments.Oct 11 2019, 3:37 PM
450 ↗(On Diff #63139)

Hijacking this review for a second, the condition here (and in related code below) should not be VM_NRESERVLEVEL > 0, because we use the pv locks regardless of whether superpage reservations are enabled. In other words, someone, who disables reservations, but has a large number of processors, still deserves the additional locks. Disabling reservations only affects the use of the struct md_page, specifically, we'll never insert anything into the pv_list. So, conditioning this code on NUMA would be better.

mjg added inline comments.Oct 11 2019, 3:42 PM
450 ↗(On Diff #63139)

I mostly agree and can change that no problem (it's basically some churn). The question is if it can wait until after this one settles. If not, I will create a separate review and then rebase this patch.

In the meantime, do you have further comments about the change as proposed?

kib added inline comments.Oct 11 2019, 5:07 PM
4385 ↗(On Diff #63152)

So you are only visiting for reclamation domains which are allowed by the curthread policy ? IMO it is wrong, if we are at the stage where reclaim_pv_chunk is called, all means we must get to a fresh pv chunk.

mjg added inline comments.Oct 11 2019, 5:16 PM
4385 ↗(On Diff #63152)

I'm fine either way. But in this case, do we want to walk "our" domains first and only then iterate the rest? Or just walk everything as it is starting from 0.

mjg added inline comments.Oct 11 2019, 5:19 PM
4385 ↗(On Diff #63152)

How about this: simple rotor is added, threads fetchadd into it and they walk the list indicated by the new count % vm_ndomains. That way they spread themselves in case of multiple CPUs getting here. But this also increases likelyhood of getting a page from the "wrong" domain, which may not be a problem given the circumstances.

kib added inline comments.Oct 11 2019, 5:22 PM
4385 ↗(On Diff #63152)

You can start from the domain of the page, instead from the counter % ndomains.

alc added a comment.Oct 11 2019, 5:38 PM

I am interested to know how the lock profile changes if you perform the following commands (in order) before starting the poudriere:

$ cc -v
$ dd if=`which cc` of=/dev/null 
450 ↗(On Diff #63139)

I've only glanced at this change. I should be able to carefully review it within the next 36 hours.

mjg added a comment.EditedOct 11 2019, 5:58 PM

This would be a NOP in this workload. I see I did not mention in this review, but the setup is as follows: there are n worker jails, each with a tmpfs-based world. Each jail comes with a cpuset binding it to only one domain and tmpfs itself is populated while bound to said domain.

I presume you are after reduced fragmentation so that more superpages can be used, in particular for huge (and frequently used) binaries like the compiler.

In a much less involved test where the machine is freeshly booted and I just mount tmpfs and buildkernel in it, everything is fine until I unmount and start from scratch. From that point there is a significant performance drop stemming from constant pv relocking in pmap while the (highly contended) vm object lock is held. The following toy patch takes care of the problem for me, but it's only to prove it's the fragmentation which creates the problem. I don't know how the real fix would look like.

diff --git a/sys/fs/tmpfs/tmpfs_vnops.c b/sys/fs/tmpfs/tmpfs_vnops.c
index 4bbf3485909f..a234aefae0fa 100644
--- a/sys/fs/tmpfs/tmpfs_vnops.c
+++ b/sys/fs/tmpfs/tmpfs_vnops.c
@@ -302,6 +302,7 @@ tmpfs_open(struct vop_open_args *v)
 		KASSERT(vp->v_type != VREG || (node->tn_reg.tn_aobj->flags &
 		    OBJ_DEAD) == 0, ("dead object"));
 		vnode_create_vobject(vp, node->tn_size, v->a_td);
+		vm_object_color(vp->v_object, 0);
mjg updated this revision to Diff 63162.Oct 11 2019, 7:35 PM
  • reclaim chunks from all domains
mjg updated this revision to Diff 63561.Tue, Oct 22, 11:07 PM
  • rebase
kib accepted this revision.Wed, Oct 23, 3:12 PM
This revision is now accepted and ready to land.Wed, Oct 23, 3:12 PM
This revision was automatically updated to reflect the committed changes.