
amd64 pmap: retire the global lock array in favor of per-superpage locks
ClosedPublic

Authored by mjg on Sep 28 2019, 7:02 PM.

Details

Summary

As discussed with markj and jeff, the waste stemming from adding locks per superpage may be tolerable given a good enough win.

pv list locks are highly contended during poudriere -j 104. Results below are total wait times from 90 minutes of said workload with head as of r352837 + local patches.

head                                     | per-superpage lock                          | per-superpage lock + batching
-----------------------------------------|---------------------------------------------|------------------------------------------
14750058915 (rw:pmap pv list)            | 3854385128 (sleep mutex:vm page)            | 3989607911 (sleep mutex:vm page)
3374286316 (sx:vm map (user))            | 2256786712 (rw:vm object)                   | 2164843658 (sx:vm map (user))
3331328547 (sleep mutex:vm page)         | 2173768388 (sx:vm map (user))               | 2043301274 (rw:vm object)
2605370237 (rw:vm object)                | 1526533364 (sx:proctree)                    | 1461144904 (sx:proctree)
1286594764 (sx:proctree)                 | 1346192588 (rw:pmap pv list)                | 1040647132 (sleep mutex:VM reserv domain)
867052484 (sleep mutex:ncvn)             | 966399834 (sleep mutex:ncvn)                | 926036395 (rw:pmap pv list)
748340242 (sleep mutex:VM reserv domain) | 913893270 (sleep mutex:VM reserv domain)    | 617706321 (sleep mutex:ncvn)
498943272 (lockmgr:tmpfs)                | 780144491 (sleep mutex:pmap pv chunk list)  | 499182196 (sleep mutex:pfs_vncache)

That's about a 10x reduction for per-superpage locking and a 16x reduction when combined with batching (D21832).
In my opinion this provides a win which justifies the extra space.

Extra space can be reduced in 2 ways with minor work:

  • the pointer array is avoidable: instead we can carve out part of KVA and use it as a sparse array
  • there is no strict need to use a "full" 32-byte lock: instead we can hack together a smaller lock variant which preserves all the semantics of mutexes and takes only 8 bytes (or even less with some hackery)

There is a small wart here: two segments at the beginning are disjoint in terms of physical address space, but they both fit in the same superpage. The loop as implemented thus wastes one allocation to accommodate a struct which is never used. This can easily be fixed. However, I don't think trying to accommodate anything of the sort in later segments is warranted (it's more bug-prone than beneficial). There are no correctness problems stemming from this; worst case some memory is leaked.

Preferably this would be a bit-sized spinlock embedded in pv_gen, but the code does a lot of work with it held, including allocating memory for the radix tree, so changing this would require significant surgery. I have a rough idea how to implement a lock which takes 2 bits and provides all the needed semantics (including priority propagation), but it's way too hackish if it is only to be used for this purpose.

Diff Detail

Repository
rS FreeBSD src repository
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

mjg created this revision. Sep 28 2019, 7:02 PM
mjg edited the summary of this revision. Sep 28 2019, 7:05 PM
markj added a reviewer: alc. Sep 28 2019, 7:06 PM
mjg edited the summary of this revision. Sep 28 2019, 7:06 PM
mjg edited the summary of this revision. Sep 28 2019, 7:21 PM
jeff added inline comments. Sep 28 2019, 9:50 PM
sys/amd64/amd64/pmap.c
319 ↗(On Diff #62692)

I assume this define is dead.

1824 ↗(On Diff #62692)

I think you should fix this before this version goes in. There isn't any need for the double indirection.

Just use kva_alloc() to allocate the memory. Then you can populate it yourself.

kib added inline comments. Sep 29 2019, 2:49 PM
sys/amd64/amd64/pmap.c
1947 ↗(On Diff #62692)

You only need to do this if pg_ps is enabled?

mjg added inline comments. Sep 29 2019, 9:31 PM
sys/amd64/amd64/pmap.c
319 ↗(On Diff #62692)

No, but its user should probably be modified to use MAXCPU or similar:

static u_long pv_invl_gen[NPV_LIST_LOCKS];

1824 ↗(On Diff #62692)

I noted this may be an easy option, but that turned out to be too naive. With a size of 48 bytes per object (as in: not a power of 2) we land in a situation where the space for some superpages ends up overlapping with the previous domain and therefore induces interconnect traffic when used. With the pointer array the waste is 40 bytes (56 - 16 which are already used). Removing the array drops 8 bytes per entry, but does not help here.

Thus the options I see are as follows:

  • if we bump entries by 16 bytes to make them all fit in one cache line, we add 8 bytes of waste on top of the current solution for each superpage, but it does have the upshot of increasing scalability a little in that each entry gets its own cache line.
  • we can decrease the size of each entry to 32 bytes, to fit two per line, but this would require hackery with the locking primitives. It so happens I'm working on integrating MCS locks (or CLH, I have not decided yet) into spinlocks, and that will come with the surgery needed to support smaller locks while preserving all the current functionality. In other words, the hackery needed now can be avoided if this waits.
  • we can go back to not embedding the lock in the struct. Then we allocate the space separately, but this has the worst performance characteristics: there are always 2 locks and 4 structs per line.
  • someone(tm) merges this with reservation support. This is probably the preferable long-term solution, but it also looks the most time-consuming (from a cursory reading of the code, that is). Just embedding the struct looks trivial, but making reservations be backed by their respective nodes looks problematic in the current state. I may get around to it at some point.

That said, I think the patch should go in with the current method unless someone is going to take care of the reservation stuff in the foreseeable future (in which case I'm happy to drop this patch).

1947 ↗(On Diff #62692)

Both variants take these locks so I don't think it's avoidable.

jeff added inline comments. Sep 29 2019, 9:39 PM
sys/amd64/amd64/pmap.c
1824 ↗(On Diff #62692)

Right now you're going to be touching remote memory (n-1)/n of the time anyway to follow your indirect table. You will only write to local memory, but you're still touching extra cache lines just to improve locality for a fraction of memory.

For the pagetable pages I just accept that the unaligned area will be imperfect. If you allocate a page at a time you will have something like 4k / 64 bytes per super * 2MB per super = 128MB of memory backed by locks that are in the wrong domain. That seems like a small cost to pay vs a double indirection.

Because the architecture has very strongly aligned physical memory, we could actually improve on this situation if we made tables that covered all of physical memory. For vm_page I believe the resulting alignment would work out to gigabyte boundaries. However, this would waste the space needed to represent a lot of inaccessible memory.

markj added inline comments. Sep 29 2019, 10:12 PM
sys/amd64/amd64/pmap.c
1824 ↗(On Diff #62692)

The argument for strict affinity of the pv list heads and locks also presumes that accesses are likely to be local in the first place, which is not true in general when creating or destroying mappings of shared VM objects. For example, if the system libc.so .text is backed by a superpage we are going to get many non-local accesses anyway.

mjg updated this revision to Diff 62933. Oct 5 2019, 2:56 PM
  • keep the old code when VM_NRESERVLEVEL == 0
  • allocate the area using kva_alloc. Inserting pages with kmem_back_domain gives me a panic ("vm_reserv_alloc_page: domain mismatch"), which looks like a bug elsewhere and may be worth fixing regardless of this patch
  • bump pmap_large_md_page to 64 bytes. I was somehow convinced md_page is 16 bytes, but it turned out to be 24. Adding the lock makes it 56. I added the invl gen counter as well (since it has superpage granularity anyway), which makes it 64 bytes in total. Makes for slightly simpler code and a slight bump in performance.

I'm not fond of kva_alloc here (just like I was not of the array in the previous patch). I think the address of the area should be known at compilation time.

markj said he will look into changing the way reservations get allocated to distribute them per domain. Should this be completed, the patch as presented can be easily augmented to use the area.

mjg updated this revision to Diff 62934. Oct 5 2019, 3:09 PM
  • add the missing pvd->pv_invl_gen = 0;
kib added inline comments. Oct 5 2019, 6:54 PM
sys/amd64/amd64/pmap.c
321 ↗(On Diff #62934)

Can you add INVARIANTS versions of the macros which assert that we do not access beyond the array end? You would need to save the array size somewhere.

1843 ↗(On Diff #62934)

Are you sure that int is enough there for all practical purposes? It seems that overflow happens in the PB range; I do not see why not use (u_)longs there.

1844 ↗(On Diff #62934)

I think it is better to cast pv_npg instead of casting the result, i.e. remove the () around the multiplication.

mjg updated this revision to Diff 62942. Oct 5 2019, 8:14 PM
  • change several vars to long type
  • assert PA within the range
jeff added inline comments. Oct 6 2019, 2:46 AM
sys/amd64/amd64/pmap.c
1835 ↗(On Diff #62942)

Do we need variants for reservation-enabled kernels and not? It seems like the same code should work for both.

1886 ↗(On Diff #62942)

I don't understand why kmem_back_domain did not work here.

Maybe kib can comment about why pmap_qenter() works when pmap_enter() does not? It seems that it should.

mjg added inline comments. Oct 6 2019, 3:07 AM
sys/amd64/amd64/pmap.c
1835 ↗(On Diff #62942)

It does, but the old one uses less memory and there is no difficulty in keeping both.

1886 ↗(On Diff #62942)

phabricator hid one of my previous updates:

the panic comes from vm_reserv_alloc_page:

rv = vm_reserv_from_object(object, pindex, mpred, &msucc);
if (rv != NULL) {
        KASSERT(object != kernel_object || rv->domain == domain,
            ("vm_reserv_alloc_page: domain mismatch"));

It happens on the very first alloc from another domain. Looks like a bug in this code, as it is clearly not prepared to get a different domain. Worked around by allocating without a vm obj.

kib added inline comments. Oct 6 2019, 12:34 PM
sys/amd64/amd64/pmap.c
1842 ↗(On Diff #62942)

I believe j should be long as well. BTW, why not u_long?

1910 ↗(On Diff #62942)

i should be long as well.

mjg updated this revision to Diff 62969. Oct 6 2019, 7:10 PM
  • assert (sizeof(*pvd) == 64)
  • change non-superpage i type to long
mjg added inline comments. Oct 6 2019, 7:11 PM
sys/amd64/amd64/pmap.c
1842 ↗(On Diff #62942)

int j gives coverage of 64TB for one segment; I think that's good enough for a long time and it is a little nicer next to i.

A signed long allows a simpler loop with "highest = -1".

kib accepted this revision. Oct 6 2019, 7:13 PM
This revision is now accepted and ready to land. Oct 6 2019, 7:13 PM
This revision was automatically updated to reflect the committed changes.
alc added inline comments. Oct 7 2019, 6:17 PM
head/sys/amd64/amd64/pmap.c
325

It's not clear to me why this is conditioned on whether reservations are enabled, i.e., VM_NRESERVLEVEL > 0. I would have thought that this would be conditioned on whether NUMA support is enabled, or on something related to the number of processors.