As discussed with markj and jeff, the waste stemming from adding locks per superpage may be tolerable given a good enough win.
pv list locks are highly contended during poudriere -j 104. Results below are total wait times from 90 minutes of said workload with head as of r352837 + local patches.
| head | per-superpage lock | per-superpage lock + batching | 
| 14750058915 (rw:pmap pv list) | 3854385128 (sleep mutex:vm page) | 3989607911 (sleep mutex:vm page) | 
| 3374286316 (sx:vm map (user)) | 2256786712 (rw:vm object) | 2164843658 (sx:vm map (user)) | 
| 3331328547 (sleep mutex:vm page) | 2173768388 (sx:vm map (user)) | 2043301274 (rw:vm object) | 
| 2605370237 (rw:vm object) | 1526533364 (sx:proctree) | 1461144904 (sx:proctree) | 
| 1286594764 (sx:proctree) | 1346192588 (rw:pmap pv list) | 1040647132 (sleep mutex:VM reserv domain) | 
| 867052484 (sleep mutex:ncvn) | 966399834 (sleep mutex:ncvn) | 926036395 (rw:pmap pv list) | 
| 748340242 (sleep mutex:VM reserv domain) | 913893270 (sleep mutex:VM reserv domain) | 617706321 (sleep mutex:ncvn) | 
| 498943272 (lockmgr:tmpfs) | 780144491 (sleep mutex:pmap pv chunk list) | 499182196 (sleep mutex:pfs_vncache) | 
That's about 10 x reduction for per-superpage locking and 16 x reduction when combined with batching ( D21832 ).
This in my opinion provides a win with justifies the extra space.
Extra space can be reduced in 2 ways with minor work:
- the pointer array is avoidable. instead we can carve out part of KVA and use it as a sparse array
- there is no strict need to use a "full" 32-byte lock. instead we can hack around a smaller lock variant which preserves all semantics of mutexes and only takes 8 bytes (or even less with some hackery)
There is a small wart here: two segments at the beginning are disjoint in terms of physical address space, but they both fit in the same superpage. The loop as implemented thus wastes one allocation to accomdate the struct which is never used. This can be easily fixed. However, I don't think trying to accomodate for anything of the sort in later segments is warranted (it's more bug-prone than beneficial). There are no correctness problems stemming from this, worst case there is leaked memory.
Preferably this would be a bit-sized spinlock embedded in pv_gen, but the code is doing a lot of work with it held including allocating memory for radix tree. Thus changing this would require significant surgery. I have a rough idea how to implement a lock which takes 2 bits and provides all the needed semantics (including priority propagation), but it's way too hackish if it is only to be used for this purpose.