As discussed with markj and jeff, the space wasted by adding a lock per superpage may be tolerable given a big enough win.
pv list locks are highly contended during poudriere -j 104. Results below are total wait times from 90 minutes of said workload with head as of r352837 + local patches.
| head | per-superpage lock | per-superpage lock + batching |
| --- | --- | --- |
| **14750058915 (rw:pmap pv list)** | 3854385128 (sleep mutex:vm page) | 3989607911 (sleep mutex:vm page) |
| 3374286316 (sx:vm map (user)) | 2256786712 (rw:vm object) | 2164843658 (sx:vm map (user)) |
| 3331328547 (sleep mutex:vm page) | 2173768388 (sx:vm map (user)) | 2043301274 (rw:vm object) |
| 2605370237 (rw:vm object) | 1526533364 (sx:proctree) | 1461144904 (sx:proctree) |
| 1286594764 (sx:proctree) | **1346192588 (rw:pmap pv list)** | 1040647132 (sleep mutex:VM reserv domain) |
| 867052484 (sleep mutex:ncvn) | 966399834 (sleep mutex:ncvn) | **926036395 (rw:pmap pv list)** |
| 748340242 (sleep mutex:VM reserv domain) | 913893270 (sleep mutex:VM reserv domain) | 617706321 (sleep mutex:ncvn) |
| 498943272 (lockmgr:tmpfs) | 780144491 (sleep mutex:pmap pv chunk list) | 499182196 (sleep mutex:pfs_vncache) |
Combined with the batching from D21832, this in my opinion provides a win which justifies the extra space.
The extra space can be reduced in two ways with minor work:
- the pointer array is avoidable; instead we can carve out part of KVA and use it as a sparse array (a rough sketch follows this list)
- there is no strict need to use a "full" 32-byte lock; instead we can hack together a smaller lock variant which preserves all the semantics of mutexes and takes only 8 bytes (or even less with some hackery)
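For illustration only, here is a minimal userspace sketch of the KVA carve-out idea: reserve one large address range up front and index it by superpage frame number, so backing pages for the lock array are only faulted in where it is actually touched. This is not the patch; the names (`sp_locks`, `sp_lock_for`), the 2 MB superpage shift and the 1 TB physical-memory bound are all assumptions made for the example.

```c
/*
 * Userspace analogue of "carve out part of KVA and use it as a sparse
 * array": reserve the whole range once, index it by superpage number,
 * and rely on lazy faulting so only touched slots consume memory.
 * All names here are hypothetical, not from the actual patch.
 */
#include <sys/mman.h>

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define	SUPERPAGE_SHIFT	21			/* 2 MB superpages (amd64) */
#define	MAX_PHYS	(1ULL << 40)		/* assume <= 1 TB of RAM */
#define	NSUPERPAGES	(MAX_PHYS >> SUPERPAGE_SHIFT)

static pthread_mutex_t *sp_locks;		/* sparse array of locks */

static void
sp_locks_init(void)
{
	/* Reserve the address range; pages are only faulted in on use. */
	sp_locks = mmap(NULL, NSUPERPAGES * sizeof(*sp_locks),
	    PROT_READ | PROT_WRITE, MAP_ANON | MAP_PRIVATE, -1, 0);
	if (sp_locks == MAP_FAILED) {
		perror("mmap");
		exit(1);
	}
}

/* Map a physical superpage address to its lock slot. */
static pthread_mutex_t *
sp_lock_for(uint64_t pa)
{
	return (&sp_locks[pa >> SUPERPAGE_SHIFT]);
}

int
main(void)
{
	pthread_mutex_t *l;

	sp_locks_init();
	l = sp_lock_for(0x40000000ULL);		/* superpage at the 1 GB mark */
	pthread_mutex_init(l, NULL);
	pthread_mutex_lock(l);
	printf("locked slot %p\n", (void *)l);
	pthread_mutex_unlock(l);
	return (0);
}
```

In the kernel the reservation would come out of KVA rather than a user mapping, but the indexing scheme is the same and it removes the per-superpage pointer.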
Preferably this would be a bit-sized spinlock embedded in pv_gen, but the code does a lot of work with the lock held, including allocating memory for the radix tree, so changing that would require significant surgery. I have a rough idea for a lock which takes 2 bits and provides all the needed semantics, but it is way too hackish if it is only going to be used for this purpose.
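To make the "bit-sized spinlock embedded in pv_gen" idea concrete, here is a small C11-atomics sketch of stealing the low bit of a generation word for a spinlock; the `genlock` name and functions are hypothetical, it is userspace code rather than kernel code, and, as noted above, a plain spin bit is not sufficient for the real pv code because it sleeps and allocates with the lock held.

```c
/*
 * Sketch only: the low bit of a generation counter doubles as a
 * spinlock, so the lock costs no extra space.  Not the patch, and not
 * the 2-bit lock alluded to above.
 */
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define	GEN_LOCKED	0x1ULL		/* low bit: lock held */
#define	GEN_STEP	0x2ULL		/* generation advances in steps of 2 */

struct genlock {
	_Atomic uint64_t word;		/* generation counter | lock bit */
};

static void
genlock_acquire(struct genlock *g)
{
	uint64_t old;

	for (;;) {
		old = atomic_load_explicit(&g->word, memory_order_relaxed);
		if ((old & GEN_LOCKED) == 0 &&
		    atomic_compare_exchange_weak_explicit(&g->word, &old,
		    old | GEN_LOCKED, memory_order_acquire,
		    memory_order_relaxed))
			return;
		sched_yield();		/* crude backoff while contended */
	}
}

static void
genlock_release(struct genlock *g)
{
	/*
	 * The counter is always even when unlocked, so adding
	 * GEN_STEP - GEN_LOCKED clears the lock bit and bumps the
	 * generation in a single atomic operation.
	 */
	atomic_fetch_add_explicit(&g->word, GEN_STEP - GEN_LOCKED,
	    memory_order_release);
}

int
main(void)
{
	struct genlock g = { .word = 0 };

	genlock_acquire(&g);
	genlock_release(&g);
	printf("gen=%llu\n", (unsigned long long)atomic_load(&g.word));
	return (0);
}
```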