As discussed with markj and jeff, the space wasted by adding a lock per superpage may be tolerable given a big enough win.
pv list locks are highly contended during poudriere -j 104. Results below are total wait times from 90 minutes of said workload with head as of r352837 + local patches.
| head | per-superpage lock | per-superpage lock + batching |
| --- | --- | --- |
| **14750058915 (rw:pmap pv list)** | 3854385128 (sleep mutex:vm page) | 3989607911 (sleep mutex:vm page) |
| 3374286316 (sx:vm map (user)) | 2256786712 (rw:vm object) | 2164843658 (sx:vm map (user)) |
| 3331328547 (sleep mutex:vm page) | 2173768388 (sx:vm map (user)) | 2043301274 (rw:vm object) |
| 2605370237 (rw:vm object) | 1526533364 (sx:proctree) | 1461144904 (sx:proctree) |
| 1286594764 (sx:proctree) | **1346192588 (rw:pmap pv list)** | 1040647132 (sleep mutex:VM reserv domain) |
| 867052484 (sleep mutex:ncvn) | 966399834 (sleep mutex:ncvn) | **926036395 (rw:pmap pv list)** |
| 748340242 (sleep mutex:VM reserv domain) | 913893270 (sleep mutex:VM reserv domain) | 617706321 (sleep mutex:ncvn) |
| 498943272 (lockmgr:tmpfs) | 780144491 (sleep mutex:pmap pv chunk list) | 499182196 (sleep mutex:pfs_vncache) |
Combined with the batching from D21832, this in my opinion provides a win which justifies the extra space.
The extra space can be reduced in two ways with minor work:
- the pointer array is avoidable; instead we can carve out part of KVA and use it as a sparse array (a rough sketch follows this list)
- there is no strict need to use a "full" 32-byte lock; instead we can hack together a smaller lock variant which preserves all the semantics of mutexes and takes only 8 bytes (or even less with some hackery)
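For illustration only, here is a minimal userspace sketch of the KVA carve-out idea: reserve one large address range up front and index it by superpage frame number, so backing pages for the lock array are only faulted in where it is actually touched. This is not the patch; the names (`sp_locks`, `sp_lock_for`), the 2 MB superpage shift and the 1 TB physical-memory bound are all assumptions made for the example.

```c
/*
 * Userspace analogue of "carve out part of KVA and use it as a sparse
 * array": reserve the whole range once, index it by superpage number,
 * and rely on lazy faulting so only touched slots consume memory.
 * All names here are hypothetical, not from the actual patch.
 */
#include <sys/mman.h>

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define	SUPERPAGE_SHIFT	21			/* 2 MB superpages (amd64) */
#define	MAX_PHYS	(1ULL << 40)		/* assume <= 1 TB of RAM */
#define	NSUPERPAGES	(MAX_PHYS >> SUPERPAGE_SHIFT)

static pthread_mutex_t *sp_locks;		/* sparse array of locks */

static void
sp_locks_init(void)
{
	/* Reserve the address range; pages are only faulted in on use. */
	sp_locks = mmap(NULL, NSUPERPAGES * sizeof(*sp_locks),
	    PROT_READ | PROT_WRITE, MAP_ANON | MAP_PRIVATE, -1, 0);
	if (sp_locks == MAP_FAILED) {
		perror("mmap");
		exit(1);
	}
}

/* Map a physical superpage address to its lock slot. */
static pthread_mutex_t *
sp_lock_for(uint64_t pa)
{
	return (&sp_locks[pa >> SUPERPAGE_SHIFT]);
}

int
main(void)
{
	pthread_mutex_t *l;

	sp_locks_init();
	l = sp_lock_for(0x40000000ULL);		/* superpage at the 1 GB mark */
	pthread_mutex_init(l, NULL);
	pthread_mutex_lock(l);
	printf("locked slot %p\n", (void *)l);
	pthread_mutex_unlock(l);
	return (0);
}
```

In the kernel the reservation would come out of KVA rather than a user mapping, but the indexing scheme is the same and it removes the per-superpage pointer.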
Preferably this would be a bit-sized spinlock embedded in pv_gen, but the code does a lot of work with the lock held, including allocating memory for the radix tree, so changing that would require significant surgery. I have a rough idea for a lock which takes 2 bits and provides all the needed semantics, but it is way too hackish if it is only going to be used for this purpose.
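To make the "bit-sized spinlock embedded in pv_gen" idea concrete, here is a small C11-atomics sketch of stealing the low bit of a generation word for a spinlock; the `genlock` name and functions are hypothetical, it is userspace code rather than kernel code, and, as noted above, a plain spin bit is not sufficient for the real pv code because it sleeps and allocates with the lock held.

```c
/*
 * Sketch only: the low bit of a generation counter doubles as a
 * spinlock, so the lock costs no extra space.  Not the patch, and not
 * the 2-bit lock alluded to above.
 */
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define	GEN_LOCKED	0x1ULL		/* low bit: lock held */
#define	GEN_STEP	0x2ULL		/* generation advances in steps of 2 */

struct genlock {
	_Atomic uint64_t word;		/* generation counter | lock bit */
};

static void
genlock_acquire(struct genlock *g)
{
	uint64_t old;

	for (;;) {
		old = atomic_load_explicit(&g->word, memory_order_relaxed);
		if ((old & GEN_LOCKED) == 0 &&
		    atomic_compare_exchange_weak_explicit(&g->word, &old,
		    old | GEN_LOCKED, memory_order_acquire,
		    memory_order_relaxed))
			return;
		sched_yield();		/* crude backoff while contended */
	}
}

static void
genlock_release(struct genlock *g)
{
	/*
	 * The counter is always even when unlocked, so adding
	 * GEN_STEP - GEN_LOCKED clears the lock bit and bumps the
	 * generation in a single atomic operation.
	 */
	atomic_fetch_add_explicit(&g->word, GEN_STEP - GEN_LOCKED,
	    memory_order_release);
}

int
main(void)
{
	struct genlock g = { .word = 0 };

	genlock_acquire(&g);
	genlock_release(&g);
	printf("gen=%llu\n", (unsigned long long)atomic_load(&g.word));
	return (0);
}
```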