The patch adds an individual blockable lock, an OBM (one-byte mutex), for each small page's pv list. This way there is no aliasing and no false blocking on another page's pv list. The lock only needs a single byte in md_page. The turnstiles used for blocking are still shared per superpage/hash function, as in the stock kernel, i.e. a blocked thread gets wakeups from its siblings. This seems to occur rarely (I do not have numbers).
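A minimal userspace sketch of such a one-byte blockable mutex, using C11 atomics and the classic three-state (free/locked/contested) scheme. This is my illustration, not the patch's code: the real lock parks waiters on the shared turnstiles, while here contested acquisition just yields.

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical OBM sketch: the whole lock is one byte. */
typedef _Atomic uint8_t obm_t;

enum { OBM_UNLOCKED = 0, OBM_LOCKED = 1, OBM_CONTESTED = 2 };

static bool
obm_trylock(obm_t *m)
{
	uint8_t exp = OBM_UNLOCKED;

	return (atomic_compare_exchange_strong(m, &exp, OBM_LOCKED));
}

static void
obm_lock(obm_t *m)
{
	if (obm_trylock(m))
		return;
	/*
	 * Mark the lock contested so the owner knows waiters exist;
	 * the kernel version would block on a turnstile here.
	 */
	while (atomic_exchange(m, OBM_CONTESTED) != OBM_UNLOCKED)
		sched_yield();
}

static void
obm_unlock(obm_t *m)
{
	/*
	 * If the old state was contested, a real implementation
	 * would wake up waiters from the shared turnstile.
	 */
	(void)atomic_exchange(m, OBM_UNLOCKED);
}
```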
Unfortunately, I cannot easily implement a shared mode for OBM; at the very least it would need a 'saturated' shared state for when there are more active readers than the byte can count. This is unfortunate because I think that with fine-grained locking there is a real opportunity to put shared mode to good use.
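To illustrate the saturation problem, here is a hypothetical encoding of mine (not from the patch): the byte counts readers up to a limit, above which the count collapses into a sticky saturated state. Once saturated, a releasing reader can no longer tell whether it is the last one, which is exactly why shared mode is hard to retrofit.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

typedef _Atomic uint8_t obm_t;

/* Hypothetical layout: 0 free, 1..0xfd reader count,
 * 0xfe saturated (count lost), 0xff exclusive. */
enum { OBM_FREE = 0x00, OBM_RD_MAX = 0xfd, OBM_RD_SAT = 0xfe,
       OBM_EXCL = 0xff };

static bool
obm_try_rlock(obm_t *m)
{
	uint8_t v = atomic_load(m), nv;

	for (;;) {
		if (v == OBM_EXCL)
			return (false);
		/* One more reader than we can count: saturate. */
		nv = (v >= OBM_RD_MAX) ? OBM_RD_SAT : v + 1;
		if (atomic_compare_exchange_weak(m, &v, nv))
			return (true);
	}
}

static void
obm_runlock(obm_t *m)
{
	uint8_t v = atomic_load(m);

	for (;;) {
		/*
		 * Saturated: we cannot know if we are the last reader,
		 * so the state stays stuck; a real implementation would
		 * need some out-of-band drain mechanism here.
		 */
		if (v == OBM_RD_SAT)
			return;
		if (atomic_compare_exchange_weak(m, &v, v - 1))
			return;
	}
}
```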
OBM would also benefit from lock debugging and profiling support.
This is extracted from my WIP branch reworking pv list handling. Apparently, making UMA outperform the highly optimized specialized pv chunk allocator is hard, and arguably not needed.
I plan to eliminate the global (per-domain) pv chunk list lock next. For that, note that we can get away with a global pmap list, rotating both the pmap list and the per-pmap chunk list to approximate pc_lru ordering. My reasonable belief is that strict LRU ordering does not matter once we are in a situation where chunk reclamation is needed.
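The rotation idea can be sketched as follows; this is a userspace miniature under my own assumed names, not the patch's code. A reclamation pass takes the head of the list and rotates it to the tail, so repeated passes cycle through pmaps in an approximately least-recently-scanned order without maintaining a strict, globally locked pc_lru list.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical stand-in for a pmap on the global list. */
struct pmap_stub {
	int id;
	TAILQ_ENTRY(pmap_stub) link;
};
TAILQ_HEAD(pmap_list, pmap_stub);

/*
 * Pick the pmap at the head for reclamation and rotate it to the
 * tail.  The per-pmap chunk list would be rotated the same way.
 */
static struct pmap_stub *
reclaim_next(struct pmap_list *l)
{
	struct pmap_stub *p = TAILQ_FIRST(l);

	if (p != NULL) {
		TAILQ_REMOVE(l, p, link);
		TAILQ_INSERT_TAIL(l, p, link);
	}
	return (p);
}
```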
I do not have a large machine; on a 16-core / 32-thread box, with `buildkernel -s -j 40` over a tmpfs objdir, I got:
```
stock
237.53 real 6782.41 user 567.74 sys
238.08 real 6769.18 user 569.86 sys
237.39 real 6783.97 user 570.91 sys
patched
227.86 real 6730.43 user 307.43 sys
226.40 real 6737.99 user 304.76 sys
227.33 real 6733.44 user 305.36 sys
```