The patch adds an individual blockable lock for each small page's pv list: the OBM (one-byte mutex). This way there is no aliasing and no false blocking on another page's pv list. The lock needs only a single byte in md_page. The turnstiles used for blocking are still shared per superpage/hash function, as in the stock kernel, i.e. a blocked thread can get wakeups meant for its siblings. This seems to occur rarely (I do not have numbers).
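To illustrate why one byte is enough for the lock word, here is a minimal userspace sketch using C11 atomics. It is not the code from the patch: the names obm_trylock/obm_lock/obm_unlock, the state encoding, and the use of sched_yield() in place of blocking on a turnstile are all assumptions made for illustration.

```c
#include <assert.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical state encoding; the encoding used by the patch may differ. */
enum { OBM_UNLOCKED = 0, OBM_LOCKED = 1 };

/* The whole lock is one byte, so it fits into a spare byte of md_page. */
struct obm {
	_Atomic unsigned char lk;
};
_Static_assert(sizeof(struct obm) == 1, "lock word must be one byte");

/* Try to take the lock without waiting; returns true on success. */
static bool
obm_trylock(struct obm *m)
{
	unsigned char exp = OBM_UNLOCKED;

	return (atomic_compare_exchange_strong_explicit(&m->lk, &exp,
	    OBM_LOCKED, memory_order_acquire, memory_order_relaxed));
}

/*
 * Take the lock, waiting if needed.  A kernel implementation would
 * record that there are waiters in the lock byte and sleep on a
 * (shared) turnstile; this sketch only yields the CPU and retries.
 */
static void
obm_lock(struct obm *m)
{
	while (!obm_trylock(m))
		sched_yield();
}

/* Release the lock; the caller must hold it. */
static void
obm_unlock(struct obm *m)
{
	unsigned char old;

	old = atomic_exchange_explicit(&m->lk, OBM_UNLOCKED,
	    memory_order_release);
	assert(old == OBM_LOCKED);
	(void)old;
}
```

Because the wait channel is not stored in the lock itself, waiters for different pages can end up on the same turnstile, which is where the spurious sibling wakeups mentioned above come from.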
For superpage promotion and demotion, I lock the whole run of small pages that constitutes the superpage. This was my largest hesitation, but I convinced myself that the code touches 512 pv entries anyway, so touching 512 bytes from vm_pages that are hot anyway is fine. For the obscure semantics of the pde pv lock, see the herald comment above pmap_pv_list_lock_pde().
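The cost argument can be pictured with a naive userspace sketch that takes all 512 per-page lock bytes of a run in ascending order. This is only an illustration under assumptions: the type pv_lock_byte and the helpers byte_lock, byte_unlock, run_lock and run_unlock are hypothetical, the spin-and-yield wait stands in for real blocking, and the actual semantics of the pde pv lock are the ones described in the herald comment, which may differ from this.

```c
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>

#define NPTEPG	512	/* small pages per 2M superpage on amd64 */

/* Stand-in for the per-page lock byte kept in md_page (layout assumed). */
typedef _Atomic unsigned char pv_lock_byte;

/* Acquire one lock byte: 0 means unlocked, 1 means locked. */
static void
byte_lock(pv_lock_byte *b)
{
	unsigned char exp = 0;

	while (!atomic_compare_exchange_weak_explicit(b, &exp, 1,
	    memory_order_acquire, memory_order_relaxed)) {
		exp = 0;	/* a failed CAS overwrote exp */
		sched_yield();
	}
}

static void
byte_unlock(pv_lock_byte *b)
{
	atomic_store_explicit(b, 0, memory_order_release);
}

/*
 * Lock every small page of the run.  Taking the locks in ascending
 * order means two threads locking the same run cannot deadlock
 * against each other.
 */
static void
run_lock(pv_lock_byte *run)
{
	for (size_t i = 0; i < NPTEPG; i++)
		byte_lock(&run[i]);
}

/* Unlock the run in reverse order. */
static void
run_unlock(pv_lock_byte *run)
{
	for (size_t i = NPTEPG; i > 0; i--)
		byte_unlock(&run[i - 1]);
}
```

In the kernel the lock bytes are spread across 512 vm_page structures rather than packed in one array, so the loop touches one cache line per page; the point above is that promotion and demotion already touch those structures.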
Unfortunately, I cannot easily implement a shared mode for OBM, or at least it would need a 'saturated' shared mode for when there are more active readers than there is space to count them. This is a pity, because I think that with fine-grained locking there is a real opportunity to see shared mode put to good use.
OBM would also benefit from lock debugging and profiling support.
This is extracted from my WIP branch for the pv list handling rework. Apparently, making UMA outperform the highly optimized, specialized pv chunk allocator is hard, and arguably not needed.
Next, I eliminate the global (per-domain) pv chunk list lock. The observation here is that we can get away with the global pmap list, rotating both the pmap list and the chunk list within each pmap to approximate pc_lru ordering. My reasonable belief is that strict LRU does not matter once we are in a situation where chunk reclamation is needed.
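The rotation idea can be sketched with the TAILQ macros from <sys/queue.h>. This is a hedged illustration only, not the patch's code: struct chunk, struct chunk_list and chunk_list_rotate() are hypothetical names, and the same helper shape is assumed to apply to the pmap list as well.

```c
#include <stddef.h>
#include <sys/queue.h>

/* Hypothetical stand-in for a pv chunk linked into its pmap's list. */
struct chunk {
	int id;
	TAILQ_ENTRY(chunk) link;
};
TAILQ_HEAD(chunk_list, chunk);

/*
 * Return the current head as the reclamation candidate and move it to
 * the tail, so that the next call looks at a different element.  This
 * approximates LRU ordering without a separate global LRU list.
 */
static struct chunk *
chunk_list_rotate(struct chunk_list *l)
{
	struct chunk *c;

	c = TAILQ_FIRST(l);
	if (c != NULL && TAILQ_NEXT(c, link) != NULL) {
		TAILQ_REMOVE(l, c, link);
		TAILQ_INSERT_TAIL(l, c, link);
	}
	return (c);
}
```

Successive calls walk the list round-robin, which is weaker than strict LRU but needs no lock beyond the one that already protects the list being rotated.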
I do not have a large machine. On a 16-CPU / 32-thread box, with buildkernel -s -j 40 over a tmpfs objdir, I got
stock
  237.53 real  6782.41 user  567.74 sys
  238.08 real  6769.18 user  569.86 sys
  237.39 real  6783.97 user  570.91 sys
patched
  227.86 real  6730.43 user  307.43 sys
  226.40 real  6737.99 user  304.76 sys
  227.33 real  6733.44 user  305.36 sys
(look at the sys time).
On pig1 with NUMA enabled, the results are not that dramatic. This needs to be checked on a large box.
stock
  116.11 real  7901.83 user  446.20 sys
  116.03 real  7945.72 user  447.52 sys
  117.55 real  7902.05 user  450.55 sys
pig1 (only obm):
  114.19 real  7879.51 user  438.05 sys
  116.39 real  7882.03 user  436.02 sys
pig1 (obm + removal of pvc locks):
  115.82 real  7911.22 user  429.60 sys
  116.64 real  7898.30 user  433.64 sys
  115.68 real  7918.47 user  430.82 sys
Tested by: pho