MFC r313454,r313472:
rwlock: implement rlock/runlock fast path
This improves single-threaded throughput on my test machine from ~247
million ops/s to ~328 million ops/s.
It is mostly about avoiding the setup cost of lockstat.
rwlock: fix r313454
The runlock slow path would update the wrong variable before restarting
the loop, in effect corrupting the lock state.
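
For illustration only, a minimal userland sketch of the two ideas above,
written with C11 atomics. This is not the FreeBSD rwlock code: the toy_*
names, the lock word layout and the spin-only slow path are all
assumptions made for the example. The fast path is a single inlined
compare-and-swap with no profiling setup, and the unlock loop retries
with the same variable that the failed compare-and-swap refreshed.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout: bit 0 = write-locked, upper bits = reader count. */
#define TOY_WRITER	((uintptr_t)1)
#define TOY_ONE_READER	((uintptr_t)2)

struct toy_rwlock {
	_Atomic uintptr_t word;
};

/*
 * One read-lock attempt. On failure *vp is refreshed with the value
 * currently stored in the lock word.
 */
static inline bool
toy_rlock_try(struct toy_rwlock *rw, uintptr_t *vp)
{
	if (*vp & TOY_WRITER)
		return (false);
	return (atomic_compare_exchange_strong_explicit(&rw->word, vp,
	    *vp + TOY_ONE_READER, memory_order_acquire,
	    memory_order_relaxed));
}

static void
toy_rlock(struct toy_rwlock *rw)
{
	uintptr_t v;

	v = atomic_load_explicit(&rw->word, memory_order_relaxed);
	/* Fast path: uncontended case, nothing else is set up. */
	if (toy_rlock_try(rw, &v))
		return;
	/* Slow path (sketch): spin until no writer holds the lock. */
	for (;;) {
		if (v & TOY_WRITER) {
			v = atomic_load_explicit(&rw->word,
			    memory_order_relaxed);
			continue;
		}
		if (toy_rlock_try(rw, &v))
			return;
	}
}

static void
toy_runlock(struct toy_rwlock *rw)
{
	uintptr_t v;

	v = atomic_load_explicit(&rw->word, memory_order_relaxed);
	/*
	 * A failed compare-and-swap stores the current lock word into v,
	 * and v is what the next iteration uses. Refreshing some other
	 * variable here would retry with a stale value.
	 */
	while (!atomic_compare_exchange_weak_explicit(&rw->word, &v,
	    v - TOY_ONE_READER, memory_order_release, memory_order_relaxed))
		;
}
```

With a zero-initialized struct toy_rwlock, calling toy_rlock() twice and
then toy_runlock() twice returns the lock word to 0.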