MFC r313269,r313270,r313271,r313272,r313274,r313278,r313279,r313996,r314474
mtx: switch to fcmpset
The found value is passed to locking routines in order to reduce cacheline
accesses.
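As an illustration only, the fetching compare-and-set can be sketched with C11 atomics, whose compare-exchange writes the observed value back on failure. The names `UNOWNED` and `try_lock_fetch` are hypothetical stand-ins, not the kernel macros:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define UNOWNED ((uintptr_t)0)	/* hypothetical "lock is free" value */

/*
 * Try to take the lock for 'tid'. On failure *found holds the value that
 * was actually seen in the lock word, so a slow path can start from it
 * instead of reading the cacheline again.
 */
static bool
try_lock_fetch(_Atomic uintptr_t *lock_word, uintptr_t tid, uintptr_t *found)
{
	*found = UNOWNED;
	return (atomic_compare_exchange_strong_explicit(lock_word, found, tid,
	    memory_order_acquire, memory_order_relaxed));
}
```

A plain cmpset would only report failure, and the fallback would have to reload the lock word to learn who owns it.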
mtx_unlock grows an explicit check for regular unlock. On ll/sc architectures
the routine can fail even if the lock could have been handled by the inline
primitive.
rwlock: switch to fcmpset
sx: switch to fcmpset
sx: uninline slock/sunlock
Shared locking routines explicitly read the value and test it. If the
change attempt fails, they fall back to a regular function which would
retry in a loop.
The problem is that with many concurrent readers the risk of failure is pretty
high and even the value returned by fcmpset is very likely going to be stale
by the time the loop in the fallback routine is reached.
Uninline said primitives. Doing so raises the throughput of concurrent
slocks/sunlocks on 80 hardware threads from ~50 mln/s to ~56 mln/s.
Interestingly, rwlock primitives are already not inlined.
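A rough sketch of what an out-of-line shared lock attempt looks like, using C11 atomics. The bit layout (`WRITER`, `ONE_READER`) and the function name are assumptions made for illustration and do not match the real sx lock word:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define WRITER		((uintptr_t)1)	/* hypothetical "exclusively held" bit */
#define ONE_READER	((uintptr_t)2)	/* hypothetical reader-count increment */

/*
 * Read the lock word once, then retry in place. Every failed
 * compare-exchange refreshes 'v', so the next attempt uses the value
 * that was just observed rather than one handed over from an inline
 * fast path.
 */
bool
shared_trylock_loop(_Atomic uintptr_t *lock_word)
{
	uintptr_t v = atomic_load_explicit(lock_word, memory_order_relaxed);

	while ((v & WRITER) == 0) {
		if (atomic_compare_exchange_weak_explicit(lock_word, &v,
		    v + ONE_READER, memory_order_acquire, memory_order_relaxed))
			return (true);
	}
	return (false);	/* writer present; a real lock would block here */
}
```

With many readers the first attempt fails often, so keeping the read and the retry in one routine avoids a failed inline attempt followed by a function call carrying a stale value.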
sx: add witness support missed in r313272
mtx: fix up _mtx_obtain_lock_fetch usage in thread lock
Since _mtx_obtain_lock_fetch no longer sets the argument to MTX_UNOWNED,
callers have to do it on their own.
mtx: fix up r313278, the assignment was supposed to go inside the loop
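Why the assignment has to sit inside the loop can be sketched as follows. This is a simplified model with C11 atomics, and `obtain_lock_fetch`, `spin_acquire` and `max_spins` are hypothetical names, not the thread lock code itself:

```c
#include <stdatomic.h>
#include <stdint.h>

#define UNOWNED ((uintptr_t)0)	/* hypothetical "lock is free" value */

/* Fetch-style primitive: does NOT reset *v, the caller must do that. */
static int
obtain_lock_fetch(_Atomic uintptr_t *lock_word, uintptr_t *v, uintptr_t tid)
{
	return (atomic_compare_exchange_strong_explicit(lock_word, v, tid,
	    memory_order_acquire, memory_order_relaxed));
}

/* Returns the number of attempts used, or 0 if max_spins was exhausted. */
unsigned
spin_acquire(_Atomic uintptr_t *lock_word, uintptr_t tid, unsigned max_spins)
{
	uintptr_t v;
	unsigned i;

	for (i = 1; i <= max_spins; i++) {
		/*
		 * Reset on every iteration. After a failed attempt 'v'
		 * holds the current owner; retrying with that as the
		 * expected value would replace the owner with 'tid'.
		 */
		v = UNOWNED;
		if (obtain_lock_fetch(lock_word, &v, tid))
			return (i);
	}
	return (0);
}
```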
mtx: fix spin mutexes interaction with failed fcmpset
While doing so move recursion support down to the fallback routine.
locks: ensure proper barriers are used with atomic ops when necessary
It is unclear how, but the locking routine for mutexes was using the *release*
barrier instead of *acquire*. This must have been either a copy-paste error or
a bad completion.
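The intended pairing, acquire when taking the lock and release when dropping it, can be sketched with C11 memory orders. `lock_try` and `lock_release` are hypothetical helpers used only to show the barrier placement:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define UNOWNED ((uintptr_t)0)	/* hypothetical "lock is free" value */

bool
lock_try(_Atomic uintptr_t *lock_word, uintptr_t tid)
{
	uintptr_t v = UNOWNED;

	/* Acquire: critical-section accesses cannot be hoisted above this. */
	return (atomic_compare_exchange_strong_explicit(lock_word, &v, tid,
	    memory_order_acquire, memory_order_relaxed));
}

void
lock_release(_Atomic uintptr_t *lock_word)
{
	/* Release: critical-section accesses cannot sink below this. */
	atomic_store_explicit(lock_word, UNOWNED, memory_order_release);
}
```

Taking the lock with release semantics instead would still be atomic, but it would not stop loads in the critical section from being reordered before the acquisition on weakly ordered machines.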
Going through other uses of atomics shows no barriers in:
- upgrade routines (addressed in this patch)
- sections protected with turnstile locks - this should be fine as necessary
barriers are in the worst case provided by turnstile unlock
I would like to thank Mark Millard and andreast@ for reporting the problem and
testing previous patches before the issue got identified.