The acquisition and release of an uncontended default/normal pthread mutex
on FreeBSD is surprisingly slow. For example, pthread wrlocks and binary semaphores
both exhibit roughly 33% lower latency, while default/normal mutexes
on that other OS exhibit roughly 67% lower latency than FreeBSD. This
might be explained by the fact that (AFAICT), in the best case, acquiring
an uncontended mutex on the other OS need touch only 1 page and read/modify
only 1 cacheline, whereas on FreeBSD we need to touch at least 4 pages,
read 6 cachelines, and modify at least 4 cachelines.
This patch does not address the pthread mutex architecture. Instead, it
improves performance by adding the __always_inline attribute to mutex_lock_common()
and mutex_unlock_common() to encourage constant folding and propagation. This
lowers the latency to acquire and release a mutex by shortening the code
path, with fewer compares, jumps, and mispredicts.
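
To illustrate the effect, here is a minimal sketch. It is not the libthr
code: struct toy_mutex, toy_lock_common(), toy_unlock_common(), and the
wrappers are all hypothetical names made up for this example, and the
attribute spelling assumes GCC or Clang. Each wrapper passes a compile-time
constant flag, so once the common function is force-inlined the compiler can
fold the flag test away and emit a shorter path per wrapper.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical toy mutex; NOT the libthr struct pthread_mutex. */
struct toy_mutex {
	int owner;	/* 0 when unlocked, otherwise the owner's id */
	int count;	/* recursion depth */
};

/* Assumption: GCC/Clang attribute syntax. */
#define TOY_ALWAYS_INLINE static inline __attribute__((__always_inline__))

/* 'recursive' is a constant at every call site below. */
TOY_ALWAYS_INLINE int
toy_lock_common(struct toy_mutex *m, int self, bool recursive)
{
	if (m->owner == 0) {
		m->owner = self;
		m->count = 1;
		return (0);
	}
	if (m->owner == self) {
		if (!recursive)
			return (EDEADLK);
		m->count++;
		return (0);
	}
	return (EBUSY);		/* the toy never blocks */
}

TOY_ALWAYS_INLINE int
toy_unlock_common(struct toy_mutex *m, int self, bool recursive)
{
	if (m->owner != self)
		return (EPERM);
	if (recursive && m->count > 1) {
		m->count--;
		return (0);
	}
	m->owner = 0;
	m->count = 0;
	return (0);
}

/* After inlining, each wrapper keeps only the branches its constant selects. */
int
toy_lock_normal(struct toy_mutex *m, int self)
{
	return (toy_lock_common(m, self, false));
}

int
toy_lock_recursive(struct toy_mutex *m, int self)
{
	return (toy_lock_common(m, self, true));
}

int
toy_unlock_normal(struct toy_mutex *m, int self)
{
	return (toy_unlock_common(m, self, false));
}

int
toy_unlock_recursive(struct toy_mutex *m, int self)
{
	return (toy_unlock_common(m, self, true));
}
```

Without the attribute the compiler may decline to inline a large common
function, in which case the flag is tested at run time on every call.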
With this patch on a stock build I see a reduction in latency of roughly 7%
for default/normal mutexes, and 17% for robust mutexes. When built without
PTHREADS_ASSERTIONS enabled I see a reduction in latency of roughly 15% and
26%, respectively. Surprisingly, I see similar reductions in latency
for heavily contended mutexes.
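
The benchmark behind these numbers is not included here. As a rough sketch of
how the uncontended case could be measured, assuming a simple timed loop
around pthread_mutex_lock()/pthread_mutex_unlock(), something like the
following would do. The helper name bench_uncontended_ns() is hypothetical,
and a serious measurement would also pin the thread, warm up, and repeat runs.

```c
#define _POSIX_C_SOURCE 200809L

#include <pthread.h>
#include <time.h>

/*
 * Hypothetical helper: average nanoseconds per lock/unlock pair on an
 * uncontended default mutex.  Returns a negative value on error.
 */
double
bench_uncontended_ns(unsigned long iters)
{
	pthread_mutex_t mtx;
	struct timespec t0, t1;
	unsigned long i;

	if (iters == 0 || pthread_mutex_init(&mtx, NULL) != 0)
		return (-1.0);

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < iters; i++) {
		pthread_mutex_lock(&mtx);
		pthread_mutex_unlock(&mtx);
	}
	clock_gettime(CLOCK_MONOTONIC, &t1);
	pthread_mutex_destroy(&mtx);

	/* Elapsed time in nanoseconds divided by the number of pairs. */
	return (((double)(t1.tv_sec - t0.tv_sec) * 1e9 +
	    (double)(t1.tv_nsec - t0.tv_nsec)) / (double)iters);
}
```

Comparing the result before and after the patch, on the same machine and
build options, gives a relative latency change of the kind quoted above.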
By default, this patch increases the size of libthr.so.3 by 2448 bytes, but
when built without PTHREAD_ASSERTIONS enabled it increases by only 448 bytes.