Page MenuHomeFreeBSD

libthr: Patch to reduce latency to acquire+release a pthread mutex
ClosedPublic

Authored by becker.greg_att.net on Jul 7 2023, 4:01 PM.
Referenced Files
F83979455: D40912.diff
Fri, May 17, 6:01 PM
Unknown Object (File)
Sun, May 12, 11:06 AM
Unknown Object (File)
Sat, May 11, 1:29 AM
Unknown Object (File)
Thu, May 9, 8:27 PM
Unknown Object (File)
Sat, Apr 20, 3:16 PM
Unknown Object (File)
Mar 12 2024, 7:50 PM
Unknown Object (File)
Mar 12 2024, 7:50 PM
Unknown Object (File)
Mar 12 2024, 7:50 PM

Details

Summary

The acquisition and release of an uncontended default/normal pthread mutex
on FreeBSD is suprisingly slow, e.g., pthread wrlocks and binary semaphores
both exhibit roughly 33% lower latency, while default/normal mutexes
on that other OS exhibit roughly 67% lower latency than FreeBSD. This
might be explained by the fact that (AFAICT) in the best case to acquire
an uncontended mutex on the other OS it need touch only 1 page and read/modify
only 1 cacheline, whereas on FreeBSD we need to touch at least 4 pages,
read 6 cachelines, and modify at least 4 cachelines.

This patch does not address the pthread mutex architecture. Instead, it
improves performance by adding the __always_inline attribute to mutex_lock_common()
and mutex_unlock_common() to encourage constant folding and propagation, thereby
lowering the latency to acquire and release a mutex due to a shorter code
path with fewer compares, jumps, and mispredicts.

With this patch on a stock build I see a reduction in latency of roughly 7%
for default/normal mutexes, and 17% for robust mutexes. When built without
PTHREADS_ASSERTIONS enabled I see a reduction in latency of roughly 15% and
26%, respectively. Suprisingly, I see similar reductions in latency
for heavily contended mutexes.

By default, this patch increases the size of libthr.so.3 by 2448 bytes, but
when built without PTHREAD_ASSERTIONS enabled it only increases by 448 bytes.

Test Plan

kyua test -k /usr/tests/lib/libthr/Kyuafile
kyua test -k /usr/tests/Kyuafile

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

This revision is now accepted and ready to land.Jul 7 2023, 4:29 PM

Please, when uploading patches to phabricator, do git diff -U99999999, to provide enough context. I will reply to the patch after you re-upload it.

Updated diff to provide full context via diff -U99999999 ...

This revision now requires review to proceed.Jul 7 2023, 4:47 PM
lib/libthr/thread/thr_private.h
172 ↗(On Diff #124292)

This changes layout of the struct, which means that shared mutexes from different versions of libthr no longer work.

At very least, m_qidx should be put at the end of the structure, I suspect. But that would make layout incompatible between 32- and 64-bit variants of the libthr. Might be, put it after m_ps?

becker.greg_att.net edited the summary of this revision. (Show Details)

I reverted all the m_qidx changes, lost a wee bit of perf across all mutex flavors, but numbers still look pretty good.

lib/libthr/thread/thr_private.h
172 ↗(On Diff #124292)

Good catch! I think at the end of the struct is correct - after m_ps would cause the offset of m_qe to be off-by-4 between ILP32 and LP64, right?

Regardless, consider that only the attributes from the process who first initializes the pshared mutex will be used. If the winning process is using the old flavor (i.e., without m_qidx), then a process that attaches to that pshared mutex using the new flavor will not see that m_qidx is properly initialized (it will probably accidentally be correct in most cases).

At this point I don't think it's worth the extra complexity to try and make this work, so I think I am going to revert all the m_qidx changes and repost the diff.

lib/libthr/thread/thr_private.h
172 ↗(On Diff #124292)

Yes, adding fields that would for for shared pthread_mutexes is somewhat tricky. Still, we sometimes need to do it. Might be, some magic number should be placed somewhere.

Note that members after m_ps are only used by thread owning the lock, locally. The members are the links among locked mutexes, and robust pointer. Members before m_ps and m_ps are 32/64bit invariant.

This revision is now accepted and ready to land.Jul 8 2023, 3:56 AM