Page MenuHomeFreeBSD

libthr: Use a small cache to greatly reduce pshared lock latency.
Needs ReviewPublic

Authored by becker.greg_att.net on Jul 19 2023, 3:53 PM.
Referenced Files
F108808385: D41092.id125133.diff
Tue, Jan 28, 4:01 AM
F108808384: D41092.id.diff
Tue, Jan 28, 4:01 AM
F108808381: D41092.id124908.diff
Tue, Jan 28, 4:01 AM
F108808380: D41092.id125305.diff
Tue, Jan 28, 4:01 AM
F108808375: D41092.id124866.diff
Tue, Jan 28, 4:01 AM
F108807942: D41092.diff
Tue, Jan 28, 3:53 AM
Unknown Object (File)
Sun, Jan 26, 6:07 PM
Unknown Object (File)
Fri, Jan 24, 5:37 PM

Details

Reviewers
kib
jhb
Summary

The latency to acquire+release a pshared mutex on FreeBSD is roughly
3.4 times that of a normal/default mutex, whereas on Linux there is
no measurable difference.

Worse, in the face of highly concurrent pshared lock lookups for
completely independent locks the latency grows with the number of
threads. For example, for 2 threads affined to different NUMA
nodes the latency grows to 16x that of a normal/default mutex,
for 4 threads it grows to 35x, and for 8 threads it grows to 66x.

The high latency is likely due to the fact that in order to perform
a lookup at least 3 pages and many cachelines need to be touched/read,
and several different cachelines need to be modified. The poor perf
under high concurrency is likely due to cacheline thrashing of the
single pshared_hash[] hash table r/w read lock.

The concurrency issued can be completely mitigated by using a
per-bucket r/w lock on the pshared_hash[] hash table, but that
requires numerous code changes, doesn't improve the latency, and
requires much more memory for a properly aligned hash table.

Instead, by implementing a small, per-thread pshared lock lookup cache
we can reduce the lookup latency by roughly 57% in the single-threaded
case, 91% for 2 threads, 97% for 4 threads, 98% for 8 threads, ...
I.e, the latency of a cache hit is roughly constant irrespective of
the number of threads performing a pshared lock lookup.

Similarly, in a test where a single thread acquires two completely
independent nested locks (an r/w read lock followed by a mutex) we see
a throughput improvement of roughly 2.3x (3.4x for 2 threads, 3.7x for
4 threads, 4.3x for 8 threads, ...)

Test Plan

kyua test -k /usr/tests/Kyuafile

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

lib/libthr/Makefile
52

We should set this to zero if the platform doesn't support 64-bit atomic operations. Is that possible to detect at build time?
Also, is it desirable to have a knob in src.conf to disable/enable this feature?

A buildworld with D41092.id124866.diff added to 239597e0309d4 doesn't compile for me:

===> lib/libthr (obj,all,install)
cc -march=i686 -mmmx -msse -msse2 -target x86_64-unknown-freebsd14.0 -m32  -DCOMPAT_LIBCOMPAT=\"32\"  -DCOMPAT_libcompat=\"32\"  -DCOMPAT_LIB32  --sysroot=/usr/obj/usr/src/amd64.amd64/tmp  -B/usr/obj/usr/src/amd64.amd64/tmp/usr/bin -B/usr/obj/usr/src/amd64.amd64/tmp/usr/lib32  -O2 -pipe -fno-common  -DPTHREAD_KERNEL -I/usr/src/lib/libc/include -I/usr/src/lib/libc/i386 -I/usr/src/lib/libthr/thread -I/usr/src/lib/libthr/arch/i386/include -I/usr/src/lib/libthr/sys -I/usr/src/libexec/rtld-elf -I/usr/src/libexec/rtld-elf/i386 -I/usr/src/lib/libthread_db -fexceptions -D_PTHREAD_FORCED_UNWIND -D_PTHREAD_PSC_SIZE=11 -D_PTHREADS_INVARIANTS -mno-mmx -mno-sse -mno-avx -mno-avx2 -MD  -MF.depend.thr_pshared.o -MTthr_pshared.o -std=gnu99 -Wno-format-zero-length -Wsystem-headers -Werror -Wall -Wno-format-y2k -W -Wno-unused-parameter -Wstrict-prototypes -Wmissing-prototypes -Wpointer-arith -Wreturn-type -Wcast-qual -Wwrite-strings -Wswitch -Wshadow -Wunused-parameter -Wcast-align -Wchar-subscripts -Wnested-externs -Wold-style-definition -Wno-pointer-sign -Wdate-time -Wmissing-variable-declarations -Wno-empty-body -Wno-string-plus-int -Wno-unused-const-variable -Wno-error=unused-but-set-parameter  -Qunused-arguments    -c /usr/src/lib/libthr/thread/thr_pshared.c -o thr_pshared.o
/usr/src/lib/libthr/thread/thr_pshared.c:137:4: error: call to undeclared function 'atomic_add_rel_64'; ISO C99 and later do not support implicit function declarations [-Werror,-Wimplicit-function-declaration]
                        PSHARED_HTGEN_INVALIDATE();
                        ^
/usr/src/lib/libthr/thread/thr_pshared.c:68:36: note: expanded from macro 'PSHARED_HTGEN_INVALIDATE'
#define PSHARED_HTGEN_INVALIDATE()      atomic_add_rel_64(&pshared_htgen, 1)
                                        ^
/usr/src/lib/libthr/thread/thr_pshared.c:137:4: note: did you mean 'atomic_add_long'?
/usr/src/lib/libthr/thread/thr_pshared.c:68:36: note: expanded from macro 'PSHARED_HTGEN_INVALIDATE'
#define PSHARED_HTGEN_INVALIDATE()      atomic_add_rel_64(&pshared_htgen, 1)
                                        ^
/usr/obj/usr/src/amd64.amd64/tmp/usr/include/i386/atomic.h:602:1: note: 'atomic_add_long' declared here
ATOMIC_ASM(add,      long,  "addl %1,%0",  "ir",  v);
^
/usr/obj/usr/src/amd64.amd64/tmp/usr/include/i386/atomic.h:105:26: note: expanded from macro 'ATOMIC_ASM'
static __inline void                                    \
                                                        ^
<scratch space>:208:1: note: expanded from here
atomic_add_long
^
/usr/src/lib/libthr/thread/thr_pshared.c:158:10: error: call to undeclared function 'atomic_load_acq_64'; ISO C99 and later do not support implicit function declarations [-Werror,-Wimplicit-function-declaration]
        htgen = atomic_load_acq_64(&pshared_htgen);
                ^
/usr/src/lib/libthr/thread/thr_pshared.c:158:10: note: did you mean 'atomic_load_acq_int'?
/usr/obj/usr/src/amd64.amd64/tmp/usr/include/i386/atomic.h:611:1: note: 'atomic_load_acq_int' declared here
ATOMIC_LOADSTORE(int);
^
/usr/obj/usr/src/amd64.amd64/tmp/usr/include/i386/atomic.h:606:2: note: expanded from macro 'ATOMIC_LOADSTORE'
        ATOMIC_LOAD(TYPE);                              \
        ^
/usr/obj/usr/src/amd64.amd64/tmp/usr/include/i386/atomic.h:253:30: note: expanded from macro 'ATOMIC_LOAD'
static __inline u_##TYPE                                        \
                                                                ^
<scratch space>:241:1: note: expanded from here
atomic_load_acq_int
^
/usr/src/lib/libthr/thread/thr_pshared.c:243:4: error: call to undeclared function 'atomic_add_rel_64'; ISO C99 and later do not support implicit function declarations [-Werror,-Wimplicit-function-declaration]
                        PSHARED_HTGEN_INVALIDATE();
                        ^
/usr/src/lib/libthr/thread/thr_pshared.c:68:36: note: expanded from macro 'PSHARED_HTGEN_INVALIDATE'
#define PSHARED_HTGEN_INVALIDATE()      atomic_add_rel_64(&pshared_htgen, 1)
                                        ^
/usr/src/lib/libthr/thread/thr_pshared.c:322:2: error: call to undeclared function 'atomic_add_rel_64'; ISO C99 and later do not support implicit function declarations [-Werror,-Wimplicit-function-declaration]
        PSHARED_HTGEN_INVALIDATE();
        ^
/usr/src/lib/libthr/thread/thr_pshared.c:68:36: note: expanded from macro 'PSHARED_HTGEN_INVALIDATE'
#define PSHARED_HTGEN_INVALIDATE()      atomic_add_rel_64(&pshared_htgen, 1)
                                        ^
4 errors generated.
*** Error code 1

Stop.
make[4]: stopped in /usr/src/lib/libthr
*** Error code 1

Stop.
make[3]: stopped in /usr/src
*** Error code 1

Stop.
make[2]: stopped in /usr/src
       36.90 real        34.30 user         2.47 sys
*** Error code 1

Stop.
make[1]: stopped in /usr/src
*** Error code 1

Stop.
make: stopped in /usr/src
19:56 /usr/src $
brooks added inline comments.
lib/libthr/Makefile
52

I don't think we have a way to detect this other than enumerating architectures. A somewhat overreaching method would be:

.if ${MACHINE_ABI:Mlong32}
# assume no 64-bit atomics
.else
...

We could also add a new MACHINE_ABI value for atomic64.

Use MACHINE_ABI to build only for 64-bit architectures.
Replace uint64_t with u_long for pshared hash table generation counts.

Doesn't it make sense to fill curthread cache on key/val creation in pshared_insert? It is highly likely that the same thread would use the lock obj.

lib/libthr/thread/thr_pshared.c
203

Why do you need acquire semantic for reading of pshared_htgen? What are happen-before relations you want to establish there? (Same question for release on pshared_htgen update).

lib/libthr/thread/thr_pshared.c
203

Sigh, I went back and forth on this, and your question made me have to revisit the logic. I will be pushing an update soon that addresses this question in comments in the code.

Fix how generation count is updated and synchronized with readers.
Add comments to explain how it works.

lib/libthr/Makefile
52

Thanks Brooks, I just used the affirmative case of long64, since undefined macros are zero by default.
I've cross-compiled for i386, amd64, powerpc, powerpc64, aarch64, and arm7 and it seems to build correctly.

lib/libthr/thread/thr_pshared.c
86

I would appreciate someone taking a hard look at my comment and code here. My gut sense is that we only need the call to atomic_add_rel_long() in order to be sequentially consistent with other threads calling atomic_load_acq_long() on pshared_htgen.

But then reading the paragraph regarding "Release-Aquire Ordering" from **https://en.cppreference.com/w/c/atomic/memory_order** made me doubt my thinking and I so came up with this monstrosity because I'm no longer 100% certain when the change to pshared_htgen would be visible (I used to think it would be before the release barrier).

Unfortunately, I don't have a weak memory machine on which to stress this code to find problems. It seems to work just fine without the atomic_add_long() on amd64.

Thoughts?

lib/libthr/thread/thr_pshared.c
86

I do not understand this still. Why doing increment twice to invalidate?

More important, there is nothing that ensures that our thread sees pshared_htgen bump before doing the local access to the cache. We might simply miss it, and the bump becomes visible right after lookup code read pshared_htgen. You cannot sync these two events without some sort of blocking.

What you might try to do is to re-check that cache is still valid after the lookup is done, i.e. that pshared_htgen did not changed. This would also require fences around internal lookup access to cache to ensure that loads from htgen are ordered with cache.

lib/libthr/thread/thr_pshared.c
86

Yeah, I'm just trying to get something for free. Unfortunately, my current solution is unable leverage the "promise" noted in the last sentence of the following:

If an atomic store in thread A is tagged memory_order_release, an atomic load in thread B from the same variable is tagged memory_order_acquire, and the load in thread B reads a value written by the store in thread A, then the store in thread A synchronizes-with the load in thread B.

All memory writes (including non-atomic and relaxed atomic) that happened-before the atomic store from the point of view of thread A, become visible side-effects in thread B. That is, once the atomic load is completed, thread B is guaranteed to see everything thread A wrote to memory. This promise only holds if B actually returns the value that A stored, or a value from later in the release sequence.

Lemme try out your idea, that looks good, thanks! I have another, different idea, but it's heavier weight, so if I can get your idea implemented correctly then I think it will perform much better and be on-par with my current, albeit racy solution.

lib/libthr/thread/thr_pshared.c
86

I realized that what I described is half-done (and half-functional) seq lock. Take a look at sys/seqc.h. It should be easy to adopt the approach to the userspace. In all places where kernel does spinwait, thr_pshared.c would probably need invalidate the cache instead.

Updated to quasi-follow seqlock semantics from seqc.h for continued discussion...

lib/libthr/thread/thr_pshared.c
86

Thanks Konstantin, I've updated the patch to follow what I see in seqc.h, although pshared_lookup() is a bit of a mashup of seqc_read() and seqc_consistent(). Hopefully I didn't miss anything, curious to see what you think...

Refactor to more closely follow seqc.h.

I tried to organize my thought about this cache proposal, and I do think that I managed to understand it better now.

First, current 'locking' schemas are centered about an idea of invalidating cache on lock object destroy, i.e. not returning the cached answer if the key was removed. IMO this is too bold, and probably not achievable with unlocked code, since you could always read a stale value. More, this case is not too interesting, because in this case it means that application tries to use destroyed lock object, which is an application but.

Second, I now think that the (only) real problem there is ABA: application creates lock for key K and mapping at V1, then destroys it, then initializes another lock object with the key K and a new mapping at V2. If cache could return K/V1 after V2 is instantiated, we are in trouble. Of course, ensuring that thread-local caches forget about K right on K/V1 deletion would solve V1->V2 problem as well, but I do not see how make it work.

I think I have more or less workable scheme to make look-aside cache working.

First, make the cache global instead of thread-local. Second, store (key, val) into the cache cell. This requires something like cmpxchg16b/cmpxchg8 on x86, or ll/cs of double-word width on other arches, which should have it (if not, disable the cache). Store the (key, val) on shared lock creation or first lookup, delete is optional in fact. Read from cache should read whole cell (key, val) atomically. Again, on x86 double-CAS is good, other arches should have a way.

The key observation is that, if app does a lookup by key, then it must ensure by some means that the key is alive, we do not care how it ensured. Then, whatever we looked up from the global cache, is either valid, or it is app error to access it.