I used a microbenchmark that locks and unlocks a mutex one billion times.
On a Ryzen 7950X3D, with libthr not loaded, the runtime decreases from
4.53s to 4.15s (about 8%). When libthr is loaded, either at program start
or by dlopen(), the runtime is unchanged at about 10.2s.
Without the change, a typical stub looks like this:
```
Dump of assembler code for function pthread_mutex_lock_exp:
0x000000000009b530 <+0>: push %rbp
0x000000000009b531 <+1>: mov %rsp,%rbp
0x000000000009b534 <+4>: mov 0x14ddc5(%rip),%rax # 0x1e9300
0x000000000009b53b <+11>: pop %rbp
0x000000000009b53c <+12>: jmp *0x2a0(%rax)
```
whereas with the change it becomes:
```
Dump of assembler code for function pthread_mutex_lock_exp:
0x000000000009b380 <+0>: push %rbp
0x000000000009b381 <+1>: mov %rsp,%rbp
0x000000000009b384 <+4>: pop %rbp
0x000000000009b385 <+5>: jmp *0x14ecd5(%rip) # 0x1ea060 <__thr_jtable+672>
```
(The frame pointer manipulations look a bit silly now.) I see a similar
change on arm64. Before:
```
Dump of assembler code for function pthread_mutex_lock_exp:
0x00000000000a8e5c <+0>: adrp x8, 0x1f1000
0x00000000000a8e60 <+4>: ldr x8, [x8, #3240]
0x00000000000a8e64 <+8>: ldr x1, [x8, #672]
0x00000000000a8e68 <+12>: br x1
```
After:
```
Dump of assembler code for function pthread_mutex_lock_exp:
0x00000000000a8dc4 <+0>: adrp x8, 0x202000
0x00000000000a8dc8 <+4>: ldr x1, [x8, #1296]
0x00000000000a8dcc <+8>: br x1
```
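
The difference in the dumps above is one fewer dependent load before the indirect jump. A rough C model of the two shapes follows. This is a sketch, not the libc source: `jtable`, `jtable_ptr`, `fake_mutex_lock`, and the flat table layout are hypothetical, and only the 672-byte slot offset is taken from the disassembly.

```c
#include <stddef.h>

typedef int (*stub_fn)(void *);

/* Byte offset of the slot seen in the disassembly (0x2a0 == 672). */
#define	SLOT_OFFSET	672
#define	SLOT		(SLOT_OFFSET / sizeof(stub_fn))
#define	NSLOTS		256

/* Hypothetical stand-in for the function the slot points at. */
static int
fake_mutex_lock(void *m)
{
	(void)m;
	return (42);
}

/* Hypothetical flat jump table; the real table layout may differ. */
static stub_fn jtable[NSLOTS] = { [SLOT] = fake_mutex_lock };

/*
 * Models the "before" case: the table address is itself fetched from
 * memory.  The volatile qualifier keeps the compiler from folding the
 * extra load away in this toy example.
 */
static stub_fn *volatile jtable_ptr = jtable;

static int
stub_before(void *m)
{
	/* Two dependent loads: table address, then the slot. */
	return (jtable_ptr[SLOT](m));
}

static int
stub_after(void *m)
{
	/* One load: the slot is addressed relative to the table symbol. */
	return (jtable[SLOT](m));
}
```

Both functions call the same target; only the addressing differs. An optimizing compiler may simplify `stub_after` further than the real stub can be, since this table is static and never modified.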