Change Details

I used a microbenchmark which locks and unlocks a mutex one billion times. On a Ryzen 7950X3D the runtime decreases, from 4.53s to 4.15s. When libthr is loaded, either at program start time or by dlopen(), there's no change (the runtime is about 10.2s).

Without the change, a typical stub looks like this:

```
Dump of assembler code for function pthread_mutex_lock_exp:
   0x000000000009b530 <+0>:	push   %rbp
   0x000000000009b531 <+1>:	mov    %rsp,%rbp
   0x000000000009b534 <+4>:	mov    0x14ddc5(%rip),%rax        # 0x1e9300
   0x000000000009b53b <+11>:	pop    %rbp
   0x000000000009b53c <+12>:	jmp    *0x2a0(%rax)
```

whereas with the change it becomes:

```
Dump of assembler code for function pthread_mutex_lock_exp:
   0x000000000009b380 <+0>:	push   %rbp
   0x000000000009b381 <+1>:	mov    %rsp,%rbp
   0x000000000009b384 <+4>:	pop    %rbp
   0x000000000009b385 <+5>:	jmp    *0x14ecd5(%rip)        # 0x1ea060 <__thr_jtable+672>
```

(and the frame pointer manipulations look a bit silly now.)

I see a similar change on arm64:

```
Dump of assembler code for function pthread_mutex_lock_exp:
   0x00000000000a8e5c <+0>:	adrp   x8, 0x1f1000
   0x00000000000a8e60 <+4>:	ldr    x8, [x8, #3240]
   0x00000000000a8e64 <+8>:	ldr    x1, [x8, #672]
   0x00000000000a8e68 <+12>:	br     x1
```

after:

```
Dump of assembler code for function pthread_mutex_lock_exp:
   0x00000000000a8dc4 <+0>:	adrp   x8, 0x202000
   0x00000000000a8dc8 <+4>:	ldr    x1, [x8, #1296]
   0x00000000000a8dcc <+8>:	br     x1
```