A new syscall is added which registers a uint32_t variable as containing the count of blocks for signal delivery. Its content is read by the kernel on each syscall entry and on AST processing; a non-zero count of blocks is interpreted the same as a signal mask blocking all signals.
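The counting rule above can be illustrated with a toy user-space model. This is only a sketch: every name in it (block_count, model_block, model_kernel_check and so on) is hypothetical and is not the kernel code or the actual syscall interface. It only shows that nested blocks are plain increments and that delivery is held back while the count is non-zero.

```c
#include <assert.h>
#include <stdint.h>

/* Toy model only; all names are hypothetical, not the FreeBSD API. */
static uint32_t block_count;    /* stands in for the registered uint32_t */
static int pending_signals;     /* signals posted but not yet delivered */
static int delivered_signals;   /* signals handed to the process */

/* Entering a critical section is a plain memory increment, no syscall. */
static void model_block(void)
{
    block_count++;
}

/* Leaving a critical section; the count must never underflow. */
static void model_unblock(void)
{
    assert(block_count > 0);
    block_count--;
}

/* A signal arrives for the process. */
static void model_post_signal(void)
{
    pending_signals++;
}

/*
 * Models the check described for syscall entry and AST processing:
 * a non-zero count behaves like a mask that blocks every signal.
 */
static void model_kernel_check(void)
{
    if (block_count != 0)
        return;                 /* treated as "all signals blocked" */
    delivered_signals += pending_signals;
    pending_signals = 0;
}
```

In this model a signal posted while two nested blocks are active stays pending through the first model_unblock() and is delivered only by the first model_kernel_check() that runs after the count returns to zero.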
The main purpose of the fast sigblock feature is to allow rtld to avoid issuing two sigprocmask(2) syscalls for each symbol binding operation in single-threaded processes. Rtld needs to block signals as part of its locking to ensure signal safety of the bind process, because signal handlers might need to lazily resolve symbol references. For multi-threaded processes, libthr intercepts all signal handlers, which allows it to delay signal delivery by manually raising the delivered signal upon unlock. This is not ideal, and I did not want to make libc intercept signal handlers too.
There is some rudimentary use of the fast sigblock in libthr, but it is not related to the critical sections, which still use signal raising on exit. This is because critical sections have to handle more issues than just signals.
Micro-benchmarks do not show a slow-down of syscalls, and macro-benchmarks like buildworld do not demonstrate a difference. Part of the reason is that buildworld time is dominated by the compiler, and clang already links to libthr. On the other hand, small utilities typically used by shell scripts have their total number of syscalls cut almost in half (the truss output for the ./hello test program below shrinks from 63 to 37 lines).
The biggest downside of the feature that I see is that memory corruption affecting the registered fast sigblock location would cause quite strange application misbehavior. For instance, the process would be immune to ^C (but still killable by SIGKILL).
Benchmark
stock

    root@r-freeb43:/data/work # ( truss ./hello > /dev/null ) |& wc -l
    63
    root@r-freeb43:/data/work # ./syscall_timing getppid
    Clock resolution: 0.000000001
    test     loop  time         iterations  periteration
    getppid  0     1.016000291  2897051     0.000000350
    getppid  1     1.009834382  2877363     0.000000350
    getppid  2     1.000017563  2848788     0.000000351
    getppid  3     1.027104944  2927389     0.000000350
    getppid  4     1.017986489  2901363     0.000000350
    getppid  5     1.010239487  2879487     0.000000350
    getppid  6     1.004737854  2861549     0.000000351
    getppid  7     1.016996867  2902481     0.000000350
    getppid  8     1.000985560  2854247     0.000000350
    getppid  9     1.023981476  2921559     0.000000350

    buildworld -s -j 32
    2950.74 real  70371.54 user  2495.92 sys
    3033.48 real  70085.36 user  2482.85 sys
    2985.57 real  70240.54 user  2495.64 sys
    2927.52 real  70204.11 user  2486.19 sys
    3007.15 real  70140.09 user  2489.78 sys

fast sigblock

    root@r-freeb43:/tmp # ( truss ./hello > /dev/null ) |& wc -l
    37
    root@r-freeb43:/tmp # ./syscall_timing getppid
    Clock resolution: 0.000000001
    test     loop  time         iterations  periteration
    getppid  0     1.000028797  2888175     0.000000346
    getppid  1     1.050218490  3031970     0.000000346
    getppid  2     1.027986764  2967578     0.000000346
    getppid  3     1.046828208  3021258     0.000000346
    getppid  4     1.000027060  2886588     0.000000346
    getppid  5     1.027107249  2964878     0.000000346
    getppid  6     1.021487553  2950501     0.000000346
    getppid  7     1.001574229  2890893     0.000000346
    getppid  8     1.009152278  2914871     0.000000346
    getppid  9     1.042734749  3012841     0.000000346

    buildworld -s -j 32
    2998.44 real  70147.80 user  2478.18 sys
    2935.23 real  70159.03 user  2490.75 sys
    2961.46 real  70072.53 user  2501.04 sys
    2939.97 real  70115.35 user  2504.80 sys
    2978.95 real  70006.44 user  2509.96 sys