Adds a SIMD-enhanced strlen for AArch64. It takes inspiration from
the amd64 implementation, but I struggled to get the performance
I had hoped for on cores like those in the Graviton3 when compared
to the existing implementation from Arm Optimized Routines.
Benchmark results are also included for a simple SIMD variant that
loads 16 bytes at a time and checks them with a simple cmeq, shrn,
fcmp loop.
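
For readers unfamiliar with the cmeq/shrn idiom, the simple 16-byte
loop can be sketched in C with NEON intrinsics. This is a hypothetical
illustration, not the assembly in this change: the name
strlen_simd_sketch, the scalar alignment prologue, and the non-AArch64
fallback are all assumptions made for the example. It also assumes a
little-endian target, and a compiler will typically move the mask to a
general-purpose register and test it there instead of using fcmp.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/*
 * Hypothetical sketch of a 16-bytes-per-iteration strlen.
 * It reads whole aligned 16-byte chunks, so it can read past the
 * terminator inside the final chunk (never across a page boundary).
 * That is fine in assembly but outside what ISO C guarantees.
 */
static size_t
strlen_simd_sketch(const char *s)
{
	const char *p = s;

	assert(s != NULL);

	/* Scalar prologue: advance until p is 16-byte aligned. */
	while (((uintptr_t)p & 15) != 0) {
		if (*p == '\0')
			return ((size_t)(p - s));
		p++;
	}

#if defined(__aarch64__)
	for (;;) {
		uint8x16_t chunk = vld1q_u8((const uint8_t *)p);
		/* cmeq: 0xff in every byte lane that holds NUL. */
		uint8x16_t eq = vceqzq_u8(chunk);
		/* shrn #4: narrow to 64 bits, one nibble per input byte. */
		uint8x8_t nib = vshrn_n_u16(vreinterpretq_u16_u8(eq), 4);
		uint64_t mask = vget_lane_u64(vreinterpret_u64_u8(nib), 0);

		/* Nonzero mask: the lowest set nibble marks the first NUL. */
		if (mask != 0)
			return ((size_t)(p - s) +
			    (size_t)(__builtin_ctzll(mask) >> 2));
		p += 16;
	}
#else
	/* Fallback so the sketch still builds on other architectures. */
	return ((size_t)(p - s) + strlen(p));
#endif
}
```

The shrn step compresses the 128-bit compare result into a 64-bit
value with four bits per input byte, so a single zero test covers the
whole chunk and a trailing-zero count divided by four gives the byte
offset of the terminator.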

os: FreeBSD
arch: arm64
cpu: ARM Neoverse-V1 r1p1
        │ strlen_ARM  │             strlen_SIMD              │          strlen_SIMD_UNROLL          │
        │   sec/op    │    sec/op     vs base                │    sec/op     vs base                │
Short     93.82µ ± 1%   100.94µ ± 2%   +7.59% (p=0.000 n=20)   105.13µ ± 0%  +12.05% (p=0.000 n=20)
Mid       23.56µ ± 6%    23.01µ ± 0%   -2.36% (p=0.000 n=20)    23.79µ ± 0%        ~ (p=0.602 n=20)
Long      3.065µ ± 0%    3.541µ ± 0%  +15.54% (p=0.000 n=20)    3.172µ ± 0%   +3.50% (p=0.000 n=20)
geomean   18.92µ         20.18µ        +6.67%                   19.94µ        +5.39%

        │  strlen_ARM  │             strlen_SIMD              │          strlen_SIMD_UNROLL          │
        │     B/s      │     B/s       vs base                │     B/s       vs base                │
Short     1.241Gi ± 1%   1.153Gi ± 2%   -7.06% (p=0.000 n=20)   1.107Gi ± 0%  -10.75% (p=0.000 n=20)
Mid       4.940Gi ± 6%   5.060Gi ± 0%   +2.42% (p=0.000 n=20)   4.894Gi ± 0%        ~ (p=0.602 n=20)
Long      37.99Gi ± 0%   32.88Gi ± 0%  -13.45% (p=0.000 n=20)   36.70Gi ± 0%   -3.38% (p=0.000 n=20)
geomean   6.152Gi        5.768Gi        -6.25%                  5.837Gi        -5.12%

os: FreeBSD
arch: arm64
cpu: ARM Cortex-A76 r4p1
        │ strlen_ARM  │             strlen_SIMD             │         strlen_SIMD_UNROLL          │
        │   sec/op    │   sec/op     vs base                │   sec/op     vs base                │
Short     134.5µ ± 0%   119.2µ ± 0%  -11.36% (p=0.000 n=20)   115.3µ ± 0%  -14.29% (p=0.000 n=20)
Mid       37.10µ ± 0%   29.60µ ± 0%  -20.23% (p=0.000 n=20)   33.29µ ± 1%  -10.27% (p=0.000 n=20)
Long      4.442µ ± 0%   5.661µ ± 0%  +27.44% (p=0.000 n=20)   4.267µ ± 0%   -3.94% (p=0.000 n=20)
geomean   28.09µ        27.13µ        -3.41%                  25.39µ        -9.60%

        │  strlen_ARM  │              strlen_SIMD              │          strlen_SIMD_UNROLL           │
        │     B/s      │      B/s       vs base                │      B/s       vs base                │
Short     886.2Mi ± 0%    999.8Mi ± 0%  +12.82% (p=0.000 n=20)   1034.0Mi ± 0%  +16.68% (p=0.000 n=20)
Mid       3.138Gi ± 0%    3.933Gi ± 0%  +25.37% (p=0.000 n=20)    3.497Gi ± 1%  +11.45% (p=0.000 n=20)
Long      26.21Gi ± 0%    20.57Gi ± 0%  -21.53% (p=0.000 n=20)    27.28Gi ± 0%   +4.10% (p=0.000 n=20)
geomean   4.144Gi         4.291Gi        +3.54%                   4.584Gi       +10.62%