The baseline implementation is very straightforward,
while the scalar implementation suffers from register pressure
and the need to use SWAR techniques similar to those used for
strchr().
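
To illustrate the SWAR (SIMD within a register) idea mentioned above, here is a
minimal C sketch. It is not the code from this commit. All names
(swar_strrchr, load_le64, zero_bytes, last_byte) and the cap parameter are
hypothetical. To stay within ISO C, the sketch assumes the caller passes a
buffer whose capacity is a multiple of 8 bytes and which holds a NUL-terminated
string; a real libc routine cannot make that assumption and has to handle
alignment itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ONES 0x0101010101010101ULL
#define LOW7 0x7f7f7f7f7f7f7f7fULL

/* Assemble 8 bytes into a word with p[0] in the lowest byte (endian-neutral). */
static uint64_t load_le64(const unsigned char *p)
{
    uint64_t w = 0;
    for (int i = 0; i < 8; i++)
        w |= (uint64_t)p[i] << (8 * i);
    return w;
}

/* Exact zero-byte mask: 0x80 in every byte of x that is zero, 0 elsewhere. */
static uint64_t zero_bytes(uint64_t x)
{
    return ~(((x & LOW7) + LOW7) | x | LOW7);
}

/* Index (0..7) of the highest byte that has a bit set in mask m (m != 0). */
static unsigned last_byte(uint64_t m)
{
    unsigned i = 0;
    while (m >>= 8)
        i++;
    return i;
}

/*
 * Hypothetical SWAR strrchr over a padded buffer.
 * Assumption: cap is a multiple of 8 and buf[0..cap) contains a NUL.
 */
static const char *swar_strrchr(const char *buf, size_t cap, int c)
{
    assert(buf != NULL && cap % 8 == 0);

    const unsigned char *p = (const unsigned char *)buf;
    uint64_t pattern = ONES * (unsigned char)c;  /* c broadcast to all bytes */
    const char *last = NULL;

    for (size_t off = 0; off + 8 <= cap; off += 8) {
        uint64_t w = load_le64(p + off);
        uint64_t nul = zero_bytes(w);            /* where the string ends */
        uint64_t hit = zero_bytes(w ^ pattern);  /* where c occurs */

        if (nul != 0)
            hit &= nul ^ (nul - 1);  /* drop matches past the first NUL */
        if (hit != 0)
            last = buf + off + last_byte(hit);
        if (nul != 0)
            break;
    }
    return last;
}
```

The loop has to keep the word, the broadcast pattern, both masks, and the most
recent match live at the same time, which hints at where the register pressure
comes from. A tuned version would use a bit-scan instruction instead of the
loop in last_byte().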
Performance is acceptable: slower than glibc, but glibc gets to use AVX-512,
which this implementation does not. Results:

os: FreeBSD
arch: amd64
cpu: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
        │ strrchr.pre.out │         strrchr.scalar.out          │        strrchr.baseline.out         │
        │     sec/op      │    sec/op     vs base               │    sec/op     vs base               │
Short        111.51µ ± 0%    82.39µ ± 1%  -26.11% (p=0.000 n=20)   45.19µ ± 0%  -59.48% (p=0.000 n=20)
Mid           66.19µ ± 0%    23.44µ ± 0%  -64.59% (p=0.000 n=20)   10.59µ ± 0%  -84.00% (p=0.000 n=20)
Long         51.422µ ± 0%   15.932µ ± 0%  -69.02% (p=0.000 n=20)   5.972µ ± 0%  -88.39% (p=0.000 n=20)
geomean       72.40µ         31.33µ       -56.72%                  14.19µ       -80.40%

        │ strrchr.pre.out │          strrchr.scalar.out          │         strrchr.baseline.out          │
        │       B/s       │     B/s       vs base                │      B/s       vs base                │
Short        1.044Gi ± 0%    1.413Gi ± 1%   +35.34% (p=0.000 n=20)    2.576Gi ± 0%  +146.76% (p=0.000 n=20)
Mid          1.759Gi ± 0%    4.967Gi ± 0%  +182.42% (p=0.000 n=20)   10.996Gi ± 0%  +525.18% (p=0.000 n=20)
Long         2.264Gi ± 0%    7.307Gi ± 0%  +222.76% (p=0.000 n=20)   19.493Gi ± 0%  +761.03% (p=0.000 n=20)
geomean      1.608Gi         3.715Gi       +131.07%                   8.204Gi       +410.23%

os: Linux
arch: x86_64
cpu:
        │ strrchr.glibc.out │
        │      sec/op       │
Short          28.91µ ± 2%
Mid            8.588µ ± 0%
Long           2.113µ ± 0%
geomean        8.064µ

        │ strrchr.glibc.out │
        │        B/s        │
Short         4.027Gi ± 2%
Mid           13.56Gi ± 0%
Long          55.10Gi ± 0%
geomean       14.44Gi

Sponsored by: The FreeBSD Foundation