amd64: implement strlen in assembly
Tested with glibc test suite and a custom test which can be found in the
The C variant in libkern performs excessive branching to find the
zero byte instead of using the bsfq instruction. The same code
patched to use it is still slower than the routine implemented here,
as the compiler fails to perform certain optimizations (like
using leaq).
On top of that, the routine is a starting point for copyinstr
which operates on words instead of bytes.
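
For illustration only, a rough C sketch of the word-at-a-time idea
follows. It is not the assembly routine from this commit. The name
strlen_words and the two constants are hypothetical, the sketch assumes
a little-endian 64-bit target, and it relies on the GCC/Clang builtin
__builtin_ctzll, which stands in for bsfq. Like other word-sized
scanners, it may read up to 7 bytes past the terminator within the same
aligned word, which is outside what strict ISO C permits.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical constants: 0x01 and 0x80 repeated in every byte. */
#define	ONES	0x0101010101010101ULL
#define	HIGHS	0x8080808080808080ULL

/* Hypothetical name; a sketch, not the routine added by this commit. */
static size_t
strlen_words(const char *str)
{
	const char *p = str;
	uint64_t word, mask;

	/* Scan byte by byte until p is 8-byte aligned. */
	while (((uintptr_t)p & 7) != 0) {
		if (*p == '\0')
			return ((size_t)(p - str));
		p++;
	}

	for (;;) {
		/* Aligned 8-byte load; may cover bytes past the NUL. */
		memcpy(&word, p, sizeof(word));
		/* The lowest set bit marks the first zero byte in the word. */
		mask = (word - ONES) & ~word & HIGHS;
		if (mask != 0) {
			/* Bit index divided by 8 gives the byte index. */
			return ((size_t)(p - str) +
			    ((size_t)__builtin_ctzll(mask) >> 3));
		}
		p += sizeof(word);
	}
}
```

A single count-trailing-zeros on the mask replaces the per-byte
comparisons a naive loop would need to locate the terminator inside
the word.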
The previous attempt had an instance of swapped operands to
andq when dealing with the fully aligned case, which had a side
effect of breaking the code for certain corner cases. Noted by jrtc27.
$(perl -e "print 'A' x 3"):
$(perl -e "print 'A' x 100"):