The current routines are extremely crude and can trivially get the same benefit the kernel variants (pre-ERMS) got in https://svnweb.freebsd.org/base?view=revision&revision=334537
The overhead of the forward jump, in the cases where it is taken, is pretty much negligible compared to the cost of spinning up the extra rep.
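As a rough illustration only, here is a C sketch of the trade-off. It assumes the change is skipping the trailing byte-wise copy when the length is a multiple of 8. The function name copy_fwd_sketch is hypothetical, and the real routines are amd64 assembly rather than C.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical C rendering of the idea, not the committed code.
 * The word loop stands in for "rep movsq" and the byte loop for
 * "rep movsb"; the early return models the forward jump that skips
 * the second rep when there is no tail to copy.
 */
static void *
copy_fwd_sketch(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;
	size_t words = len >> 3;	/* number of whole 8-byte words */
	size_t tail = len & 7;		/* leftover bytes, 0..7 */

	for (; words != 0; words--) {
		uint64_t w;

		/* memcpy keeps the sketch free of alignment assumptions. */
		memcpy(&w, s, sizeof(w));
		memcpy(d, &w, sizeof(w));
		s += sizeof(w);
		d += sizeof(w);
	}

	/* Assumed fast path: no tail, so skip the byte copy entirely. */
	if (tail == 0)
		return (dst);

	for (; tail != 0; tail--)
		*d++ = *s++;
	return (dst);
}
```

The only cost added by the check is one compare and branch, which is the "forward jump" weighed against an unconditional second rep.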
Note that decent versions of these routines would be very different from the kernel ones, as they could make use of SIMD. Either way, there is no point trying to move them all into a shared file.
memset requires a bit more work and will be dealt with separately.