amd64: tidy up kernel memmove, take 2
There is no need to use %rax for temporary values and avoiding doing
so shortens the func.
Handle the explicit 'check for tail' depessimisization for backwards copying.
This reduces the diff against userspace.
Tested with the glibc test suite.
Approved by: re (kib)