amd64: sync up libc memset with the kernel version
- tidy up memset to have rax set earlier for small sizes
- finish the tail in memset with an overlapping store
- align memset buffers to 16 bytes before using rep stos
Sponsored by: The FreeBSD Foundation