amd64: move memcmp checks upfront
This is a tradeoff which saves jumps for smaller sizes while making
the 8-16 range slower (roughly in line with the other cases).
Tested with glibc test suite.
For example size 3 (most common with vfs namecache) (ops/s):
before: 407086026
after: 461391995
The regressed range of 8-16 (with 8 as example):
before: 540850489
after: 461671032