The C variant is very slow as it compiles to one byte comparison per loop iteration. rep cmpsb (used by memcmp in libc) turns out to have very bad throughput as well.
The variant below contains an unrolled loop for single-byte comparisons and a dedicated 32-byte loop. It significantly outperforms rep cmps even for bigger sizes (e.g. 1024 bytes).
Depending on size, this is about 3-4 times faster than the current routine.
This patch covers the kernel only; libc will follow later.
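
For readers who want the general shape of the approach, here is a rough C sketch. It is not the code from this patch. It only illustrates the idea described above: a dedicated loop that handles 32 bytes per iteration, followed by single-byte comparisons for whatever is left. The names sketch_memcmp and cmp_bytes are made up for illustration, and the byte loop is left for the compiler to unroll rather than being unrolled by hand.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: compare n bytes one at a time and return the
 * difference of the first mismatching pair, or 0 if all bytes match.
 */
static int
cmp_bytes(const unsigned char *a, const unsigned char *b, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (a[i] != b[i])
			return (a[i] - b[i]);
	}
	return (0);
}

/*
 * Illustrative memcmp: a 32-byte block loop plus a byte-wise tail.
 * This is an assumption about the overall structure, not the patch itself.
 */
int
sketch_memcmp(const void *s1, const void *s2, size_t len)
{
	const unsigned char *a = s1;
	const unsigned char *b = s2;

	/* Dedicated 32-byte loop: four 8-byte words from each buffer. */
	while (len >= 32) {
		uint64_t x[4], y[4];

		/* memcpy keeps the loads safe for unaligned pointers. */
		memcpy(x, a, sizeof(x));
		memcpy(y, b, sizeof(y));
		if (((x[0] ^ y[0]) | (x[1] ^ y[1]) |
		    (x[2] ^ y[2]) | (x[3] ^ y[3])) != 0) {
			/* Some byte in this block differs; locate it. */
			return (cmp_bytes(a, b, 32));
		}
		a += 32;
		b += 32;
		len -= 32;
	}

	/* Fewer than 32 bytes remain: single-byte comparisons. */
	return (cmp_bytes(a, b, len));
}
```

The block loop only answers "does anything in these 32 bytes differ?" and falls back to the byte loop to find the exact position, which keeps the common all-equal path short.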