Clang 6.0.1, -O2 -mno-sse. Neither -funroll-loops nor -O3 makes any difference. Without -mno-sse, Clang unrolls to two 128-bit xmm register loads and a single xor. However, we build the kernel and libc with -mno-sse for obvious reasons, and vectorization to 64-bit registers is still valuable, especially compared to the following trash code.