it unrolls all loops super naively. With uint64_t the result is adequate (similar to GCC with -fpeel-loops but without -funroll-loops), but the uint8_t version is fully unrolled into 165 bytes of code (16 individual mov/mov/xor sequences). The output is the same at -O3 and with -funroll-loops. It may be better with Clang 7, which is in head, but I have not compiled it yet. (And I can't seem to access pkg right now.)
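For anyone who wants to reproduce the comparison, here is a minimal sketch of the kind of loop pair being discussed. The function names `xor_u8` and `xor_u64` and the 16-byte block size are assumptions for illustration (the size is inferred from the "16 individual" byte operations above); they are not the original code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in: XOR a 16-byte block one byte at a time.
 * This is the shape that gets unrolled into 16 separate byte operations. */
void xor_u8(uint8_t *dst, const uint8_t *src)
{
    for (size_t i = 0; i < 16; i++)
        dst[i] ^= src[i];
}

/* Hypothetical stand-in: the same 16 bytes handled as two 64-bit words,
 * so even a naive full unroll yields only two XORs. */
void xor_u64(uint64_t *dst, const uint64_t *src)
{
    for (size_t i = 0; i < 2; i++)
        dst[i] ^= src[i];
}
```

Compiling this with `-O2 -S` (or `-O3`, or adding `-funroll-loops`) and comparing the assembly for the two functions should show whether a given compiler version vectorizes the byte loop or just unrolls it.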