While using cmp to check a disk for errors, I noticed that it was pretty slow.
The first problem is that it does small reads on the disk device file, which in turn issue small read transactions to the drive. Throughput was 50MB/s, while dd on the same disk with 128KB blocks yields 179MB/s.
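For reference, the dd number above comes from reading the device into /dev/null, something like:

dd if=/dev/ada1p1 of=/dev/null bs=128k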
Piping it through dd instead made it faster:
dd if=/dev/ada1p1 bs=128k | cmp - /dev/zero
But throughput was only 95MB/s and for some reason dd was stuck at 75% CPU. Oddly enough, this command was faster:
cmp <(dd if=/dev/ada1p1 bs=128k) /dev/zero
Then throughput was 139MB/s and the CPU was maxed out.
Turns out cmp has an mmap-based optimization that only applies to regular files; special files, pipes, etc. all fall back to a naive stdio-based implementation.
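That fallback is essentially a byte-at-a-time getc() loop, something like this (simplified; the real code also tracks the byte offset and line number it reports on a mismatch):

#include <stdio.h>

/* Simplified shape of the stdio fallback: one getc() call, with all
 * of its buffering bookkeeping, per byte of each input. */
static int
cmp_stdio(FILE *fp1, FILE *fp2)
{
    int ch1, ch2;

    for (;;) {
        ch1 = getc(fp1);
        ch2 = getc(fp2);
        if (ch1 != ch2)
            return (1);     /* differ (or one stream hit EOF first) */
        if (ch1 == EOF)
            return (0);     /* both ended, identical */
    }
}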
So I changed it to use read() syscalls directly instead (with MAXBSIZE-sized reads, like wc(1) does), which makes it issue larger read requests when used on a disk device file (and also use less CPU time). Then I added an opportunistic memcmp() and optimized the line counting for both the regular and special file cases; both now use a common routine to compare chunks.
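Roughly, the chunked comparison looks like this (a simplified sketch, not the actual patch: error handling, the exact differing offset, and the line counting are left out):

#include <sys/param.h>  /* MAXBSIZE */
#include <string.h>
#include <unistd.h>

/*
 * Fill the buffer completely unless EOF is hit, so short reads on
 * pipes don't desynchronize the two streams.  Read errors are treated
 * as EOF here for brevity.
 */
static ssize_t
fillbuf(int fd, unsigned char *buf, size_t len)
{
    size_t total = 0;
    ssize_t n;

    while (total < len && (n = read(fd, buf + total, len - total)) > 0)
        total += (size_t)n;
    return ((ssize_t)total);
}

/* Return 0 if the two streams are identical, 1 otherwise. */
static int
cmp_fds(int fd1, int fd2)
{
    static unsigned char b1[MAXBSIZE], b2[MAXBSIZE];
    ssize_t n1, n2;

    for (;;) {
        n1 = fillbuf(fd1, b1, sizeof(b1));
        n2 = fillbuf(fd2, b2, sizeof(b2));
        if (n1 != n2)
            return (1);     /* one stream ended before the other */
        if (n1 == 0)
            return (0);     /* both at EOF with no difference */
        /*
         * Opportunistic memcmp() over the whole chunk; the real code
         * only drops to a byte-by-byte loop (to locate the exact
         * offset and line) when this reports a difference.
         */
        if (memcmp(b1, b2, (size_t)n1) != 0)
            return (1);
    }
}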
On this machine, considering only user time, it's now about 3 to 4 times faster for regular files and 6 to 10 times faster for non-regular files. If -s/-l/-x is used (so it doesn't need to count lines), it's about 7 to 9 times faster for regular files and more than 20 times faster for non-regular files.
cmp needed a fast way to count lines, so I added one to wc too (sharing the code the same way syslogd/wall do). wc is now 3 to 5 times faster when only counting lines (-l without -L). The counter uses a scary-looking bit-twiddling hack from this page, though:
https://graphics.stanford.edu/~seander/bithacks.html#HasLessInWord
It seems solid, though: it never miscounted lines in my tests (with concatenated text files, binaries, random files, etc.). On some other CPUs (especially those without fast 64-bit operations) the speedup might be much smaller.
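For the curious, here's an illustration of the idea (not the actual wc/cmp code), using the countless() macro from that page: a byte equals '\n' exactly when it is less than '\n' + 1 but not less than '\n', so subtracting two countless() results counts the newlines in a 64-bit word.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ONES    (~(uint64_t)0 / 255)    /* 0x0101010101010101 */

/*
 * countless(x, n) from the bithacks page above: number of bytes in x
 * that are strictly less than n (valid for 0 <= n <= 128).
 */
static inline uint64_t
countless(uint64_t x, uint64_t n)
{
    uint64_t m;

    m = (ONES * (127 + n) - (x & (ONES * 127))) & ~x & (ONES * 128);
    return (m / 128 % 255);
}

/* Count '\n' bytes in buf, eight bytes at a time. */
static size_t
count_newlines(const unsigned char *buf, size_t len)
{
    size_t lines = 0;
    uint64_t w;

    while (len >= sizeof(w)) {
        memcpy(&w, buf, sizeof(w));     /* avoids unaligned loads */
        lines += countless(w, '\n' + 1) - countless(w, '\n');
        buf += sizeof(w);
        len -= sizeof(w);
    }
    while (len > 0) {                   /* leftover tail, byte by byte */
        lines += (*buf++ == '\n');
        len--;
    }
    return (lines);
}

The whole thing is plain 64-bit integer arithmetic, which is why CPUs without fast 64-bit operations benefit less.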