Page MenuHomeFreeBSD

making cmp faster, especially on special files (and speed up wc -l too while at it)
Needs ReviewPublic

Authored by sigsys_gmail.com on Sep 3 2018, 4:42 AM.

Details

Reviewers
None
Group Reviewers
Contributor Reviews (base)
Summary

While using cmp to check a disk for errors, I noticed that it was pretty slow.

The first problem is that it's doing small reads on the disk device file, which in turn issues small read transactions to the drive. Throughput was 50MB/s. Using dd on the disk with 128KB blocks yields 179MB/s.

Piping it through dd instead made it faster:
dd if=/dev/ada1p1 bs=128k | cmp - /dev/zero

But throughput was only 95MB/s and for some reason dd was stuck at 75% CPU. Oddly enough, this command was faster:
cmp <(dd if=/dev/ada1p1 bs=128k) /dev/zero

Then throughput was 139MB/s and the CPU was maxed out.

Turns out cmp has a mmap-based optimization that only applies to regular files; special files and pipes, etc, all fall back to a naive stdio-based implementation.

So I changed it to use read syscalls directly instead (with MAXBSIZE reads like wc(1) is doing), which makes it send larger read requests when used on a disk device file directly (and also use less CPU time). Then I added an opportunistic memcmp() and optimized the line counting for both the regular and special file cases. They both use a common routine to compare chunks now.

On this machine, considering only user time, now it's about 3 to 4 times faster for regular files and 6 to 10 times faster for non-regular files. If -s/-l/-x are used (so it doesn't need to count lines) it's about 7 to 9 times faster for regular files and more than 20 times faster for non-regular files.

It needed a fast way to count lines, so I added it to wc too (sharing the code the same way syslogd/wall are doing). It's now 3 to 5 times faster when only counting lines (-l without -L). It uses a scary looking bit twiddling hack from this page though:
https://graphics.stanford.edu/~seander/bithacks.html#HasLessInWord
It seems solid though. It's never miscounting lines based on my tests (with concatenated text files, binaries, random files, etc). On some other CPUs (especially those that won't do fast 64 bits operations) the speed up might be much less though.

Test Plan

Seems to work fine, but needs more testing and some review.

Diff Detail

Lint
Lint Skipped
Unit
Unit Tests Skipped

Event Timeline

After testing on an old Pentium 4, turns out the line counting algorithm was slower than the naive implementation because operations on 64 bits long longs were slow. Changing it to use 32 bits long makes it faster but only 1.5 times or so faster.

mmacy added a subscriber: mmacy.Sep 4 2018, 1:45 AM

> 95% of the user base is going to be running 64 bit. One has to be pretty dedicated to not have e-wasted any 32-bit hardware by now. If you care about 32 bit performance perhaps have an ifdef LP64 around the type used.

> If you care about 32 bit performance perhaps have an ifdef LP64 around the type used.

Yeah the way I changed it it's still going to use 64 bits operations on LP64 platforms. So it's still going to have the same big speed up there, but it's no longer going to slow down the 32 bits case.

Also I should add that even without that weird bit twiddling code, this patch still makes the whole thing significantly faster in all cases, so it could be removed if it's too ugly to commit.

mmacy added a comment.Sep 4 2018, 2:01 AM

> If you care about 32 bit performance perhaps have an ifdef LP64 around the type used.

Yeah the way I changed it it's still going to use 64 bits operations on LP64 platforms. So it's still going to have the same big speed up there, but it's no longer going to slow down the 32 bits case.
Also I should add that even without that weird bit twiddling code, this patch still makes the whole thing significantly faster in all cases, so it could be removed if it's too ugly to commit.

Ok. I just read the comment. Apart from the bit twiddling hack the code looks straightforward enough. Maybe it’s not fair to ask this, but it would be comforting if there were regression tests attached to this change.

Ok. I just read the comment. Apart from the bit twiddling hack the code looks straightforward enough. Maybe it’s not fair to ask this, but it would be comforting if there were regression tests attached to this change.

I'll look into it. I'll have to figure out how the FreeBSD regression tests work first.

If this look like something that might be committable I'll recheck and retest everything next time I've got enough free time for this. I might write some tests while at it.

Some cleanups, added a wc test case, fixed a cmp test case, fixed some wc logic that made it avoid its (already existing) fast path on STDIN.

I reread the changes and did some more testing and I found no problems.

Don't mind me, just enjoying my very fast cmp (and wc -l).

Diff updated for 12-CURRENT.

Been using this for two weeks with no problems.

Still had this in my tree, updated for 13-CURRENT.