sort(1): Memoize MD5 computation to reduce repeated computation
ClosedPublic
Actions

Authored by cem on Apr 11 2019, 11:14 PM.

Details

Reviewers

markj
kevans
jhb

Commits

rS346175: sort(1): Memoize MD5 computation to reduce repeated computation

Summary

Experimentally, reduces sort -R time of a 148160 line corpus from about
3.15s to about 0.93s on this particular system.

There's probably room for improvement using some digest other than md5,
but I don't want to look at sort(1) anymore.

Test Plan

So the easy/fast thing would be to just assign unique indices to each element (CTR-mode cipher blocks like AES or Chacha) and radix sort based on that, right? You can sort in O(N) time instead of O(N log N) because of the radix property. The difficulty is that sort(1) is expected to process inputs that do not fit in memory, and will spill to disk with partial results and merge later. So if you generate this extra metadata, you face some challenges:

it must be persisted somehow if the memory contents are flushed to a temporary file; the easiest way *might* be to just throw it in as an extra record in the row, but the more efficient way would probably be to persist the CTR-mode generator state at the beginning of a file and associate it with the specific temporary file; the latter is easy to do with code we have header access for, but difficult for e.g., OpenSSL. An alternative to preserving the entire internal data structure would be to simply roll the IV used and reset the cipher every time we spill a newly processed file. That could work.
(you'd like to use OpenSSL, because it has the various assembly optimized implementations and it's already in base.)
during merge, the radix indices must be re-associated with the correct corresponding records
if records are short, this extra index may be essentially double the memory used by records, so it's not like we can just keep it in memory. This makes the uniqueness property more difficult to ensure. (Not that md5 guarantees uniqueness, exactly.)
The records themselves must be checked for equality before we consult the radix, due to sort -R's sort of odd property "that the equal keys sort together." I'm not sure if that is intentionally or just a natural result of the chosen implementation. I'm not sure if it would be a POLA violation to break that or not, but I'd probably err on the side of not breaking it. (There is no POSIX definition of sort -R, and in fact our -R flag actually conflicts with NetBSD, where -R <char> means "use char as the record separator.")
Finally, we don't know the record count in advance; input may even come from a pipe like stdin. So we cannot "correctly size" our radix indices to the total record count to save a little memory. Maybe hard-coding 16 bytes (same as MD5) would be adequate.

Maybe I'm thinking about this wrong, but the above is what's holding me back from just replacing the MD5 consistent hashing scheme with a seeded block cipher radix scheme.

A tangential incremental improvement might be to just swap MD5 for a faster digest. I don't know of any reason a *cough* cryptographic digest like *cough* MD5 is required for this application; any old hash function will create the same output for identical inputs, satisfying that equal keys sort together property. Some benchmarks put SHA1 and Blake2b on about even footing, mildly faster than MD5. Something more suitable might be something like Yann Collet's XXH3 (which can create digests as large as MD5's), or others in this list.

Diff Detail

Repository

rS FreeBSD src repository - subversion

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

cem created this revision.Apr 11 2019, 11:14 PM

Harbormaster completed remote builds in B23637: Diff 56114.Apr 11 2019, 11:14 PM

Small typo.

usr.bin/sort/coll.c
984 ↗	(On Diff #56114)	s/memoize/memorize

cem added inline comments.Apr 12 2019, 8:03 AM

usr.bin/sort/coll.c
984 ↗	(On Diff #56114)	It's not a typo: https://en.wikipedia.org/wiki/Memoization

Thanks, and sorry for the noise!

In D19885#427190, @juan.molina_club.fr wrote:

Thanks, and sorry for the noise!

No problem, I appreciate that you're checking out random differentials for spelling errors before commit. It's good to spell well and be precise! Cheers.

jhb added inline comments.Apr 12 2019, 5:31 PM

usr.bin/sort/coll.c
1016 ↗	(On Diff #56114)	This is fine as-is. To be pedantic, you could be slightly more optimal if you structured the code as something like: if (kv1->hint->status == HS_UNINITIALIZED) randomcoll_init_hint(kv1); if (kv2->hint->status == HS_UNINITIALIZED) randomcoll_init_hint(kv2); return (memcmp(kv1->hint->v.Rh.cached, ...)); Doing this would mean that randomcoll_init_hint() would do the actual MD5 now, something like: void randomcoll_init_hint(struct key_value *kv) { MD5_CTX ctx; memcpy(&ctx, &md5_ctx, sizeof(MD5_CTX)); MD5Update(&ctx, bwsrawdata(kv->k), bsrawlen(kv->k)); MD5Final(kv->hint->v.Rh.cached, &ctx); kv->hint->status = HS_INITIALIZED; } I'm fine either way though.
1020 ↗	(On Diff #56114)	I don't understand why this comparison is here. This is just going to redo the same comparison below at the end of the function so you are always rehashing if there's a hash collision? It seems like this should just be 'return (memcmp(...));'

cem edited the test plan for this revision. (Show Details)Apr 12 2019, 5:31 PM

cem marked an inline comment as done.Apr 12 2019, 6:44 PM

cem added inline comments.

usr.bin/sort/coll.c
1016 ↗	(On Diff #56114)	Yeah, I thought of doing something like that. See note below.
1020 ↗	(On Diff #56114)	The `Rh.cached` hint only stores the first 12/16 bytes of the MD5. So collisions are possible in the cached prefix that are not possible in the full hash (something like 50% chance at 2^48 items, I think). So two other options might be: Just ignore the increased likelihood of a collision and rely on the truncated digest; I don't think 16 bytes of digest was chosen for any scientific reason. Or store the full 16 byte digest in the hint. This would bloat out the size of the hint union slightly on 32-bit architectures, which is the only reason I didn't just do that.
usr.bin/sort/coll.h
41–96 ↗	(On Diff #56114)	Note the size of (1) R_hint::cached, and (2) other hints in this union, which is allocated for any collation method that requests hinting. To avoid bloating the union, I only cache the first 12 bytes of the MD5.

jhb accepted this revision.Apr 12 2019, 7:00 PM

jhb added inline comments.

usr.bin/sort/coll.c
1020 ↗	(On Diff #56114)	Ok, I think your code is fine as-is then. My initial reaction would be to do 2) and store the full hash, but on second blush I agree with your approach. I would maybe add a comment to note this as it's non-obvious IMO. Maybe when Rhint is defined next to the magic '12', something like: /* * This stores the first 12 bytes of the hash rather than the full hash to avoid * increasing the size of the hint object. */