Experimentally, reduces sort -R time of a 148160 line corpus from about
3.15s to about 0.93s on this particular system.
There's probably room for improvement using some digest other than md5,
but I don't want to look at sort(1) anymore.
Differential D19885
sort(1): Memoize MD5 computation to reduce repeated computation cem on Apr 11 2019, 11:14 PM. Authored by Tags None Referenced Files
Subscribers
Details Experimentally, reduces sort -R time of a 148160 line corpus from about There's probably room for improvement using some digest other than md5, So the easy/fast thing would be to just assign unique indices to each element (CTR-mode cipher blocks like AES or Chacha) and radix sort based on that, right? You can sort in O(N) time instead of O(N log N) because of the radix property. The difficulty is that sort(1) is expected to process inputs that do not fit in memory, and will spill to disk with partial results and merge later. So if you generate this extra metadata, you face some challenges:
Maybe I'm thinking about this wrong, but the above is what's holding me back from just replacing the MD5 consistent hashing scheme with a seeded block cipher radix scheme. A tangential incremental improvement might be to just swap MD5 for a faster digest. I don't know of any reason a *cough* cryptographic digest like *cough* MD5 is required for this application; any old hash function will create the same output for identical inputs, satisfying that equal keys sort together property. Some benchmarks put SHA1 and Blake2b on about even footing, mildly faster than MD5. Something more suitable might be something like Yann Collet's XXH3 (which can create digests as large as MD5's), or others in this list.
Diff Detail
Event Timeline
Comment Actions No problem, I appreciate that you're checking out random differentials for spelling errors before commit. It's good to spell well and be precise! Cheers.
|