So the easy/fast thing would be to just assign a unique index to each element (e.g., keystream blocks from a CTR-mode cipher like AES or ChaCha) and radix sort on that, right? You can sort in O(N) time instead of O(N log N) because radix sort isn't comparison-based. The difficulty is that `sort(1)` is expected to process inputs that do not fit in memory, and will spill partial results to disk and merge them later. So if you generate this extra metadata, you face some challenges:
- it must be persisted somehow if the in-memory contents are flushed to a temporary file. The easiest way *might* be to store it as an extra field in each row, but the more efficient way would probably be to persist the CTR-mode generator state at the start of each temporary file and associate it with that specific file; the latter is easy to do with code we have header access to, but difficult for, e.g., OpenSSL. An alternative to preserving the entire internal data structure would be to simply roll the IV and reset the cipher every time we spill a newly processed file (see the sketch after this list). That could work.
- (you'd like to use OpenSSL, because it has the various assembly-optimized implementations and it's already in base.)
- during merge, each radix index must be re-associated with its corresponding record
- if records are short, this extra index may essentially double the memory used by the records, so it's not like we can just keep it all in memory. That makes the uniqueness property more difficult to ensure. (Not that MD5 guarantees uniqueness, exactly.)
- The records themselves must be checked for equality before we consult the radix, due to `sort -R`'s somewhat odd property "that the equal keys sort together" (see the comparator sketch after this list). I'm not sure if that is intentional or just a natural result of the chosen implementation, and I'm not sure whether breaking it would be a POLA violation, but I'd probably err on the side of not breaking it. (There is no POSIX definition of `sort -R`, and in fact our `-R` flag conflicts with NetBSD's, where `-R <char>` means "use `char` as the record separator.")
- Finally, we don't know the record count in advance; input may even come from a pipe like stdin. So we cannot "correctly size" our radix indices to the total record count to save a little memory. Maybe hard-coding 16 bytes (the same width as MD5) would be adequate.
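To make the IV-rolling idea concrete, here is a minimal sketch assuming OpenSSL's EVP interface and AES-128-CTR (the `radix_gen` names are mine, not anything in `sort(1)`): each record's 16-byte index is just the next keystream block, and resetting with a fresh IV at every spill means no cipher-internal state ever has to be written out.

```
#include <openssl/evp.h>

/*
 * Hypothetical sketch, not sort(1) code: unique per-record indices as
 * AES-128-CTR keystream blocks.  CTR mode evaluates the AES permutation
 * on a distinct counter per block, so indices never repeat under one
 * (key, IV) pair.
 */
struct radix_gen {
	EVP_CIPHER_CTX	*ctx;
	unsigned char	 key[16];	/* randomized once per sort run */
};

/* (Re)start the keystream; call with a fresh IV at every spill file. */
static int
radix_gen_reset(struct radix_gen *g, const unsigned char iv[16])
{
	if (g->ctx == NULL && (g->ctx = EVP_CIPHER_CTX_new()) == NULL)
		return (-1);
	return (EVP_EncryptInit_ex(g->ctx, EVP_aes_128_ctr(), NULL,
	    g->key, iv) == 1 ? 0 : -1);
}

/* Next 16-byte index: encrypting zeros yields the raw keystream. */
static int
radix_gen_next(struct radix_gen *g, unsigned char idx[16])
{
	static const unsigned char zero[16];
	int len;

	return (EVP_EncryptUpdate(g->ctx, idx, &len, zero,
	    sizeof(zero)) == 1 ? 0 : -1);
}
```

If the per-file IVs are constructed so their counter ranges can't overlap (say, the spill-file number in the IV's high 64 bits), then every index across every file is the AES permutation of a distinct counter, and uniqueness holds globally, not just within one file.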
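On the equal-keys property: it does look like a natural consequence of comparing digests. A comparator of roughly the following shape (struct and names invented here for illustration) shows why, since equal keys get equal digests by construction:

```
#include <string.h>

/*
 * Illustrative only: how a digest-based -R comparator makes equal keys
 * sort together.  The index is a seeded digest of the key, so equal
 * keys collide by construction; the raw key breaks ties between
 * distinct keys whose digests happen to collide.
 */
struct rec {
	unsigned char	 idx[16];	/* digest of (seed, key) */
	const unsigned char *key;
	size_t		 keylen;
};

static int
randcmp(const struct rec *a, const struct rec *b)
{
	size_t n;
	int r;

	if ((r = memcmp(a->idx, b->idx, sizeof(a->idx))) != 0)
		return (r);
	/* Digest collision: fall back to a deterministic key compare. */
	n = a->keylen < b->keylen ? a->keylen : b->keylen;
	if ((r = memcmp(a->key, b->key, n)) != 0)
		return (r);
	return ((a->keylen > b->keylen) - (a->keylen < b->keylen));
}
```

With truly unique per-record indices there is no comparator like this: equal records carry different indices, so a "compare records first, then indices" rule isn't even a consistent ordering (two equal records can straddle a third, distinct one). That's why the equality check would have to happen before the radix is consulted at all.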
Maybe I'm thinking about this wrong, but the above is what's holding me back from just replacing the MD5 consistent hashing scheme with a seeded block cipher radix scheme.
A tangential, incremental improvement might be to just swap MD5 for a faster digest. I don't know of any reason a *cough* cryptographic digest like *cough* MD5 is required for this application; any old hash function will produce the same output for identical inputs, satisfying the equal-keys-sort-together property. Some benchmarks put SHA1 and Blake2b on about even footing, mildly faster than MD5. Something more suitable might be Yann Collet's [[ http://fastcompression.blogspot.com/2019/03/presenting-xxh3.html | XXH3 ]] (which can produce digests as wide as MD5's), or others in [[ https://github.com/Cyan4973/smhasher | this list ]].
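For that swap, a drop-in might look like the sketch below, assuming the xxHash library (XXH3 has been marked stable since xxHash 0.8.0; `key_digest` and the seeding policy are my invention): XXH3-128 is the same width as an MD5 digest and takes a 64-bit seed that could be randomized once per invocation.

```
#include <stdint.h>
#include <string.h>

#include <xxhash.h>	/* link with -lxxhash */

/*
 * Hypothetical MD5 replacement: a seeded 128-bit XXH3 digest of the
 * sort key.  Identical keys under the same seed still collide by
 * construction, preserving the equal-keys-sort-together behavior.
 */
static void
key_digest(const void *key, size_t len, uint64_t seed,
    unsigned char out[16])
{
	XXH128_hash_t h = XXH3_128bits_withSeed(key, len, seed);

	memcpy(out, &h.low64, sizeof(h.low64));
	memcpy(out + 8, &h.high64, sizeof(h.high64));
}
```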