The reimplementation is a bit cleaner than the original code.
Optimised implementations are provided for amd64 and aarch64.
For amd64, we have three implementations: a baseline one, one
using ANDN from BMI1, and one using AVX-512 (though it's not
really vectorised). Here's the performance:
AMD Ryzen 7 PRO 7840U w/ Radeon 780M Graphics (Zen 4):
  orig      13.22s    774.6 MB/s
  generic   13.50s    758.5 MB/s
  baseline  10.83s    945.5 MB/s
  bmi1       9.62s   1062.4 MB/s
  avx512    10.94s    936.0 MB/s

11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz (Tigerlake):
  orig      16.90s    605.9 MB/s
  generic   17.42s    587.8 MB/s
  baseline  13.38s    765.3 MB/s
  bmi1      11.99s    854.0 MB/s
  avx512    10.61s    965.1 MB/s

ARM Cortex-X1C (Windows 2023 Dev Kit perf core):
  pre       35.2s     291 MB/s
  scalar    34.5s     297 MB/s

ARM Cortex-A78C (Windows 2023 Dev Kit efficiency core):
  pre       46.8s     219 MB/s
  scalar    44.5s     230 MB/s
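For context on the bmi1 variant: ANDN computes ~a & b in a single
instruction, which maps directly onto the complemented terms in MD5's
round functions. A minimal sketch (not the committed code) of the C
pattern a compiler folds into ANDN when built with BMI1 enabled:

```c
#include <stdint.h>

/* MD5's F function: (x & y) | (~x & z).  With BMI1, the ~x & z
 * term becomes one ANDN, dropping the separate NOT and shortening
 * the dependency chain inside each round. */
static uint32_t
md5_f(uint32_t x, uint32_t y, uint32_t z)
{
	return (x & y) | (~x & z);
}

/* MD5's I function: y ^ (x | ~z) -- the NOT here is foldable too. */
static uint32_t
md5_i(uint32_t x, uint32_t y, uint32_t z)
{
	return y ^ (x | ~z);
}
```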
The kernel always gets the "generic" version. A macro disables loop
unrolling for the copy in stand/libsa, saving precious loader space.
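The size/speed switch can be pictured roughly like this (the macro
name MD5_SMALL is hypothetical, not the committed one): the loader
build defines it to keep the rounds in a compact loop instead of
expanding all the steps inline.

```c
#include <stdint.h>

/* Hypothetical sketch: stand/libsa would build with MD5_SMALL
 * defined, trading speed for size by looping over the steps
 * rather than emitting each one inline. */
#ifdef MD5_SMALL
#define MD5_ROUNDS(op)						\
	do {							\
		for (int i = 0; i < 16; i++)			\
			op(i);					\
	} while (0)
#else
#define MD5_ROUNDS(op)						\
	do {							\
		op(0);  op(1);  op(2);  op(3);			\
		op(4);  op(5);  op(6);  op(7);			\
		op(8);  op(9);  op(10); op(11);			\
		op(12); op(13); op(14); op(15);			\
	} while (0)
#endif
```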
Obtained from: https://github.com/animetosho/md5-optimisation/