The reimplementation is a bit cleaner than the original code,
although it is also slightly slower. This shouldn't matter too
much as we have asm code for the major platforms.
Optimised implementations are provided for amd64 and aarch64.
For amd64, we have three implementations: a baseline one, one using
the ANDN instruction from BMI1, and one using AVX-512 (though it is
not really vectorised). Here's the performance:
```
AMD Ryzen 7 PRO 7840U w/ Radeon 780M Graphics (Zen 4):
orig     13.22s  774.6 MB/s
generic  13.50s  758.5 MB/s
baseline 10.83s  945.5 MB/s
bmi1      9.62s 1062.4 MB/s
avx512   10.94s  936.0 MB/s

11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz (Tigerlake):
orig     16.90s  605.9 MB/s
generic  17.42s  587.8 MB/s
baseline 13.38s  765.3 MB/s
bmi1     11.99s  854.0 MB/s
avx512   10.61s  965.1 MB/s
ARM Cortex-X1C (Windows 2023 Dev Kit perf core):
orig     35.2s   291 MB/s
generic  36.4s   281 MB/s
baseline 34.5s   297 MB/s

ARM Cortex-A78C (Windows 2023 Dev Kit efficiency core):
orig     46.8s   219 MB/s
generic  47.3s   216 MB/s
baseline 44.5s   230 MB/s
```
The kernel always gets the "generic" version. A macro makes it so
that the copy in stand/libsa is not unrolled, saving precious loader
space. I'm not sure how to apply the SIMD code to all uses of MD5.

This changeset will have to be reworked when D34497 lands. It also
anticipates D34498 and no longer provides the transform and block
symbols.
Obtained from: https://github.com/animetosho/md5-optimisation/