The implementation of SHA1 is replaced with a transcription of
Go's implementation, which is much easier to read and delivers
similar or better performance than the old SSLeay code.
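
For readers unfamiliar with the algorithm, here is a minimal portable
sketch of a SHA-1 block function with one-shot padding, written from
the FIPS 180-4 description. It is not the code from this change and
not a copy of Go's source; the names rol32, sha1_block and sha1_sketch
are made up for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Rotate a 32-bit word left by n bits (0 < n < 32). */
static uint32_t
rol32(uint32_t x, int n)
{
	return ((x << n) | (x >> (32 - n)));
}

/* Mix one 64-byte block into the five-word state h. */
static void
sha1_block(uint32_t h[5], const uint8_t p[64])
{
	uint32_t w[16], f, k, t;
	uint32_t a = h[0], b = h[1], c = h[2], d = h[3], e = h[4];
	int i;

	/* Load the block as sixteen big-endian words. */
	for (i = 0; i < 16; i++)
		w[i] = (uint32_t)p[4 * i] << 24 | (uint32_t)p[4 * i + 1] << 16 |
		    (uint32_t)p[4 * i + 2] << 8 | (uint32_t)p[4 * i + 3];

	for (i = 0; i < 80; i++) {
		/* Extend the message schedule in a 16-word ring buffer. */
		if (i >= 16) {
			t = w[(i - 3) & 15] ^ w[(i - 8) & 15] ^
			    w[(i - 14) & 15] ^ w[i & 15];
			w[i & 15] = rol32(t, 1);
		}
		/* Round function and constant change every 20 rounds. */
		if (i < 20) {
			f = (b & c) | (~b & d); k = 0x5A827999;
		} else if (i < 40) {
			f = b ^ c ^ d; k = 0x6ED9EBA1;
		} else if (i < 60) {
			f = (b & c) | (b & d) | (c & d); k = 0x8F1BBCDC;
		} else {
			f = b ^ c ^ d; k = 0xCA62C1D6;
		}
		t = rol32(a, 5) + f + e + k + w[i & 15];
		e = d; d = c; c = rol32(b, 30); b = a; a = t;
	}
	h[0] += a; h[1] += b; h[2] += c; h[3] += d; h[4] += e;
}

/* Hash len bytes at data into a 20-byte digest in one call. */
void
sha1_sketch(const void *data, size_t len, uint8_t digest[20])
{
	uint32_t h[5] = { 0x67452301, 0xEFCDAB89, 0x98BADCFE,
	    0x10325476, 0xC3D2E1F0 };
	const uint8_t *p = data;
	uint8_t tail[128] = { 0 };
	size_t off, rem = len % 64;
	/* The 0x80 marker plus 8 length bytes need 9 bytes of room. */
	size_t padlen = rem < 56 ? 64 : 128;
	uint64_t bits = (uint64_t)len * 8;
	int i;

	for (off = 0; off + 64 <= len; off += 64)
		sha1_block(h, p + off);

	/* Final block(s): leftover bytes, 0x80, zeros, bit length. */
	if (rem > 0)
		memcpy(tail, p + off, rem);
	tail[rem] = 0x80;
	for (i = 0; i < 8; i++)
		tail[padlen - 1 - i] = (uint8_t)(bits >> (8 * i));
	for (off = 0; off < padlen; off += 64)
		sha1_block(h, tail + off);

	/* Emit the state as big-endian bytes. */
	for (i = 0; i < 5; i++) {
		digest[4 * i] = (uint8_t)(h[i] >> 24);
		digest[4 * i + 1] = (uint8_t)(h[i] >> 16);
		digest[4 * i + 2] = (uint8_t)(h[i] >> 8);
		digest[4 * i + 3] = (uint8_t)h[i];
	}
}
```

Calling sha1_sketch("abc", 3, digest) should yield the well-known test
vector a9993e36 4706816a ba3e2571 7850c26c 9cd0d89d.
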
Some of Go's assembly implementations, paired with new hand-written
code, are used to speed up the functions on the popular platforms
amd64 and aarch64. More platforms can be added if there is
sufficient interest.
For amd64, three implementations are provided: one using just scalar
instructions, one using AVX2, and one using SHANI (SHA new
instructions). For aarch64, two are provided: one using just scalar
instructions and one using the widespread sha1 extensions.
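
Choosing among several implementations at run time can be sketched as
below. This is only an illustration of the idea for amd64, not the
dispatch mechanism used by this change: the feature bits, the table
and the function sha1_pick_impl are all hypothetical, and real code
would derive the feature mask from CPUID or an equivalent OS facility.

```c
#include <stddef.h>

/* Hypothetical CPU feature bits, not taken from any real header. */
enum {
	FEAT_AVX2 = 1 << 0,
	FEAT_SHANI = 1 << 1
};

struct sha1_impl {
	const char *name;	/* label as used in the commit message */
	unsigned required;	/* feature bits the variant depends on */
};

/* Candidates in order of preference; scalar needs no extensions. */
static const struct sha1_impl amd64_impls[] = {
	{ "shani", FEAT_SHANI },
	{ "avx2", FEAT_AVX2 },
	{ "scalar", 0 }
};

/* Return the first variant whose requirements the CPU satisfies. */
const char *
sha1_pick_impl(unsigned features)
{
	size_t i, n = sizeof(amd64_impls) / sizeof(amd64_impls[0]);

	for (i = 0; i < n; i++) {
		unsigned req = amd64_impls[i].required;

		if ((features & req) == req)
			return (amd64_impls[i].name);
	}
	return ("scalar");
}
```

With this table, a CPU reporting both FEAT_AVX2 and FEAT_SHANI gets
"shani", one reporting only FEAT_AVX2 gets "avx2", and anything else
falls back to "scalar".
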
For a 10 GiB input file, we get the following performance figures:

On amd64 (11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz):

	old:	16.7s ( 613 MB/s)
	scalar:	14.5s ( 706 MB/s)
	avx2:	10.5s ( 975 MB/s)
	shani:	 5.6s (1829 MB/s)

On aarch64 (Windows 2023 Dev Kit, ARM Cortex A78C / ARM Cortex X1C):

Performance core:

	pre	43.1s (238 MB/s)
	generic	41.3s (247 MB/s)
	scalar	35.0s (293 MB/s)
	sha1	12.8s (800 MB/s)

Efficiency core:

	pre	54.2s (189 MB/s)
	generic	55.9s (183 MB/s)
	scalar	43.0s (238 MB/s)
	sha1	16.2s (632 MB/s)
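
As a cross-check, the throughput figures appear to be the input size
in units of 2^20 bytes (10 GiB = 10240) divided by the wall time,
give or take rounding of the reported times. A small sketch of that
arithmetic follows; the helper names are made up for illustration.

```c
/* Benchmark input size: 10 GiB expressed in units of 2^20 bytes. */
#define INPUT_MIB	(10.0 * 1024.0)

/* Throughput for a run that took the given wall time in seconds. */
double
throughput_mib_per_s(double seconds)
{
	return (INPUT_MIB / seconds);
}

/* How many times faster the new run is than the old one. */
double
speedup(double old_seconds, double new_seconds)
{
	return (old_seconds / new_seconds);
}
```

For example, throughput_mib_per_s(16.7) is about 613 and
throughput_mib_per_s(5.6) is about 1829, matching the amd64 "old" and
"shani" rows, and speedup(16.7, 5.6) is just under 3.
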