The reimplementation is a bit cleaner than the original code.
Optimised implementations are provided for amd64 and aarch64.
For amd64, we have three implementations: a baseline one, one
using ANDN from BMI1, and one using AVX-512 (though it's not
really vectorised). Here's the performance:
AMD Ryzen 7 PRO 7840U w/ Radeon 780M Graphics (Zen 4):
  orig      13.22s    774.6 MB/s
  generic   13.50s    758.5 MB/s
  baseline  10.83s    945.5 MB/s
  bmi1       9.62s   1062.4 MB/s
  avx512    10.94s    936.0 MB/s

11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz (Tigerlake):
  orig      16.90s    605.9 MB/s
  generic   17.42s    587.8 MB/s
  baseline  13.38s    765.3 MB/s
  bmi1      11.99s    854.0 MB/s
  avx512    10.61s    965.1 MB/s

ARM Cortex-X1C (Windows 2023 Dev Kit perf core):
  pre       35.2s     291 MB/s
  scalar    34.5s     297 MB/s

ARM Cortex-A78C (Windows 2023 Dev Kit efficiency core):
  pre       46.8s     219 MB/s
  scalar    44.5s     230 MB/s
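For context on the bmi1 variant: ANDN computes ~a & b in a single
instruction, which maps directly onto the complemented terms in MD5's
round functions. A minimal sketch (not the committed code) of the C
pattern a compiler folds into ANDN when built with BMI1 enabled:

```c
#include <stdint.h>

/* MD5's F function: (x & y) | (~x & z).  With BMI1, the ~x & z
 * term becomes one ANDN, dropping the separate NOT and shortening
 * the dependency chain inside each round. */
static uint32_t
md5_f(uint32_t x, uint32_t y, uint32_t z)
{
	return (x & y) | (~x & z);
}

/* MD5's I function: y ^ (x | ~z) -- the NOT here is foldable too. */
static uint32_t
md5_i(uint32_t x, uint32_t y, uint32_t z)
{
	return y ^ (x | ~z);
}
```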
The kernel always gets the "generic" version. A macro disables loop
unrolling for the copy in stand/libsa, saving precious loader space.
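The size/speed switch can be pictured roughly like this (the macro
name MD5_SMALL is hypothetical, not the committed one): the loader
build defines it to keep the rounds in a compact loop instead of
expanding all the steps inline.

```c
#include <stdint.h>

/* Hypothetical sketch: stand/libsa would build with MD5_SMALL
 * defined, trading speed for size by looping over the steps
 * rather than emitting each one inline. */
#ifdef MD5_SMALL
#define MD5_ROUNDS(op)						\
	do {							\
		for (int i = 0; i < 16; i++)			\
			op(i);					\
	} while (0)
#else
#define MD5_ROUNDS(op)						\
	do {							\
		op(0);  op(1);  op(2);  op(3);			\
		op(4);  op(5);  op(6);  op(7);			\
		op(8);  op(9);  op(10); op(11);			\
		op(12); op(13); op(14); op(15);			\
	} while (0)
#endif
```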
Obtained from: https://github.com/animetosho/md5-optimisation/