Page MenuHomeFreeBSD

lib/libmd: reimplement and enhance md5
Needs ReviewPublic

Authored by fuz on Jun 21 2024, 10:27 AM.
Tags
None
Referenced Files
F132446674: D45670.id163978.diff
Fri, Oct 17, 12:39 AM
Unknown Object (File)
Wed, Oct 15, 3:01 PM
Unknown Object (File)
Wed, Oct 15, 2:58 PM
Unknown Object (File)
Wed, Oct 15, 1:35 AM
Unknown Object (File)
Sun, Oct 12, 5:19 PM
Unknown Object (File)
Sun, Oct 12, 4:22 PM
Unknown Object (File)
Sun, Oct 12, 3:41 PM
Unknown Object (File)
Fri, Oct 10, 1:54 AM

Details

Summary

The reimplementation is a bit cleaner than the original code.

Optimised implementations are provided for amd64 and aarch64.
For amd64, we have three implementations. One for baseline,
one using ANDN from BMI1 and one using AVX-512 (though it's not
really vectorised). Here's the performance:

AMD Ryzen 7 PRO 7840U w/ Radeon 780M Graphics (Zen 4):

orig            13.22s   774.6 MB/s
generic         13.50s   758.5 MB/s
baseline        10.83s   945.5 MB/s
bmi1             9.62s  1062.4 MB/s
avx512          10.94s   936.0 MB/s

11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz (Tigerlake):

orig            16.90s   605.9 MB/s
generic         17.42s   587.8 MB/s
baseline        13.38s   765.3 MB/s
bmi1            11.99s   854.0 MB/s
avx512          10.61s   965.1 MB/s

ARM Cortex-X1C (Windows 2023 Dev Kit perf core):

pre	35.2s (291 MB/s)
scalar	34.5s (297 MB/s)

ARM Cortex-A78C (Windows 2023 Dev Kit efficiency core):

pre	46.8s (219 MB/s)
scalar	44.5s (230 MB/s)

The kernel always gets the "generic" version. A macro makes it so that
the copy in stand/libsa is not unrolled, saving precious loader space.

Obtained from: https://github.com/animetosho/md5-optimisation/

Diff Detail

Repository
rG FreeBSD src repository
Lint
Lint Passed
Unit
No Test Coverage
Build Status
Buildable 67714
Build 64597: arc lint + arc unit

Event Timeline

fuz requested review of this revision.Jun 21 2024, 10:27 AM
lwhsu added inline comments.
lib/libmd/aarch64/md5block.S
5

The order is SPDX then owner from style(9).

fuz planned changes to this revision.Oct 1 2024, 3:29 PM

Needs to be refactored after the libmd rework.

  • lib/libmd: add include makefile for md5
  • lib/libmd: reimplement md5
  • lib/libmd: add scalar amd64 asm implementation for md5
  • lib/libmd/amd64: add BMI1 implementation for amd64
  • lib/libmd/amd64/md5block.S: add AVX-512 implementation
  • lib/libmd/aarch64: add md5 scalar implementation

This does things the proper way: a new include Makefile is added
so modules that need a copy of the md5 code (libc, libsa) can grab
it easily. The new md5 code is also used in the kernel (except for
the amd64 avx-512 path, as it's too expensive to turn on the FPU).
Internal symbols are decorated with _libmd_ to avoid collisions.

Can you move md5c.c in a separate change? It makes it easier to review only the changes to the file (if any)

sys/crypto/md5/md5block_aarch64.S
11 ↗(On Diff #144760)

Can you move the macros before ENTRY? It makes it easier to find where the start of the function is if you don't have to skip over them.

175–176 ↗(On Diff #144760)

Why are these commented out?

Can you move md5c.c in a separate change? It makes it easier to review only the changes to the file (if any)

Hi Andy,

The file md5c.c has been rewritten entirely. A diff can be produced, but is pretty much useless. What stays the same is the weak aliases on the bottom. You can find a the desired diff on Github.

sys/crypto/md5/md5block_aarch64.S
11 ↗(On Diff #144760)

I have placed the macros after ENTRY as they “declare” local variables of this function. I can move the ENTRY macro around, but I believe it is cleared to have the declarations of local variables after the ENTRY macro, by analogy to how C functions are laid out.

175–176 ↗(On Diff #144760)

They have been outlined and specialised in the two preceding blocks. I placed the commented out code there to make the correspondence clear; I should perhaps add a comment to make this clearer.

  • sys/crypto/md5/md5c.c: zero out context in MD5Final()
  • sys/crypto/md5/md5block_aarch64.S: improve comments, move ENTRY()
  • lib/libmd/Makefile.md5.inc: only apply -DMD5_ASM to md5c.c

[18:20:04] <@jrtc27> one review for the new C implementation
[18:20:10] <@jrtc27> one review for each new asm implementation on top of that
[18:20:58] <@jrtc27> and this screams missing depend-cleanup.sh changes to me
[18:23:07] <@jrtc27> and if USE_ASM_SOURCES is an interface into Makefile.md5.inc that's way too global a namespace to be using for it
[18:23:16] <@jrtc27> previously it was fine because it didn't leak outside lib/libmd
[18:23:39] <@jrtc27> but USE_ASM_SOURCES=0 in stand/libsa should be screaming that the name is wrong

lib/libc/Makefile
112 ↗(On Diff #144824)

Could hide this down in lib/libc/md/Makefile.inc1 and avoid changing this line?

lib/libmd/Makefile.md5.inc
9 ↗(On Diff #144824)

Why don't these other conditions influence the default of USE_ASM_SOURCES instead? Then that variable isn't a lie.

10 ↗(On Diff #144824)

Can we put these in a directory instead like the other hash functions in lib/libmd/Makefile?

sys/conf/Makefile.amd64
38 ↗(On Diff #144824)

Why aren't these in files.arch? We don't use this pattern in sys/conf, flags come from config(8).

sys/crypto/md5/md5c.c
155 ↗(On Diff #144824)

I doubt you want these in libsa, and have you verified they're actually important for performance?

sys/crypto/md5/md5dispatch_amd64.c
10 ↗(On Diff #144824)

This block of headers isn't style(9)-conformant; they need reordering and separating out into multiple blocks

fuz marked 2 inline comments as done.Jul 13 2025, 9:49 AM
fuz added inline comments.
lib/libc/Makefile
112 ↗(On Diff #144824)

It doesn't really make sense to keep a whole subdirectory and extra include Makefile around for this. The lib/libc/md directory is entirely replaced by this new file.

lib/libmd/Makefile.md5.inc
10 ↗(On Diff #144824)

This file follows the sys/crypto conventions, not the lib/libmd conventions, so it's different from the others. It doesn't make sense to add extra subdirectories to make it follow neither convention properly.

sys/conf/Makefile.amd64
38 ↗(On Diff #144824)

I have tried to solve this with changing the config file, but it didn't work. I do welcome advice for how to do better.

sys/crypto/md5/md5c.c
155 ↗(On Diff #144824)

I have. Otherwise the compiler does not unroll these loops, which prevents the compiler from properly scheduling the function. The old implementation avoided using a loop, instead opting for a macro, to cause a similar effect.

khorben added inline comments.
sys/conf/Makefile.amd64
38 ↗(On Diff #144824)

I've never done this before myself, and this is not tested - but could it look something like the following, in sys/conf/files.amd64:

md5c.c
	compile-with "${NORMAL_C} -DMD5_ASM"

And:

md5block_amd64.S       standard

And then performing this for each architecture where an assembly version is available?

sys/conf/Makefile.amd64
38 ↗(On Diff #144824)

Yes. This is the path. Funky flags like this are generally forbidden i. Makefile.foo.

If it were more than one file, we define a new macro based on NORMAL_C with the added flags.

sys/crypto/md5/md5c.c
155 ↗(On Diff #144824)

So you tested with one compiler. Meh. The danger of over optimizing stand is that ot will cause integrity issues with some compuler, somewhere in the future, possibly leading to unbootable systems. Macros with the same effect are safer.

Have you verified the older, safer versions are too slow to the booting process? I think that was jrtc27's question.

sys/conf/Makefile.amd64
38 ↗(On Diff #144824)

Thanks, I will investigate this route.

It is unfortunate that it has taken over a year for someone to tell me the correct approach; this is not documented anywhere and nobody felt like helping me until I contacted Khorben directly.

sys/crypto/md5/md5c.c
155 ↗(On Diff #144824)

I'm not sure what you mean by “safer.” None of this will be too slow for the boot process. I am submitting this change set because I regularly take md5 checksums of large files to check if they transferred correctly, so I do care about it being as fast as possible.

I can rewrite the code to use macros to force unrolling as before, but that'll achieve the same thing with code that is harder to read overall. I don't see any safety issue either way; the unroll pragma does not change the semantics of the code, it merely forces the compiler to unroll the loop.

Have you verified the older, safer versions are too slow to the booting process? I think that was jrtc27's question.

The old macro based code had the same effect (including on code size) as these unroll pragmas. I can remove them for slower, but smaller code than what we had before if that helps you.

I am not sure why we are having this conversation really; do you not see the value in getting rid of piles of macros and manually unrolled loops in favour of easy to understand loops with instructions to the compiler to unroll them?

sys/conf/Makefile.amd64
38 ↗(On Diff #144824)

You should have asked. This was burried deep in a complex review. It's a 5-10 minute exchange on irc. Whole the lack of docs is a fair jab, lots of examples are there and many others have done it.

sys/conf/Makefile.amd64
38 ↗(On Diff #144824)

I believe I have asked multiple times on IRC, and I think I have even tagged you personally. I did not get a response.

sys/conf/Makefile.amd64
38 ↗(On Diff #144824)

I don't recall seeing that, but in the fiture a quick note to arch@ would likely be a goid qay to escalate

sys/conf/Makefile.amd64
38 ↗(On Diff #144824)

So trying that, I came to the same issue I think I had last time I tried it. To make this work, we have in sys/conf/files:

crypto/md5/md5c.c               standard

and in sys/conf/files.amd64

crypto/md5/md5c.c               standard compile-with "${NORMAL_C} -DMD5_ASM"

but turns out the later setting doesn't overwrite the former! The generated Makefile just compiles with ${NORMAL_C}. If I remove the entry from sys/conf/files, the entry from sys/conf/files.amd64 is applied, but that defeats the purpose of overwriting the generic configuration with a platform-specific configuration.

What do I need to do to work around this issue?

PS: tried to reach you on IRC right now, but got no response.

sys/conf/Makefile.amd64
38 ↗(On Diff #144824)

Yes. You can't do that for any other files.. There's no 'override' concept like this in config(8).
Either you need to put it in each files.X with the special recipe (which is fragile) or you need to take it to the next level by putting

crypto/md5/md5c.c               standard compile-with "${MD5_C}"

in sys/conf/files and then add MD5_C to kern.post.mk

MD5_CFLAGS=
MD5_C=${NORMAL_C} ${MD5_CFLAGS}

and then in sys/conf/Makefile.amd64 have MD5_CFLAGS=-DMD5_ASM

But all of that hoop jumping suggests that a simpler:

#ifdef __amd64__
#define MD5_ASM
#endif

somewhere near the top of md5c.c would be the right way to encode this dependency.

sys/crypto/md5/md5c.c
155 ↗(On Diff #144824)

We've had numerous cases where, due to the weird compile flags we need for other bits of the boot loader that something has broken when highly optimized code like this that does clever tricks is deployed. This were all due to compiler bugs, but we've had to do a lot of tweaks, especially on compiler upgrades, when we've used code like this in the past. It all came to a head a few years ago when ZFS had an arm-optimized routine for one of the non-optional crypto routines without havning a normal, C fallback. Due to the different ABIs / calling conventions, this broke EFI booting (even though we could compile) on arm64 for a time because registers were used inappropriately. It's eaten up enough of my time in the past that I'm super nervous about committing to possibly having code that might cost me more time in the future.

I'm OK with letting this in if you're OK coping with any weird breakage in the diversity of environments that we deploy the bootloader to. that might be caused by this.

first peek at the arm64 assembler

sys/crypto/md5/md5block_aarch64.S
20 ↗(On Diff #144824)

So logically, all of these (and similar below) are to select the right, optimal set of arm instructions for the magnitude of this compile time constant, right? It might make sense to say so since that took some puzzling for me to arrive there. The other full macros have this sort of thing and it seems like it would be a good compromise to say that... or if I've puzzled it out wrong, maybe the right thing.

The whole k vs \k thing too is a bit confusing. Is there some logic to having the parameter named k while there's a different 'k' for the rest of the code, or is it a macro testing thing? It looks like a naked 'k' is register w9...

171 ↗(On Diff #144824)

It might make sense to move these // lines up to above the .Lloop: label.

So looked at the x86 stuff, and had a couple more things to say. One inline. One here: Since the basic algorithm is the same, is there any sane way to use the macros to have one set of "assembly" for algorithm that gets included multiple times for the different strategies to do the rounds? Maybe it's a bit too much wishful thinking, but it seems like it's part way there.

Also, the C code is massively unwound. I'll withhold judgement on that (per our IRC conversation), but is it possible to have some kind of guidepost in that code that ties the assembler and the C code together by adding comments like 'rounds0' or whatever to it?

Anyway, I don't know if these are feasible or not, but I thought I'd suggest them.

Otherwise, the glue code looks good... and the algorithm code looks somewhat similar so is likely correct.

sys/crypto/md5/md5block_amd64.S
366 ↗(On Diff #144824)

So these keys get repeated a lot... looks like maybe 5 times in various formats... Consider whether or not you can have just one copy (or maybe 2 since you unroll the first two round0 in the arm code.

fuz added a subscriber: markj.
  • stand/libsa: don't unroll md5 code when used in bootloader
  • md5c.c: improve speed
  • lib/libmd/amd64/md5dispatch.c: reorder headers
  • md5c.c: bump copyright year
  • amd64/md5block.S: deduplicate round keys
  • aarch64/md5block.S: address comments

This refactor removes most of the various Makefile bits.
We now only have MD5 SIMD in libmd. This was proposed by
@markj at EuroBSDcon.

The generic implementation has been partially unrolled,
as that seems to be needed to make it as fast as the generic
implementation we inherited from OpenSSL. Neverthless, it
can still be unrolled less than the OpenSSL code, which should
translate into some size savings in loader.

Now depends on D52909.

The AVX-512 kernel is not used on AMD, as both Zen 4 and Zen 5
have too high latency in SIMD ops for them to be an advantage.
See https://stackoverflow.com/q/79785695/417501 for some discussion.

New performance figures:

Zen 4

orig            13.22s   774.6 MB/s
generic         13.50s   758.5 MB/s
baseline        10.83s   945.5 MB/s
bmi1             9.62s  1062.4 MB/s
avx512          10.94s   936.0 MB/s

Tigerlake

orig            16.90s   605.9 MB/s
generic         17.42s   587.8 MB/s
baseline        13.38s   765.3 MB/s
bmi1            11.99s   854.0 MB/s
avx512          10.61s   965.1 MB/s
  • amd64/md5dispatch.c: remove _KERNEL conditional
  • md5c.c: fuz@freebsd.org -> fuz@FreeBSD.org
  • amd64: fuz@freebsd.org -> fuz@FreeBSD.org
  • aarch64: fuz@freebsd.org -> fuz@FreeBSD.org

Some copy-editing I forgot to do earlier.

@imp Tested the libsa changes on my workstation, no difference in behaviour observed.

sys/crypto/md5/md5block_aarch64.S
20 ↗(On Diff #144824)

Yes, correct. I'll add a comment and change the naming to be less ambiguous.