amd64: implement memcmp in assembly
ClosedPublic

Authored by mjg on Sep 26 2018, 2:19 PM.

Details

Summary

The C variant is very slow because it compiles to a single-byte comparison per loop iteration. rep cmpsb (used by memcmp in libc) turns out to have very bad throughput as well.

The variant below contains an unrolled loop for single-byte comparisons and a dedicated 32-byte loop. It significantly outperforms rep cmps even for bigger sizes (e.g. 1024 bytes).

Depending on size this is about 3-4 times faster than the current routine.

This patch is for the kernel; libc will follow later.
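
For readers without the diff at hand, the general shape described in the summary (a byte loop plus a dedicated 32-byte loop) can be sketched in C roughly as follows. This is only an illustration under stated assumptions, not the committed assembly: the name sketch_memcmp is hypothetical, the byte loop is left un-unrolled for brevity, and the 32-byte step is modelled as four 8-byte word comparisons.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical illustration only; this is not the routine from the diff.
 * Returns <0, 0 or >0 like memcmp().
 */
static int
sketch_memcmp(const void *a, const void *b, size_t len)
{
	const unsigned char *p = a, *q = b;

	/* Dedicated 32-byte loop: four 8-byte comparisons per iteration. */
	while (len >= 32) {
		uint64_t x[4], y[4];

		/* memcpy() keeps the wide loads free of alignment issues. */
		memcpy(x, p, sizeof(x));
		memcpy(y, q, sizeof(y));
		if (x[0] != y[0] || x[1] != y[1] ||
		    x[2] != y[2] || x[3] != y[3])
			break;	/* mismatch is in this block; find it below */
		p += 32;
		q += 32;
		len -= 32;
	}

	/* Byte loop: handles the tail and pinpoints a mismatch found above. */
	for (; len > 0; p++, q++, len--) {
		if (*p != *q)
			return (*p - *q);
	}
	return (0);
}
```

The wide loop only answers "are these 32 bytes equal?"; ordering is still decided bytewise, which avoids having to byte-swap words to obtain the correct sign on a little-endian machine.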

Test Plan

glibc test suite, built universe

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

sys/amd64/amd64/support.S
104 ↗(On Diff #48481)

memcmp

105 ↗(On Diff #48481)

rdi,rsi,rdx ?

120 ↗(On Diff #48481)

Why not incq ?

mjg marked an inline comment as done. Sep 27 2018, 6:44 AM
mjg added inline comments.
sys/amd64/amd64/support.S
120 ↗(On Diff #48481)

The byte iteration was taken from what the compiler generated for the C variant. I did not try comparing this against incq or leaq 1.

kib added inline comments.
sys/amd64/amd64/support.S
120 ↗(On Diff #48481)

According to the Intel throughput sheet for Skylake, it should be the same. I mean that incq is more natural to use in hand-written asm. Also I am curious.

This revision is now accepted and ready to land. Sep 27 2018, 8:52 AM
mjg marked an inline comment as done. Sep 27 2018, 9:56 AM
mjg added inline comments.
sys/amd64/amd64/support.S
120 ↗(On Diff #48481)

I'll see about microoptimizing these in the next round. Due to time constraints re 12.0 I'm trying to just get simple replacements for current pessimized routines in libc.

This revision was automatically updated to reflect the committed changes.