
amd64: implement memcmp in assembly
ClosedPublic

Authored by mjg on Sep 26 2018, 2:19 PM.

Details

Summary

The C variant is very slow as it compiles to one byte comparison per loop iteration. rep cmpsb (used by memcmp in libc) turns out to have very bad throughput as well.
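For reference, a byte-at-a-time comparison of the kind described above might look like the following. This is a minimal sketch, not the kernel's actual C source; the name `memcmp_bytewise` is hypothetical and chosen only for illustration.

```c
#include <stddef.h>

/*
 * Hypothetical illustration: compares one byte per loop iteration,
 * which is the pattern the summary says the C variant compiles to.
 */
int
memcmp_bytewise(const void *s1, const void *s2, size_t n)
{
	const unsigned char *p1 = s1, *p2 = s2;

	for (; n != 0; n--, p1++, p2++) {
		if (*p1 != *p2)
			return (*p1 - *p2);	/* sign follows the first differing byte */
	}
	return (0);
}
```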

The variant below contains an unrolled loop for one-byte comparisons and a dedicated 32-byte loop. It significantly outperforms rep cmps even for larger sizes (e.g. 1024).
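The overall strategy can be sketched in C as follows. This is an illustration under stated assumptions, not the committed assembly: the name `memcmp_chunked` is hypothetical, the 32-byte step is modeled as four 8-byte word comparisons, and the byte loop is left rolled for brevity where the patch unrolls it.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch of a 32-byte fast path with a byte-loop fallback. */
int
memcmp_chunked(const void *s1, const void *s2, size_t n)
{
	const unsigned char *p1 = s1, *p2 = s2;
	uint64_t a[4], b[4];

	/* Fast path: compare 32 bytes per iteration as four 8-byte words. */
	while (n >= 32) {
		memcpy(a, p1, sizeof(a));	/* memcpy keeps unaligned loads well defined */
		memcpy(b, p2, sizeof(b));
		if (a[0] != b[0] || a[1] != b[1] ||
		    a[2] != b[2] || a[3] != b[3])
			break;			/* mismatch is somewhere in this chunk */
		p1 += 32;
		p2 += 32;
		n -= 32;
	}

	/* Tail, or the mismatching chunk: locate the first differing byte. */
	for (; n != 0; n--, p1++, p2++) {
		if (*p1 != *p2)
			return (*p1 - *p2);
	}
	return (0);
}
```

Only equality is tested on the wide words, so the result does not depend on byte order; the sign of the return value is always derived from the first differing byte.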

Depending on size this is about 3-4 times faster than the current routine.

This is a patch for the kernel; libc will follow later.

Test Plan

glibc test suite, built universe

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

sys/amd64/amd64/support.S
104 ↗(On Diff #48481)

memcmp

105 ↗(On Diff #48481)

rdi,rsi,rdx ?

120 ↗(On Diff #48481)

Why not incq ?

mjg marked an inline comment as done. Sep 27 2018, 6:44 AM
mjg added inline comments.
sys/amd64/amd64/support.S
120 ↗(On Diff #48481)

The byte iteration was stolen from what the compiler generated for the C variant. I did not benchmark it against incq or leaq 1.

kib added inline comments.
sys/amd64/amd64/support.S
120 ↗(On Diff #48481)

According to the Intel throughput sheet for Skylake, it should be the same. I mean that incq is more natural to use in hand-written asm. Also I am curious.

This revision is now accepted and ready to land. Sep 27 2018, 8:52 AM
mjg marked an inline comment as done. Sep 27 2018, 9:56 AM
mjg added inline comments.
sys/amd64/amd64/support.S
120 ↗(On Diff #48481)

I'll see about microoptimizing these in the next round. Due to time constraints re 12.0 I'm trying to just get simple replacements for current pessimized routines in libc.

This revision was automatically updated to reflect the committed changes.