amd64: implement memcmp in assembly
Closed, Public

Authored by mjg on Sep 26 2018, 2:19 PM.

Details

Reviewers

Commits

rS338963: amd64: implement memcmp in assembly

Summary

The C variant is very slow because it compiles to one byte comparison per loop iteration. rep cmpsb (used by memcmp in libc) turns out to have very poor throughput as well.

The variant below contains an unrolled loop for single-byte comparisons and a dedicated 32-byte loop. It significantly outperforms rep cmps even at larger sizes (e.g. 1024 bytes).
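For readers without the diff at hand, the general shape of the approach can be sketched in C. This is only an illustration under stated assumptions, not the assembly from this revision: the names memcmp_bytewise and memcmp_blocked are hypothetical, the 32-byte block is modelled as four 8-byte words, and the byte loop is left for the compiler to unroll.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Baseline: one byte comparison per loop iteration (the slow C shape). */
static int
memcmp_bytewise(const void *a, const void *b, size_t n)
{
	const unsigned char *p = a, *q = b;

	for (; n != 0; n--, p++, q++) {
		if (*p != *q)
			return ((int)*p - (int)*q);
	}
	return (0);
}

/*
 * Hypothetical sketch: skip over equal 32-byte blocks, four 8-byte
 * words at a time, then fall back to the byte loop for the first
 * differing block or the remaining tail.
 */
static int
memcmp_blocked(const void *a, const void *b, size_t n)
{
	const unsigned char *p = a, *q = b;
	uint64_t x[4], y[4];

	while (n >= 32) {
		/* memcpy keeps the unaligned loads well-defined in C. */
		memcpy(x, p, sizeof(x));
		memcpy(y, q, sizeof(y));
		if (((x[0] ^ y[0]) | (x[1] ^ y[1]) |
		    (x[2] ^ y[2]) | (x[3] ^ y[3])) != 0)
			break;	/* mismatch inside this block */
		p += 32;
		q += 32;
		n -= 32;
	}
	/* Locate the differing byte, or compare the tail. */
	return (memcmp_bytewise(p, q, n));
}
```

The point of the block loop is that equal data is skipped 32 bytes per iteration, while the exact byte difference that memcmp must return is only computed once a mismatch is known to exist.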

Depending on size, this is about 3-4 times faster than the current routine.

This patch is for the kernel; libc will follow later.

Test Plan

Ran the glibc test suite; built universe.

Diff Detail

Repository

rS FreeBSD src repository - subversion

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

mjg created this revision. Sep 26 2018, 2:19 PM

Herald added subscribers: emaste, andrew, imp. Sep 26 2018, 2:19 PM

Harbormaster completed remote builds in B19824: Diff 48481. Sep 26 2018, 2:19 PM

kib added inline comments. Sep 26 2018, 10:41 PM

sys/amd64/amd64/support.S
104 ↗	(On Diff #48481)	memcmp
105 ↗	(On Diff #48481)	rdi,rsi,rdx ?
120 ↗	(On Diff #48481)	Why not incq ?

comments

mjg marked an inline comment as done. Sep 27 2018, 6:44 AM

mjg added inline comments.

sys/amd64/amd64/support.S
120 ↗	(On Diff #48481)	The byte iteration was taken from what the compiler generated for the C variant. I did not try comparing this against incq or leaq 1.

kib accepted this revision. Sep 27 2018, 8:52 AM

kib added inline comments.

sys/amd64/amd64/support.S
120 ↗	(On Diff #48481)	According to the Intel throughput sheet for Skylake, it should be the same. I mean that incq is more natural to use in hand-written asm. Also I am curious.

This revision is now accepted and ready to land. Sep 27 2018, 8:52 AM

mjg marked an inline comment as done. Sep 27 2018, 9:56 AM

mjg added inline comments.

sys/amd64/amd64/support.S
120 ↗	(On Diff #48481)	I'll see about micro-optimizing these in the next round. Due to time constraints for 12.0, I'm just trying to get simple replacements for the currently pessimized routines in libc.