
amd64: make memset less slow with mov stores
ClosedPublic

Authored by mjg on Oct 4 2018, 1:22 AM.
Referenced Files
F106953544: D17398.diff
Wed, Jan 8, 12:25 AM

Details

Summary

rep stos has a high startup cost even on modern microarchitectures like Skylake. The Intel optimization manuals discuss how, for small sizes, it is beneficial to use streaming stores. Since those cannot be used in the kernel without an extra penalty, I investigated the performance impact of plain mov stores instead.

The patch below implements a very simple scheme: a 32-byte loop followed by filling in the remainder of at most 31 bytes. It has a 256-byte cutoff at which it falls back to rep stos. It provides a significant win over the current primitive on several machines I tested (both Intel and AMD). A 64-byte loop did not provide any benefit, even for sizes that are multiples of 64.
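As a rough illustration of the scheme described above, here is a minimal C sketch. It is not the patch itself, which is amd64 assembly. The name memset_sketch, the SKETCH_CUTOFF constant, the choice to send only sizes strictly above 256 to the fallback, and the bit-test handling of the tail are all assumptions made for illustration. libc memset stands in for the rep stos path.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: sizes strictly above this take the fallback path. */
#define SKETCH_CUTOFF 256

/* Hypothetical illustration only; not the code from the patch. */
static void *
memset_sketch(void *dst, int c, size_t len)
{
    unsigned char *p = dst;
    /* Replicate the fill byte into every byte of a 64-bit word. */
    uint64_t word = (uint64_t)(unsigned char)c * 0x0101010101010101ULL;
    uint32_t half = (uint32_t)word;
    uint16_t quarter = (uint16_t)word;

    if (len > SKETCH_CUTOFF) {
        /* Stand-in for the rep stos fallback. */
        return (memset(dst, c, len));
    }

    /*
     * Main loop: 32 bytes per iteration as four 8-byte stores.
     * Fixed-size memcpy calls typically compile to plain mov stores
     * and avoid alignment and aliasing problems in portable C.
     */
    while (len >= 32) {
        memcpy(p, &word, 8);
        memcpy(p + 8, &word, 8);
        memcpy(p + 16, &word, 8);
        memcpy(p + 24, &word, 8);
        p += 32;
        len -= 32;
    }

    /* Remainder of at most 31 bytes, one size class per bit of len. */
    if (len & 16) {
        memcpy(p, &word, 8);
        memcpy(p + 8, &word, 8);
        p += 16;
    }
    if (len & 8) {
        memcpy(p, &word, 8);
        p += 8;
    }
    if (len & 4) {
        memcpy(p, &half, 4);
        p += 4;
    }
    if (len & 2) {
        memcpy(p, &quarter, 2);
        p += 2;
    }
    if (len & 1)
        *p = (unsigned char)c;

    return (dst);
}
```

A call such as memset_sketch(buf, 0, 100) would run the 32-byte loop three times and then fill the remaining 4 bytes with a single 4-byte store.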

Note there is still room for improvement. I intend to bring it to libc later as a temporary band-aid until SIMD use can be implemented.

I graphed a very basic microbenchmark filling different sizes. 'previous libc' refers to the implementation present prior to its recent replacement by the kernel variant.

skylake-memset.png (1×1 px, 95 KB)
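For readers who want to reproduce a comparison of this kind, a microbenchmark that fills different sizes could be sketched as follows. This is not the benchmark used for the graph. The names fill_fn and bench_fill, the 4096-byte buffer, and the use of clock() for timing are assumptions chosen to keep the example portable.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical signature shared by every memset variant under test. */
typedef void *(*fill_fn)(void *, int, size_t);

/*
 * Return the average CPU time in nanoseconds of one fn() call that
 * fills `size` bytes, measured over `iters` calls.
 */
static double
bench_fill(fill_fn fn, size_t size, unsigned long iters)
{
    static unsigned char buf[4096];
    volatile unsigned char sink = 0;
    clock_t start, end;

    if (iters == 0)
        return (0.0);
    if (size > sizeof(buf))
        size = sizeof(buf);

    start = clock();
    for (unsigned long i = 0; i < iters; i++) {
        fn(buf, (int)(i & 0xff), size);
        /* Read a byte back so the stores cannot be discarded. */
        sink = buf[size ? size - 1 : 0];
    }
    end = clock();
    (void)sink;

    return ((double)(end - start) * 1e9 / CLOCKS_PER_SEC / (double)iters);
}
```

Calling bench_fill(memset, n, 10000000) for a range of n values and plotting the results gives one curve per implementation. clock() has coarse resolution, so a large iteration count is needed for small sizes.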

Test Plan

glibc test suite

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Not Applicable
Unit
Tests Not Applicable