Differential D17398

amd64: make memset less slow with mov stores
Closed, Public

Authored by mjg on Oct 4 2018, 1:22 AM.

Details

Commits

rS339205: amd64: make memset less slow with mov

Summary

rep stos has a high startup cost even on modern microarchitectures such as Skylake. The Intel optimization manuals note that for small sizes it is beneficial to use streaming stores. Since those cannot be used in the kernel without an extra penalty, I investigated the performance impact of plain mov stores instead.

The patch below implements a very simple scheme: a 32-byte loop followed by filling in the remainder of at most 31 bytes. It has a 256-byte cutoff at which it falls back to rep stos. It provides a significant win over the current primitive on the several machines I tested (both Intel and AMD). A 64-byte loop did not provide any benefit, even for sizes that are multiples of 64.

Note that there is still room for improvement. I intend to bring this to libc later as a temporary band-aid until SIMD use can be implemented.

I graphed a very basic microbenchmark filling buffers of different sizes. 'previous libc' refers to the variant present prior to its recent replacement by the kernel variant.
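For readers who do not want to open the diff, the scheme described above can be sketched in C roughly as follows. This is an illustration only, not the committed code, which is assembly in support.S. The name memset_sketch and the macro SKETCH_CUTOFF are hypothetical, the libc memset call merely stands in for the rep stos path, and treating the cutoff as "sizes greater than 256" is an assumption, as is the way the remainder is split into 16/8/4/2/1-byte stores.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumption: sizes above this take the fallback path. */
#define SKETCH_CUTOFF 256

/* Hypothetical name; not the kernel's routine. */
void *
memset_sketch(void *dst, int c, size_t len)
{
	unsigned char *p = dst;
	/* Replicate the fill byte into all eight bytes of a 64-bit word. */
	uint64_t pattern = (uint64_t)(unsigned char)c * 0x0101010101010101ULL;

	/* Large sizes: stand-in for the rep stos fallback. */
	if (len > SKETCH_CUTOFF)
		return (memset(dst, c, len));

	/* Main loop: four 8-byte stores, 32 bytes per iteration. */
	while (len >= 32) {
		memcpy(p, &pattern, 8);
		memcpy(p + 8, &pattern, 8);
		memcpy(p + 16, &pattern, 8);
		memcpy(p + 24, &pattern, 8);
		p += 32;
		len -= 32;
	}

	/* Remainder: at most 31 bytes, handled by size bits. */
	if (len & 16) {
		memcpy(p, &pattern, 8);
		memcpy(p + 8, &pattern, 8);
		p += 16;
	}
	if (len & 8) {
		memcpy(p, &pattern, 8);
		p += 8;
	}
	if (len & 4) {
		memcpy(p, &pattern, 4);
		p += 4;
	}
	if (len & 2) {
		memcpy(p, &pattern, 2);
		p += 2;
	}
	if (len & 1)
		*p = (unsigned char)c;

	return (dst);
}
```

The fixed-size memcpy calls are used here only to express unaligned stores portably; an optimizing compiler typically lowers each of them to a single mov.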

skylake-memset.png (1×1 px, 95 KB)

skylake-memset.svg (76 KB)
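The benchmark behind the graph is not included in this revision. A minimal sketch of what "filling different sizes" could look like in userspace is shown below; the helper name bench_fill_ns, the buffer size, and the use of clock() are all assumptions made for illustration, and the numbers it produces are not comparable to the ones in the graph.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/*
 * Hypothetical helper: average CPU time in nanoseconds for one memset()
 * of `size` bytes. Returns a negative value on invalid arguments.
 */
double
bench_fill_ns(size_t size, unsigned long iters)
{
	static unsigned char buf[4096];
	/* Calling through a volatile pointer keeps the calls from being elided. */
	void *(*volatile fill)(void *, int, size_t) = memset;
	clock_t start, end;
	unsigned long i;

	if (size > sizeof(buf) || iters == 0)
		return (-1.0);

	start = clock();
	for (i = 0; i < iters; i++)
		fill(buf, (int)(i & 0xff), size);
	end = clock();

	return ((double)(end - start) * 1e9 / CLOCKS_PER_SEC / (double)iters);
}
```

Calling it for a range of sizes (for example 1 through 512) and plotting the results gives a curve per implementation; to compare variants, the function pointer would be pointed at each candidate routine in turn.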

Test Plan

glibc test suite

Diff Detail

Repository

rS FreeBSD src repository - subversion

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

mjg created this revision. Oct 4 2018, 1:22 AM

Herald added a subscriber: imp. Oct 4 2018, 1:22 AM

Harbormaster completed remote builds in B19946: Diff 48701. Oct 4 2018, 1:22 AM

mjg edited the summary of this revision. Oct 4 2018, 1:24 AM

mjg edited the summary of this revision. Oct 4 2018, 1:28 AM

kib accepted this revision. Oct 4 2018, 11:09 AM

This revision is now accepted and ready to land. Oct 4 2018, 11:09 AM

Closed by commit rS339205: amd64: make memset less slow with mov (authored by mjg). Oct 5 2018, 7:25 PM

This revision was automatically updated to reflect the committed changes.

Revision Contents

Path: head/sys/amd64/amd64/support.S
Size: 87 lines
Diff 48796