
amd64: further depessimize kernel string ops
Abandoned · Public

Authored by mjg on Sep 23 2018, 1:58 PM.
Referenced Files
F3777515: memmove-epyc.png
Sep 23 2018, 1:58 PM
F3777528: memmove-skylake.png
Sep 23 2018, 1:58 PM

Details

Reviewers
kib
Summary

Turns out that even very naive byte and 8-byte copy loops severely outperform ERMS (rep movsb/stosb) for small sizes, even on CPUs that advertise the bit. The patch below modifies memset, memmove and memcpy (copyin and copyout will be taken care of separately). Part of the goal is to reach a depessimized middle ground for libc before time is spent trying to do anything fancy there. The cutoff point of 128 bytes is somewhat arbitrary; chances are solid it can be moved higher with less naive loops. I did not play with alignment. A minimal sketch of the dispatch idea follows.
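To make the dispatch concrete, here is a minimal C sketch, assuming a forward-only memcpy-style copy; it is not the actual patch, which is amd64 assembly. The names (sketch_memcpy, loop_copy, erms_copy) and the compiler-memcpy loads are illustrative assumptions.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ERMS_CUTOFF	128	/* arbitrary, per the summary above */

/* rep movsb: the ERMS path the patch avoids for small sizes. */
static inline void
erms_copy(void *dst, const void *src, size_t len)
{
	__asm__ __volatile__("rep movsb"
	    : "+D" (dst), "+S" (src), "+c" (len)
	    :
	    : "memory");
}

/* Naive 8-byte loop plus a byte tail; hypothetical name. */
static void
loop_copy(void *dst, const void *src, size_t len)
{
	uint8_t *d = dst;
	const uint8_t *s = src;
	uint64_t tmp;

	while (len >= 8) {
		memcpy(&tmp, s, 8);	/* compiles to a plain 8-byte load */
		memcpy(d, &tmp, 8);
		d += 8;
		s += 8;
		len -= 8;
	}
	while (len > 0) {
		*d++ = *s++;
		len--;
	}
}

/* Size-based dispatch; hypothetical name, non-overlapping buffers. */
void
sketch_memcpy(void *dst, const void *src, size_t len)
{
	if (len < ERMS_CUTOFF)
		loop_copy(dst, src, len);
	else
		erms_copy(dst, src, len);
}
```

The naive loop wins at small sizes because rep movsb pays a fixed startup cost that dominates short copies; the cutoff marks where that cost starts to amortize.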

I benchmarked memmove for sizes 0-144 on Ivy Bridge, Westmere, Broadwell, Skylake and EPYC (note that two of these, Westmere and EPYC, don't have ERMS) and all of them showed a marked improvement. Both source and target buffers were always 8-byte aligned, though.

Note there is still room for improvement, but this is a great stepping stone, IMHO.

Below are results of hot-cache memmoves in a loop on two boxes. The x axis is the number of bytes; the y axis is ops/s. I did not scale the units as they are not important (note that the ERMS results on EPYC are bad, as expected, since it does not have the bit). A sketch of the kind of benchmark loop used follows the graphs.

memmove-epyc.png (726×634 px, 68 KB)

memmove-skylake.png (726×634 px, 70 KB)
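For reference, here is a hypothetical reconstruction of this kind of microbenchmark: a hot-cache memmove of each size in a tight loop, reporting ops/s. The iteration count, timer choice, and buffer setup are my assumptions, not details taken from the actual harness.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#define ITERS	10000000UL	/* assumed iteration count */

int
main(void)
{
	/* 8-byte aligned buffers, matching the setup described above. */
	static uint8_t src[256] __attribute__((aligned(8)));
	static uint8_t dst[256] __attribute__((aligned(8)));
	struct timespec t0, t1;

	for (size_t size = 0; size <= 144; size++) {
		clock_gettime(CLOCK_MONOTONIC, &t0);
		for (unsigned long i = 0; i < ITERS; i++) {
			memmove(dst, src, size);
			/* Barrier so the call is not hoisted out. */
			__asm__ __volatile__("" ::: "memory");
		}
		clock_gettime(CLOCK_MONOTONIC, &t1);
		double secs = (t1.tv_sec - t0.tv_sec) +
		    (t1.tv_nsec - t0.tv_nsec) / 1e9;
		printf("%3zu bytes: %.0f ops/s\n", size, ITERS / secs);
	}
	return (0);
}
```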

Test Plan

glibc test suite: passes

Diff Detail

Lint: Skipped
Unit: Tests Skipped

Event Timeline

mjg edited the summary of this revision.
mjg edited the summary of this revision.

Oops, uploaded the wrong diff.

I'm experimenting with non-naive code and I'm getting significantly better results. Will create a new revision once I settle on something.