More bcmp "optimization"
AbandonedPublic
Actions

Authored by jtl on May 18 2018, 7:18 PM.

Details

Reviewers

kib
mjg
imp

Summary

Do some more bcmp "optimization". (That's in quotes because "optimization" is rarely straightforward. What "optimizes" one person's use case may well pessimize another's.)

Main changes:

Do 8-byte comparisons in a loop prior to doing 1-byte comparisons in a loop.
In the large case, do 1-byte comparisons in a loop (instead of repe cmpsb).

In a user-space test of many different memory blocks with different sizes (up to 105 bytes) and different alignments, this seems to reduce the run time by 42%.

Additionally, running the same test @mjg ran (pmcstat to capture samples of bcmp), bcmp fell from being in 0.7% of samples to not showing up on the list at all.

One nagging worry is whether making this function larger will negatively impact other aspects of the system.

Test Plan

I've run a variety of correctness tests and (I think) verified correct behavior with a variety of alignments/sizes/data patterns. (Tests available on request.)

Diff Detail

Repository

rS FreeBSD src repository - subversion

Lint

Lint Passed

Unit

No Test Coverage

Build Status

Buildable 16676
Build 16581: arc lint + arc unit

Event Timeline

jtl created this revision.May 18 2018, 7:18 PM

Harbormaster completed remote builds in B16676: Diff 42709.May 18 2018, 7:18 PM

By the way, measurements were taken on an Intel E5-2697A v4 (32-core Broadwell).

mjg retitled this revision from More bzero "optimization" to More bcmp "optimization".May 18 2018, 7:31 PM

mjg edited the summary of this revision. (Show Details)

First a minor note is that I took the liberty of s/bzero/bcmp, which I presume was intended.

It is unclear to me how much you are willing to invest into the problem.

In general the rep & compatible stuff benefits from being avoided for small sizes. As in, there is a significant room for improvement in memset, memcpy and others. From basic tests, even with ERMS present and utilized, it is way faster to loop over sizes like 32 or 64 bytes instead of spinning up rep stosb. Are you interested in fighting off the remaining primitives? Part of the suggestion is to replace bzero with builtin_memset, bcopy with builtin_memmove and add relevant builtins for the rest (memmove, memset, memcpy). imp has a partial patch for that afair.

Part of why I left my original bcmp commit in the trivial state is that bcmp itself can be replaced with __builtin_memcmp.

Here I are sample buf sizes I recorded for buildkernel while running custom hacks: https://people.freebsd.org/~mjg/bufsizes.txt
dtrace -n 'fbt:kernel:memset_erms:entry,fbt:kernel:memmove_erms:entry,fbt:kernel:memcpy_erms:entry,fbt:kernel:copyin:entry,fbt:kernel:copyinstr:entry,fbt:kernel:copyout:entry { @[probefunc] = lquantize(arg2, 0, 256, 1); }'

The scheme is trivial as you can see, trivially adjustable for -head for your own measurements.

To make it less chaotic, here is how I think a *decent* state would look like:

all the primitives use __builtin variants (memset, memmove, memcpy). also bzero -> memset, bcopy -> memmove, bcmp -> memcmp. their assembly variants get removed.
there are nice wins from avoiding the rep stosq/movsq -> rep stosb/movsb combo. i.e. testing for any trailing bytes left is faster than spinning up rep stosb and provides next to no slodown for the case where we got some.
as seen from the bufsizes.txt file, optimisations for at least multiple-of-8 are most welcome, other sizes are probably fine too. so in particular the proposed change to bcmp in its spirit is fine, but also should be replicated to others
the majority of bzero (later memset) calls are coming from malloc(..., M_ZERO). the passed sizes are almost always known at compilation time, but that information is lost when passed to the C routine. malloc can be converted into a macro which branches on flags being __builtin_constant_p and containg M_ZERO. If so, it does not pass that flag into malloc and memsets in place. This will allow the compiler to optimize the stores.
in general most common uses of the routines have to be tracked down and possibly eliminated in the first place. in particular there is an amazingly avoidable memcpy during stat.
copyin, copyout & friends should probably get size-based variants to avoid excessive branching
the use of bcmp in the namecache can be optimized by guaranteeing all the passed buffers are a multiple of 8 in size, which in turn can use word-sized comparisons all the way without initial branching on size. a dedicated inline routine can be added for this purpose. Additionally vast majority of names are shorter than 8 and your proposed change makes it slower than necessary by adding a forward jump. If not going for the full optimisation, at the very least the namecache should have a dedicated routine which just always does byte-sized compares.
while ERMS is desirable, turns out the current ifunc support results in indirect jumps being generated. That's pessimal for AMD and reduces the win from employing the bit in the first place. I don't know what the long term fix is here. We can get away without ERMS decently enough if the combos mentioned earlier are avoided.

In other words the general state of these primitives is rather poor and requires substantial amount of work which waits for someone to take. The proposed bcmp change is ok-ish for me, but if going for anything remotely fancy (like here) I would strongly prefer the usage to get fixed first (i.e..: use __builtin_memcmp and in effect eliminate several calls to begin with and exclude the namecache) and only then have measurements performed.

What's your take on this, are you interested in pushing this all the way through?

I have a straw-man patch that does some of what mjg is talking about. It doesn't remove the b* functions though. It's not quite ready for sharing, though if there's interest i'll polish up enough to publish.

In D15483#326763, @mjg wrote:

First a minor note is that I took the liberty of s/bzero/bcmp, which I presume was intended.

It was. Thanks!

It is unclear to me how much you are willing to invest into the problem.

Yes, that's unclear to me, too. :-)

However, I'm willing to invest some time in the problem. It seems to me that these routines are called often enough that it is worth optimizing them. And, I don't mind reading/writing/pondering assembly.

In general the rep & compatible stuff benefits from being avoided for small sizes. As in, there is a significant room for improvement in memset, memcpy and others. From basic tests, even with ERMS present and utilized, it is way faster to loop over sizes like 32 or 64 bytes instead of spinning up rep stosb. Are you interested in fighting off the remaining primitives?

Yes.

Part of the suggestion is to replace bzero with builtin_memset, bcopy with builtin_memmove and add relevant builtins for the rest (memmove, memset, memcpy). imp has a partial patch for that afair.

Part of why I left my original bcmp commit in the trivial state is that bcmp itself can be replaced with __builtin_memcmp.

Yes, and that may be best. However, memcmp needs to keep state that bcmp doesn't. For small sizes, I would hope that the compiler would figure out we only want a boolean answer and optimize out the work to track which byte diverged. For larger sizes, keeping that extra state may be cheap enough that it is irrelevant. However, I think we need to prove those things (unless someone already has).

all the primitives use __builtin variants (memset, memmove, memcpy). also bzero -> memset, bcopy -> memmove, bcmp -> memcmp. their assembly variants get removed.

there are nice wins from avoiding the rep stosq/movsq -> rep stosb/movsb combo. i.e. testing for any trailing bytes left is faster than spinning up rep stosb and provides next to no slodown for the case where we got some.

as seen from the bufsizes.txt file, optimisations for at least multiple-of-8 are most welcome, other sizes are probably fine too. so in particular the proposed change to bcmp in its spirit is fine, but also should be replicated to others

the majority of bzero (later memset) calls are coming from malloc(..., M_ZERO). the passed sizes are almost always known at compilation time, but that information is lost when passed to the C routine. malloc can be converted into a macro which branches on flags being __builtin_constant_p and containg M_ZERO. If so, it does not pass that flag into malloc and memsets in place. This will allow the compiler to optimize the stores.

In general, I'm willing to tackle the above items (with the caveat that I'm not going to duplicate @imp's work, and I want to test bcmp->memcmp before committing to it).

in general most common uses of the routines have to be tracked down and possibly eliminated in the first place. in particular there is an amazingly avoidable memcpy during stat.

copyin, copyout & friends should probably get size-based variants to avoid excessive branching

the use of bcmp in the namecache can be optimized by guaranteeing all the passed buffers are a multiple of 8 in size, which in turn can use word-sized comparisons all the way without initial branching on size. a dedicated inline routine can be added for this purpose. Additionally vast majority of names are shorter than 8 and your proposed change makes it slower than necessary by adding a forward jump. If not going for the full optimisation, at the very least the namecache should have a dedicated routine which just always does byte-sized compares.

I don't think I'm willing to commit to the above three items yet. (I may have time for them, but don't want to commit to it.)

while ERMS is desirable, turns out the current ifunc support results in indirect jumps being generated. That's pessimal for AMD and reduces the win from employing the bit in the first place. I don't know what the long term fix is here. We can get away without ERMS decently enough if the combos mentioned earlier are avoided.

Right. Let's work on fixing the general case first. We might be able to do this with some fancy boot-time rewriting, but let's fix the general case first. :-)

In other words the general state of these primitives is rather poor and requires substantial amount of work which waits for someone to take. The proposed bcmp change is ok-ish for me, but if going for anything remotely fancy (like here) I would strongly prefer the usage to get fixed first (i.e..: use __builtin_memcmp and in effect eliminate several calls to begin with and exclude the namecache) and only then have measurements performed.

What's your take on this, are you interested in pushing this all the way through?

See above.

I'll work with @imp to see where we can get.

So a lot of the previously mentioned list was taken care of. As a side effect bcmp was removed in favor of memcmp. The current kernel implementation is c-based and the asm variant in libc suffers all the same problems original bcopy did. This patch can be updated to make bcopy act as memcpy and thus be both a viable replacement for libc and the kernel. If you have no time/interest in doing it, I can take care of it. I definitely would like to see this done in time for 12.0

Note that it is somewhat of a pessimization for kernel: the most frequent consumer is the vfs namecache which passes the size of 3, which now starts with a jump forward. I would argue it would be best to leave the short case in front for both kernel and userspace.

The routine was essentially rewritten in another way last year. There are still ways to improve it but they go far beyond the scope of this patch. Can we close this revision? (cleaning up my phab)

jtl abandoned this revision.Sep 3 2019, 6:15 PM

Revision Contents
Changeset List

Path

Size

sys/

amd64/

support.S

71 lines

Diff 42709

View Options

More bcmp "optimization"AbandonedPublicActions