One thing I am wondering is whether newer compilers have intrinsic heuristics for common lib calls like this. Do you see the same performance delta with gcc8 from ports?

Improve optimization

Harbormaster completed remote builds in B16459: Diff 42227. May 7 2018, 12:19 PM

In D15220#320868, @kbowling wrote:

One thing I am wondering is whether newer compilers have intrinsic heuristics for common lib calls like this. Do you see the same performance delta with gcc8 from ports?

gcc8 does improve the performance of the library.
However, this optimization still performs better than the current C implementation.
I improved the previous code and re-ran the tests using gcc8.
These are the gains obtained:

String size (Bytes)	<= 8	16	32	64	128	256	512	1024	2048	4096
Gain rate	-0.27 %	1.46 %	2.49 %	4.20 %	6.57 %	12.75 %	21.76 %	34.71 %	49.47 %	62.09 %

I am implementing a vectorized version of this code and will send it later.
Thanks, @kbowling.

My question was a little more naive, as I don't know a lot about industrial compilers. I wondered whether the compiler/rtlib hosted its own builtins for things like this and how that worked. After digging through llvm, I think the answer is no. llvm was a bit easier to digest than gcc. With what it calls builtins, it will simplify library calls and can do other things like inline them using machine-independent (MI) and machine-dependent (MD) information, but I believe it falls back to the OS libc, as here, for the machine-dependent implementation.
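
As an illustration of what "simplify library calls" can mean in practice, here is a small C sketch. The function names are hypothetical, and whether each call is actually folded depends on the compiler, its version, and the optimization level (the `-fno-builtin` flag in gcc and clang disables this treatment), so the comments describe what to look for in the generated assembly, not guaranteed behaviour.

```c
#include <string.h>

/* Both arguments are literals: an optimizing compiler may fold the
 * whole call into a constant and emit no call to strcmp at all. */
int cmp_constants(void)
{
    return strcmp("abc", "abd");
}

/* Comparison against the empty string: this may be reduced to a
 * test of the first byte of s. */
int is_empty(const char *s)
{
    return strcmp(s, "") == 0;
}

/* Neither argument is known at compile time: expect an ordinary
 * call into the libc strcmp, i.e. the routine this revision changes. */
int cmp_runtime(const char *a, const char *b)
{
    return strcmp(a, b);
}
```

Only the last case reaches the machine-dependent libc routine, which is consistent with the observation above that the compiler falls back to the OS libc.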

alexandre.yamashita_eldorado.org.br added a project: PowerPC. May 9 2018, 3:53 PM

Replace spaces by tabs

Harbormaster completed remote builds in B16668: Diff 42698. May 18 2018, 3:09 PM

In D15220#323794, @kbowling wrote:

My question was a little more naive, as I don't know a lot about industrial compilers. I wondered whether the compiler/rtlib hosted its own builtins for things like this and how that worked. After digging through llvm, I think the answer is no. llvm was a bit easier to digest than gcc. With what it calls builtins, it will simplify library calls and can do other things like inline them using machine-independent (MI) and machine-dependent (MD) information, but I believe it falls back to the OS libc, as here, for the machine-dependent implementation.

I implemented a vectorized version of this code, but it did not yield significant performance improvements.
I shared this version at https://github.com/PPC64/freebsd/blob/41388369cb7db55b995debd99c2a514409edd56a/lib/libc/powerpc64/string/strcmp.S .

Could you review the code in this revision, which does not use vectorization?

In D15220#326646, @alexandre.yamashita_eldorado.org.br wrote:

I implemented a vectorized version of this code, but it did not yield significant performance improvements.
I shared this version at https://github.com/PPC64/freebsd/blob/41388369cb7db55b995debd99c2a514409edd56a/lib/libc/powerpc64/string/strcmp.S .

Could you review the code in this revision, which does not use vectorization?

Hi. Could someone review my patch please?
Basically, this code loads and compares the strings byte by byte until their addresses are aligned.
After that, it loads and compares them one doubleword at a time until a \0 or a difference is found.
Finally, it computes the difference between the chars at that position.
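
The three steps above can be sketched in C roughly as follows. This is not the PowerPC assembly under review, only an illustration of the same idea; `strcmp_sketch` and `has_zero_byte` are hypothetical names. It assumes that reading a whole aligned doubleword containing the terminator is safe, which holds at the hardware level because an aligned 8-byte load cannot cross a page boundary, but which strict C does not guarantee for arbitrary objects.

```c
#include <stdint.h>
#include <string.h>

/* Nonzero if any of the 8 bytes of v is zero (classic bit trick). */
static int has_zero_byte(uint64_t v)
{
    return ((v - 0x0101010101010101ULL) & ~v & 0x8080808080808080ULL) != 0;
}

/* Illustrative sketch only; assumes over-reading inside an aligned
 * doubleword past the terminator is harmless. */
int strcmp_sketch(const char *s1, const char *s2)
{
    const unsigned char *p1 = (const unsigned char *)s1;
    const unsigned char *p2 = (const unsigned char *)s2;

    /* Step 1: compare byte by byte until both pointers are 8-byte
     * aligned. If the two pointers are misaligned differently, this
     * loop simply runs to the end of the comparison. */
    while ((((uintptr_t)p1 | (uintptr_t)p2) & 7) != 0) {
        if (*p1 != *p2 || *p1 == '\0')
            return *p1 - *p2;
        p1++;
        p2++;
    }

    /* Step 2: compare one doubleword at a time until the words
     * differ or the word from s1 contains a \0. */
    for (;;) {
        uint64_t a, b;
        memcpy(&a, p1, sizeof(a));
        memcpy(&b, p2, sizeof(b));
        if (a != b || has_zero_byte(a))
            break;
        p1 += 8;
        p2 += 8;
    }

    /* Step 3: locate the deciding byte inside that doubleword and
     * return the difference of the two chars. */
    while (*p1 == *p2 && *p1 != '\0') {
        p1++;
        p2++;
    }
    return *p1 - *p2;
}
```

For strings of 8 bytes or fewer, most of the work happens in the byte loop or in a single doubleword, which fits the near-zero change reported for that size class.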

In D15220#328396, @alexandre.yamashita_eldorado.org.br wrote:

Hi. Could someone review my patch please?
Basically, this code loads and compares the strings byte by byte until their addresses are aligned.
After that, it loads and compares them one doubleword at a time until a \0 or a difference is found.
Finally, it computes the difference between the chars at that position.

Hi Alexandre,

The code looks fine to me, but I do have one question.

Not to diminish the work you put into this, but has any effort been put into finding just how often strcmp() is run over strings longer than 64 characters, in the wild? If the vast majority of the comparisons are on "short" strings, the performance improvements seem less effective. But, if it does turn out that moderate to large strings are often compared, then this is certainly a big win.

In D15220#328403, @jhibbits wrote:

Hi Alexandre,

The code looks fine to me, but I do have one question.

Not to diminish the work you put into this, but has any effort been put into finding just how often strcmp() is run over strings longer than 64 characters, in the wild? If the vast majority of the comparisons are on "short" strings, the performance improvements seem less effective. But, if it does turn out that moderate to large strings are often compared, then this is certainly a big win.

I couldn't find a way to estimate this proportion in the wild.
The gains on short strings are expected to be smaller because there are fewer doublewords to load.
I implemented this code trying to maximize performance on both short and long strings.
In the worst case (strings of 8 bytes or fewer), we lose almost nothing in performance (-0.27 %).

jhibbits accepted this revision. Jan 13 2019, 2:42 AM

This revision is now accepted and ready to land. Jan 13 2019, 2:42 AM

alfredo added a subscriber: alfredo. Jan 14 2019, 2:54 PM

Improved strcmp performance