Details

Reviewers

jhibbits
nwhitehorn
breno.leitao_gmail.com
gromero_br.ibm.com
leonardo.bianconi_eldorado.org.br

Commits

rS356765: [PPC64] strcpy optimization

Summary

Assembly optimization of strcpy for PowerPC64, using double words instead of bytes to copy strings.

Performance gain with different string size:

String size (bytes)	size <= 8	16 <= size <= 32	64 <= size <= 128	256 <= size <= 512	1024 <= size <= 2048
Gain rate	-0,80%	0,01%	5,26%	18,69%	45,72%
Gain rate (25/02/19)	0,11%	1,8%	6,66%	20,37%	47,15%

Diff Detail

Repository

rS FreeBSD src repository - subversion

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

leonardo.bianconi_eldorado.org.br created this revision.May 9 2018, 3:44 PM

Herald added a subscriber: lffpires_ruabrasil.org. · View Herald TranscriptMay 9 2018, 3:44 PM

Harbormaster completed remote builds in B16496: Diff 42318.May 9 2018, 3:44 PM

leonardo.bianconi_eldorado.org.br edited the summary of this revision. (Show Details)May 9 2018, 4:03 PM

leonardo.bianconi_eldorado.org.br edited the summary of this revision. (Show Details)

leonardo.bianconi_eldorado.org.br added reviewers: jhibbits, nwhitehorn, breno.leitao_gmail.com.May 9 2018, 4:10 PM

leonardo.bianconi_eldorado.org.br added a project: PowerPC.

leonardo.bianconi_eldorado.org.br added subscribers: alexandre.yamashita_eldorado.org.br, breno.leitao_gmail.com, leandro.lupori_gmail.com.

Hi Leonardo,

Can't we use the registers as %rX instead of X? I think Alexandre and the other files use it as %rX

In D15368#324451, @breno.leitao_gmail.com wrote:

Hi Leonardo,

Can't we use the registers as %rX instead of X? I think Alexandre and the other files use it as %rX

Agreed, the existing files all use %rX. Other than that, this looks okay.

alfredo added a subscriber: alfredo.Jan 14 2019, 2:54 PM

Using registers as %r as requested.

Harbormaster completed remote builds in B21980: Diff 52888.Jan 16 2019, 11:39 AM

Hi Leo,

I took a first look at this change and I have the following comments:

Spill of r3 to the red zone can be avoid if you stash it to another unused volatile register, like r8
xor for cleaning the r0 after using it is moot for me, even in the light of any security concern. Caller has total control over dst and scr pointers / contents. Also, r0 is a volatile
On the instruction addi 3,3,-8, please add a comment telling it's a preparation for the stdu in .Lcopy_dw_loop
Byte numbering in the zero checking sequence seems odd to me. I understand you decided to use the MSB0 notation, like used by the Power ISA, so how about starting from 0, like using .Lfound_on_byte0 instead of _byte1, etc?
Currently, the code won't work properly on LE and copy will be truncated because on LE it will check for zero first at the highest char after a 'ld', finding a zero immediately. This is not a big issue atm, as we discussed on IRC and following comments by @jhibbits too. We could use some compile guard to refuse or warn on LE case. That said it looks adding a LE version is not that hard. so I sketched a change that looks to run fine on LE, in case you want to try it. Full source here: http://paste.ubuntu.com/p/6dkhjjPcGm/ , diff here: http://paste.ubuntu.com/p/jmgYcJj3fS/
On performance: I think the regression for <=8 must be worked out. Some solution might be right on the corner... Is the root cause understood?
A question: did you perform any analysis using pipestat? I'm asking because register allocation seems a bit poor general (note the amount of instructions in sequence using only r6), not sure if it could incur in some pipeline bubble, etc.
A nit: the following comment might be improved

/* Treatment of each byte where the zero my be found, saving the correct register data. */

not just s/my/might/ but maybe something like:

/* Copy the correct bytes from register to destination accordingly to where the zero byte was found */

This revision now requires changes to proceed.Jan 16 2019, 2:33 PM

Added improvements based on Gustavo Romero review.

Harbormaster completed remote builds in B22727: Diff 54364.Feb 25 2019, 5:56 PM

Hi Gustavo, thank you for you review. Considering your review and some additional changes, I improved the performance and now, at least, the performance is the same as the original solution.
Explaining: In the beginning of the code, it always tested if the address was aligned, which made the performance worst for short strings. I removed that check, assuming it always will align the address, even if it is already aligned. This change removed two instruction, the test and the branch.
I have reviewed the change for almost all your comments, but I couldn't figure out how to use the pipestat, as FreeBSD is not supported. Do you know if the AT works in CentOS? If so, I could test the code as is and run the pipestat.

leonardo.bianconi_eldorado.org.br edited the summary of this revision. (Show Details)Feb 25 2019, 6:10 PM

This comment has been deleted.

Harbormaster completed remote builds in B23207: Diff 55279.Mar 20 2019, 1:11 PM

Added improvements based on Gustavo Romero review.

Harbormaster completed remote builds in B23209: Diff 55281.Mar 20 2019, 1:20 PM

Harbormaster completed remote builds in B23209: Diff 55281.

Please, Can anyone review this?

This looks good. Thanks for doing the work, and sorry for the delay in reviewing!

I guess now this revision is waiting only for @gromero_br.ibm.com approval, right?

In D15368#430486, @luporl wrote:

I guess now this revision is waiting only for @gromero_br.ibm.com approval, right?

Yes. @gromero_br.ibm.com do you have any more objections?

In D15368#430490, @jhibbits wrote:

In D15368#430486, @luporl wrote:

I guess now this revision is waiting only for @gromero_br.ibm.com approval, right?

Yes. @gromero_br.ibm.com do you have any more objections?

Sorry guys for being the bottleneck here :) I thought it would land automatically after last review from Justin.

Thanks Leo for addressing the regression for the small sizes and the unnecessary spill.

Looks good.

gromero_br.ibm.com accepted this revision.Apr 23 2019, 8:28 PM

This revision is now accepted and ready to land.Apr 23 2019, 8:28 PM

I need to think about this more. According to the PowerISA reference, 'cmpb' is 'new' with PowerISA 2.05, which means this will break the PowerPC 970 and POWER5 and earlier. I already broke it last night with the commit of the strcmp optimization, so don't want to break it further now, until we better reason this.