Page MenuHomeFreeBSD

strcpy optimization for PowerPC64
Needs ReviewPublic

Authored by luporl on May 9 2018, 3:44 PM.

Details

Summary

Assembly optimization of strcpy for PowerPC64, using double words instead of bytes to copy strings.

Performance gain with different string size:

String size (bytes)size <= 816 <= size <= 3264 <= size <= 128256 <= size <= 5121024 <= size <= 2048
Gain rate-0,80%0,01%5,26%18,69%45,72%
Gain rate (25/02/19)0,11%1,8%6,66%20,37%47,15%

Diff Detail

Lint
Lint OK
Unit
No Unit Test Coverage
Build Status
Buildable 25690
Build 24279: arc lint + arc unit

Event Timeline

leonardo.bianconi_eldorado.org.br edited the summary of this revision. (Show Details)

Hi Leonardo,

Can't we use the registers as %rX instead of X? I think Alexandre and the other files use it as %rX

Hi Leonardo,
Can't we use the registers as %rX instead of X? I think Alexandre and the other files use it as %rX

Agreed, the existing files all use %rX. Other than that, this looks okay.

Using registers as %r as requested.

gromero_br.ibm.com requested changes to this revision.EditedJan 16 2019, 2:33 PM

Hi Leo,

I took a first look at this change and I have the following comments:

  • Spill of r3 to the red zone can be avoid if you stash it to another unused volatile register, like r8
  • xor for cleaning the r0 after using it is moot for me, even in the light of any security concern. Caller has total control over dst and scr pointers / contents. Also, r0 is a volatile
  • On the instruction addi 3,3,-8, please add a comment telling it's a preparation for the stdu in .Lcopy_dw_loop
  • Byte numbering in the zero checking sequence seems odd to me. I understand you decided to use the MSB0 notation, like used by the Power ISA, so how about starting from 0, like using .Lfound_on_byte0 instead of _byte1, etc?
  • Currently, the code won't work properly on LE and copy will be truncated because on LE it will check for zero first at the highest char after a 'ld', finding a zero immediately. This is not a big issue atm, as we discussed on IRC and following comments by @jhibbits too. We could use some compile guard to refuse or warn on LE case. That said it looks adding a LE version is not that hard. so I sketched a change that looks to run fine on LE, in case you want to try it. Full source here: http://paste.ubuntu.com/p/6dkhjjPcGm/ , diff here: http://paste.ubuntu.com/p/jmgYcJj3fS/
  • On performance: I think the regression for <=8 must be worked out. Some solution might be right on the corner... Is the root cause understood?
  • A question: did you perform any analysis using pipestat? I'm asking because register allocation seems a bit poor general (note the amount of instructions in sequence using only r6), not sure if it could incur in some pipeline bubble, etc.
  • A nit: the following comment might be improved

/* Treatment of each byte where the zero my be found, saving the correct register data. */

not just s/my/might/ but maybe something like:

/* Copy the correct bytes from register to destination accordingly to where the zero byte was found */

This revision now requires changes to proceed.Jan 16 2019, 2:33 PM

Added improvements based on Gustavo Romero review.

Hi Gustavo, thank you for you review. Considering your review and some additional changes, I improved the performance and now, at least, the performance is the same as the original solution.
Explaining: In the beginning of the code, it always tested if the address was aligned, which made the performance worst for short strings. I removed that check, assuming it always will align the address, even if it is already aligned. This change removed two instruction, the test and the branch.
I have reviewed the change for almost all your comments, but I couldn't figure out how to use the pipestat, as FreeBSD is not supported. Do you know if the AT works in CentOS? If so, I could test the code as is and run the pipestat.

This comment has been deleted.

Added improvements based on Gustavo Romero review.

Harbormaster completed remote builds in B23209: Diff 55281.

Please, Can anyone review this?

jhibbits accepted this revision.Apr 17 2019, 2:51 PM

This looks good. Thanks for doing the work, and sorry for the delay in reviewing!

luporl added a subscriber: luporl.Apr 23 2019, 5:57 PM

I guess now this revision is waiting only for @gromero_br.ibm.com approval, right?

I guess now this revision is waiting only for @gromero_br.ibm.com approval, right?

Yes. @gromero_br.ibm.com do you have any more objections?

I guess now this revision is waiting only for @gromero_br.ibm.com approval, right?

Yes. @gromero_br.ibm.com do you have any more objections?

Sorry guys for being the bottleneck here :) I thought it would land automatically after last review from Justin.

Thanks Leo for addressing the regression for the small sizes and the unnecessary spill.

Looks good.

This revision is now accepted and ready to land.Apr 23 2019, 8:28 PM

I need to think about this more. According to the PowerISA reference, 'cmpb' is 'new' with PowerISA 2.05, which means this will break the PowerPC 970 and POWER5 and earlier. I already broke it last night with the commit of the strcmp optimization, so don't want to break it further now, until we better reason this.

luporl commandeered this revision.Aug 5 2019, 1:57 PM

Taking over to add ifunc support, which will make it possible to have the optimized strcpy version and avoid breaking POWER5 and earlier.

luporl updated this revision to Diff 60482.Aug 5 2019, 5:46 PM
  • Use ifunc to choose best implementation based on running system
This revision now requires review to proceed.Aug 5 2019, 5:46 PM