Page MenuHomeFreeBSD

strncpy optimization for PowerPC64
Needs ReviewPublic

Authored by luporl on May 9 2018, 3:51 PM.

Details

Summary

Update: strncpy.S optimized

Since PowerPC has hardware support for memory alignment, the initial session was removed. Also, the copy-by-byte session (Lstd_byte) was replaced by a faster code.

Performance gain comparing to the former strncpy.S

String size (bytes)size <= 816 <= size <= 3264 <= size <= 128256 <= size <= 5121024 <= size <= 2048
Gain rate0.17%0.31%0.62%0.37%0.22%

Performance gain comparing to the original strncpy (linked in libc):

String size (bytes)size <= 816 <= size <= 3264 <= size <= 128256 <= size <= 5121024 <= size <= 2048
Gain rate0.23%2.48%8.30%25.81%55.12%

Diff Detail

Lint
Lint OK
Unit
No Unit Test Coverage
Build Status
Buildable 25644
Build 24237: arc lint + arc unit

Event Timeline

Optimizing strncpy.S
Since PowerPC has hardware support for memory alignment, the initial session was removed. Also, the copy-by-byte session (Lstd_byte) was replaced by a faster code.

Related to the former code, the performance gain rates are:

bytesToCopy <= 8 : 0.17%
16 <= bytesToCopy <= 32 : 0.50%
64 <= bytesToCopy <= 128 : 0.60%
256 <= bytesToCopy <= 512 : 0.33%
1024 <= bytesToCopy <= 2048 : 0.19%

alexandre.osorio_eldorado.org.br edited the summary of this revision. (Show Details)

Please, Can anyone review this?

leonardo.bianconi_eldorado.org.br retitled this revision from strncpy optimization fir PowerPC64 to strncpy optimization for PowerPC64.Apr 12 2019, 12:36 PM
jhibbits requested changes to this revision.Apr 17 2019, 5:02 PM

Can you test on misaligned buffers?

Also, this appears to be the only diff that uses raw numbers for registers, instead of using the '%r*' convention. Can you adjust to match the rest of the style?

This revision now requires changes to proceed.Apr 17 2019, 5:02 PM
luporl commandeered this revision.Jun 5 2019, 6:18 PM

I think this, and the other related string optimizations, needs revisited regarding alignment. I tested this on a e5500 based system, and glib segfaulted constantly, due to string issues. The problem is that misaligned accesses can cross a page boundary on the last byte(s). If the following page is unmapped, a fault occurs, even if the copy is valid.

luporl updated this revision to Diff 59538.Jul 8 2019, 4:40 PM
  • Use %r* instead of raw numbers for registers
luporl planned changes to this revision.Jul 8 2019, 4:41 PM

Changed register convention.
Still need to investigate and fix the misaligned access issue.

luporl updated this revision to Diff 59760.Jul 15 2019, 1:19 PM
  • Handled misalignment issues
  • Improved implementation
  • Now using memset() for faster zeroing of dst buffer

I consider the strncpy implementation complete now.
Adding ifunc to select the optimized implementation, however, is still not possible, as rtld doesn't support it yet.

Have you checked performance on this latest version?

Have you checked performance on this latest version?

Yes, I have. It is practically the same as that of the last version, except for cases where 'n' is bigger than the string length. In this case there is a big improvement, as the "zero the rest of the buffer" part was being done byte by byte and now memset() is used instead.

jhibbits added inline comments.Jul 24 2019, 3:39 PM
lib/libc/powerpc64/string/strncpy.S
101 ↗(On Diff #59760)

Wouldn't we want to zero the remaining if 0 is found earlier?

luporl added inline comments.Jul 24 2019, 3:56 PM
lib/libc/powerpc64/string/strncpy.S
101 ↗(On Diff #59760)

Yes, you're right, I've missed this case. I'll fix it.

luporl updated this revision to Diff 60328.Jul 31 2019, 6:37 PM
  • [PPC64] strncpy - fix 'dst' not zeroed issue
  • Initial ifunc capable strncpy implementation
luporl marked an inline comment as done.Jul 31 2019, 6:41 PM

This last change fixes the previous "dst not always zeroed" issue and adds ifunc support to strncpy, so that the optimized version is selected on ISAs >= 2.05 while others can use the fallback implementation in C.

While it works, I've raise a few questions/issues/points to improve in the following comments.

lib/libc/powerpc64/string/strncpy.c
31–32

This corresponds to the optimized assembly implementation, that is in another .o file.
I've confirmed that this works, but I think I've read somewhere that ifunc resolvers should choose between functions from the same compilation unit. Does anyone know about such limitation?

34–35

There should be a better way to reuse the C implementation of strncpy...
Suggestions?

lib/libc/string/strncpy.c
53

Maybe strncpy() should also be static when included? Or use a reserved identifier (__strncpy_c)?

luporl updated this revision to Diff 60343.Jul 31 2019, 9:31 PM
  • [PPC64] strncpy - fix rtld-libc build issue
luporl updated this revision to Diff 60393.Aug 2 2019, 4:10 PM
  • [PPC64] strncpy - fix rtld crash

Avoid using ifuncs in rtld.

Although it could be possible to make rtld capable of resolving
its own ifunc relocations, this would add complexity and make
the parts of code executed before ifuncs were relocated dangerous,
as any call to an unresolved ifunc would most probably result in
a crash.

luporl added inline comments.Aug 2 2019, 4:18 PM
lib/libc/powerpc64/string/Makefile.inc
5

This contains the common, C implementation of strncpy().
It is exported as a weak symbol, to make it possible for binaries like rtld to use the non-ifunc version of the function.
It is also made available as __strncpy(), allowing the resolver to reference it.