Page MenuHomeFreeBSD

strncpy optimization for PowerPC64

Authored by luporl on May 9 2018, 3:51 PM.



Update: strncpy.S optimized

Since PowerPC has hardware support for memory alignment, the initial session was removed. Also, the copy-by-byte session (Lstd_byte) was replaced by a faster code.

Performance gain comparing to the former strncpy.S

String size (bytes)size <= 816 <= size <= 3264 <= size <= 128256 <= size <= 5121024 <= size <= 2048
Gain rate0.17%0.31%0.62%0.37%0.22%

Performance gain comparing to the original strncpy (linked in libc):

String size (bytes)size <= 816 <= size <= 3264 <= size <= 128256 <= size <= 5121024 <= size <= 2048
Gain rate0.23%2.48%8.30%25.81%55.12%

Diff Detail

rS FreeBSD src repository - subversion
Automatic diff as part of commit; lint not applicable.
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

Optimizing strncpy.S
Since PowerPC has hardware support for memory alignment, the initial session was removed. Also, the copy-by-byte session (Lstd_byte) was replaced by a faster code.

Related to the former code, the performance gain rates are:

bytesToCopy <= 8 : 0.17%
16 <= bytesToCopy <= 32 : 0.50%
64 <= bytesToCopy <= 128 : 0.60%
256 <= bytesToCopy <= 512 : 0.33%
1024 <= bytesToCopy <= 2048 : 0.19% retitled this revision from strncpy optimization fir PowerPC64 to strncpy optimization for PowerPC64.Apr 12 2019, 12:36 PM

Can you test on misaligned buffers?

Also, this appears to be the only diff that uses raw numbers for registers, instead of using the '%r*' convention. Can you adjust to match the rest of the style?

This revision now requires changes to proceed.Apr 17 2019, 5:02 PM

I think this, and the other related string optimizations, needs revisited regarding alignment. I tested this on a e5500 based system, and glib segfaulted constantly, due to string issues. The problem is that misaligned accesses can cross a page boundary on the last byte(s). If the following page is unmapped, a fault occurs, even if the copy is valid.

  • Use %r* instead of raw numbers for registers

Changed register convention.
Still need to investigate and fix the misaligned access issue.

  • Handled misalignment issues
  • Improved implementation
  • Now using memset() for faster zeroing of dst buffer

I consider the strncpy implementation complete now.
Adding ifunc to select the optimized implementation, however, is still not possible, as rtld doesn't support it yet.

Have you checked performance on this latest version?

Have you checked performance on this latest version?

Yes, I have. It is practically the same as that of the last version, except for cases where 'n' is bigger than the string length. In this case there is a big improvement, as the "zero the rest of the buffer" part was being done byte by byte and now memset() is used instead.

101 ↗(On Diff #59760)

Wouldn't we want to zero the remaining if 0 is found earlier?

101 ↗(On Diff #59760)

Yes, you're right, I've missed this case. I'll fix it.

  • [PPC64] strncpy - fix 'dst' not zeroed issue
  • Initial ifunc capable strncpy implementation

This last change fixes the previous "dst not always zeroed" issue and adds ifunc support to strncpy, so that the optimized version is selected on ISAs >= 2.05 while others can use the fallback implementation in C.

While it works, I've raise a few questions/issues/points to improve in the following comments.

35–36 ↗(On Diff #60328)

There should be a better way to reuse the C implementation of strncpy...

38–39 ↗(On Diff #60328)

This corresponds to the optimized assembly implementation, that is in another .o file.
I've confirmed that this works, but I think I've read somewhere that ifunc resolvers should choose between functions from the same compilation unit. Does anyone know about such limitation?

49 ↗(On Diff #60328)

Maybe strncpy() should also be static when included? Or use a reserved identifier (__strncpy_c)?

  • [PPC64] strncpy - fix rtld-libc build issue
  • [PPC64] strncpy - fix rtld crash

Avoid using ifuncs in rtld.

Although it could be possible to make rtld capable of resolving
its own ifunc relocations, this would add complexity and make
the parts of code executed before ifuncs were relocated dangerous,
as any call to an unresolved ifunc would most probably result in
a crash.

5 ↗(On Diff #60393)

This contains the common, C implementation of strncpy().
It is exported as a weak symbol, to make it possible for binaries like rtld to use the non-ifunc version of the function.
It is also made available as __strncpy(), allowing the resolver to reference it.

This revision is now accepted and ready to land.Dec 12 2019, 5:07 PM
44 ↗(On Diff #63290)

Wait, where does hwcap come from? It's static in lib/libc/gen/auxv.c.

This depends on the ifunc support code landing first, of course.

44 ↗(On Diff #63290)

Agree. In my current code for the ifunc prototype (which I need to submit real soon now), I have r3 called "cpu_features" and r4 called "cpu_features2", NOT hwcap/hwcap2.

44 ↗(On Diff #63290)

Hmm, hwcap comes from my version of the ifunc.h file, from DEFINE_UIFUNC macro.
So, in version of ifunc.h that will be commited it will be called cpu_features instead?

44 ↗(On Diff #63290)

Yeah, I went through a few iterations on it.
Here's a snapshot of what's in my "superglue" integration branch at the moment:

I'll try and get around to finally getting "RTLD WIP3" fully cleaned up and ready today, that's been on my todo list for too long.

44 ↗(On Diff #63290)

Ok, I'll change this part. The other libc optimizations may need this change too, I'll need to check.

Nice, ifunc support in RTLD is the last missing piece blocking libc optimizations and ifunc use in general. Well, that and the ELFv2 switch :)

Fix ifunc resolver parameter name.

This revision now requires review to proceed.Jan 2 2020, 12:40 PM
This revision is now accepted and ready to land.Jan 14 2020, 5:43 PM
This revision was automatically updated to reflect the committed changes.