
memcpy optimization for powerpc64
Closed, Public

Authored by luporl on Apr 17 2018, 6:29 PM.

Details

Summary

This is an attempt at optimizing memcpy/memmove/bcopy for powerpc64.

For copies shorter than 512 bytes, it just copies the data using plain ld/std instructions.
For >=512 bytes, the copy is done in 3 phases:

  • Phase 1: copy from the src buffer until it's aligned at a 16-byte boundary
  • Phase 2: copy as many aligned 64-byte blocks from the src buffer as possible
  • Phase 3: copy the remaining data, if any

In phase 2, this code uses VSX instructions when available. Otherwise, it uses ldx/stdx.
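For illustration, a rough C sketch of the three-phase structure used for >=512-byte copies (the committed code is hand-written PPC64 assembly; the inner block below merely stands in for the VSX or ldx/stdx loop):

#include <stddef.h>
#include <stdint.h>

/* Illustrative sketch only -- the real implementation is PPC64 assembly. */
static void *
copy_forward_sketch(void *dst, const void *src, size_t len)
{
    uint8_t *d = dst;
    const uint8_t *s = src;

    /* Phase 1: copy bytes until src reaches a 16-byte boundary. */
    while (len > 0 && ((uintptr_t)s & 0xf) != 0) {
        *d++ = *s++;
        len--;
    }

    /* Phase 2: copy aligned 64-byte blocks; the assembly uses VSX
     * (lxvd2x/stxvd2x) here when available, ldx/stdx otherwise. */
    while (len >= 64) {
        for (int i = 0; i < 8; i++)
            ((uint64_t *)d)[i] = ((const uint64_t *)s)[i];
        d += 64;
        s += 64;
        len -= 64;
    }

    /* Phase 3: copy whatever remains. */
    while (len > 0) {
        *d++ = *s++;
        len--;
    }

    return (dst);
}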

Currently, the check for VSX support is done inside the implementation, but it should be switched to ifuncs once they become available for powerpc64.

Some numbers comparing the base (plain C) implementation and this optimization (updated on 20/03/2019):

Gain Rate    MEMCPY       MEMCPY      BCOPY        BCOPY
VSX?         512B-64KB    64KB-8MB    512B-64KB    64KB-8MB
Yes          52%          81%         52%          79%
No           51%          79%         47%          70%

These numbers show the average run-time gain, as a percentage relative to the base C implementation, across several combinations of source/destination buffer alignments, copy directions (forward/backward), overlapping/non-overlapping buffers, and buffer sizes.

For buffer sizes < 512 bytes, as expected, there's no significant difference between this implementation and the base one.


Event Timeline

This looks great, with one nit about the way you determine whether VSX is available.

lib/libc/powerpc64/string/bcopy.S
228

It might be easier/better to do this from C by either trying to execute an instruction and trapping, looking at AT_HWCAP, or checking the sysctl hw.cpu_features. This should be more robust and future-proof.

andreast added inline comments.
lib/libc/powerpc64/string/bcopy.S
38

A nit: please avoid C99 comment style; use /* comment */ instead. That would be consistent with the comments in the other asm files in libc.

lib/libc/powerpc64/string/bcopy.S
70

Can %cr0 be omitted here and on the other cases?

The Power ISA says in C.2.3, "Branch Mnemonics Incorporating Conditions":

If the CR field being tested is CR Field 0, this operand
need not be specified unless the resulting basic mnemonic is bclr[l] or bcctr[l] and the BH operand is specified.

126

What does it do exactly? Looking at the code, r6 contains a double word value of the destination, as it was loaded above:

ld        %r6, 0(%r4)

You are ANDing this value with the len size? I was not able to understand this part.

lib/libc/powerpc64/string/bcopy.S
126

I finally understood it. You are masking off the number of bytes (up to 16) that need to be copied later in the next loop. Makes sense.
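In other words, roughly (illustrative C; names are hypothetical and the mask matches the block size of the loop):

/* Split len into full 16-byte blocks and a 0-15 byte tail. */
size_t blocks = len & ~(size_t)0xf;   /* handled by the block-copy loop */
size_t tail   = len &  (size_t)0xf;   /* copied afterwards */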

Moved VSX detection to C code - now it should be simpler to migrate to ifuncs in the future.
Changed to /* */-style comments.
Removed unnecessary references to %cr0.

lib/libc/powerpc64/string/ppc64_bcopy.c
65 ↗(On Diff #41630)

@nwhitehorn, I just realized I can't call sysctlbyname() here, because it relies on bcopy/memcpy/memmove.
Do you have any pointers on how to safely get AT_HWCAP or on how to use the exec/trap option you mentioned?

Thanks,
Luis

lib/libc/powerpc64/string/ppc64_bcopy.c
65 ↗(On Diff #41630)

It looks like this needs to be added in the kernel. The kernel automatically exports AT_HWCAP if it exists in the sysentvec struct (see sys/powerpc/powerpc/elf64_machdep.c and sys/kern/imgact_elf.c), so if we simply add it to the structure, the way ARM does, that should be sufficient, and you can use elf_aux_info() to retrieve the capability.
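For reference, once AT_HWCAP is exported, the check could look roughly like this (a minimal sketch; the exact headers and error handling may differ in the final code):

#include <sys/types.h>
#include <sys/elf_common.h>     /* AT_HWCAP */
#include <sys/auxv.h>           /* elf_aux_info() */
#include <machine/cpu.h>        /* PPC_FEATURE_HAS_VSX */

/* Sketch: non-zero if the CPU advertises VSX via AT_HWCAP. */
static int
cpu_has_vsx(void)
{
    u_long hwcap = 0;

    if (elf_aux_info(AT_HWCAP, &hwcap, sizeof(hwcap)) != 0)
        return (0);     /* be conservative if the query fails */
    return ((hwcap & PPC_FEATURE_HAS_VSX) != 0);
}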

lib/libc/powerpc64/string/ppc64_bcopy.c
65 ↗(On Diff #41630)

Can't you just temporarily set bcopy_has_vsx = 0 until this if block finishes?

Use elf_aux_info() for the VSX check and set bcopy_has_vsx during the check to avoid an infinite loop.
For this to work, I had to make changes to the stack guard setup code - I'll post those in a separate review.
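In rough terms, the dispatcher described above is shaped like this (a sketch using names from this review; cpu_has_vsx() is a hypothetical stand-in for the elf_aux_info()-based check sketched earlier):

#include <stddef.h>

void    *memcpy_vsx(void *, const void *, size_t);      /* from memcpy_vsx.S */
void    *memcpy_plain(void *, const void *, size_t);    /* from memcpy.S */
int      cpu_has_vsx(void);                             /* elf_aux_info() check */

/* -1 = not probed yet, 0 = no VSX, 1 = VSX available. */
static int bcopy_has_vsx = -1;

void *
memcpy(void *dst, const void *src, size_t len)
{
    if (bcopy_has_vsx < 0) {
        /*
         * Claim "no VSX" while probing, so that anything the probe
         * itself calls (which may use memcpy) takes the plain path
         * instead of recursing back in here.
         */
        bcopy_has_vsx = 0;
        bcopy_has_vsx = cpu_has_vsx() ? 1 : 0;
    }
    return (bcopy_has_vsx ?
        memcpy_vsx(dst, src, len) : memcpy_plain(dst, src, len));
}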

lffpires_ruabrasil.org added inline comments.
lib/libc/powerpc64/string/ppc64_bcopy.c
65 ↗(On Diff #41630)

Great idea! After doing that, I was still getting an abort signal, but the root cause was in a completely different area - the stack protection initialization in libc. I'll post a candidate fix for that in a separate review.

65 ↗(On Diff #41630)

Thanks for adding that to the kernel! In addition to setting bcopy_has_vsx to 0, as @nwhitehorn suggested, and fixing the stack protection initialization, I'm now using elf_aux_info() to get AT_HWCAP.

The fix for the stack protection issue was merged to main (D15173), so now this diff is ready to be considered for merging as well.

Fixing diff in revision

  • memcpy optimization for powerpc64
  • moved vsx detection to C code (#3)
  • added license
  • set bcopy_has_vsx to 0 during vsx test and use elf_aux_info()

Overall I like the optimization here. I think memmove()/memcpy() are used often for relatively large block copies, so I really like the savings it offers. Just a few comments to address.

lib/libc/powerpc64/string/ppc64_bcopy.c
45–46 ↗(On Diff #42309)

These should be combined into '#elif defined()' instead; it reads a little cleaner. Same for the other places in this file, and in the other file as well.

82–87 ↗(On Diff #42309)

Why not merge these into just a single call to memmove_vsx()? The only difference between bcopy() and memmove() is the order of src/dst.

lib/libc/powerpc64/string/ppc64_bcopy.c
82–87 ↗(On Diff #42309)

Makes sense. I suggest merging only memcpy+memmove (whose implementations are exactly the same), and still leaving bcopy separate, though. The reason for keeping bcopy separate is that, once we have ifunc support for powerpc64, we'll be ready to just replace the functions in this file with the resolver functions for memcpy, memmove and bcopy. What do you think?

lib/libc/powerpc64/string/ppc64_bcopy.c
82–87 ↗(On Diff #42309)

I meant merge memmove_vsx and bcopy_vsx into memmove_vsx, with bcopy simply calling memmove_vsx() with the swapped arguments. Same below with memmove_plain/bcopy_plain. If memcpy_vsx is identical to memmove_vsx, then they could be merged as well, to simplify the code path even further.
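In other words, something along these lines (illustrative C; the real shims may be assembly or differ in detail):

#include <stddef.h>

void    *memmove_vsx(void *, const void *, size_t);     /* name from this review */

/*
 * bcopy() takes (src, dst) while memmove() takes (dst, src) and returns
 * dst; the wrapper just swaps the arguments and drops the return value.
 */
void
bcopy_vsx(const void *src, void *dst, size_t len)
{
    (void)memmove_vsx(dst, src, len);
}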

lib/libc/powerpc64/string/ppc64_bcopy.c
82–87 ↗(On Diff #42309)

I was analyzing this, and it is possible to simplify by removing memcpy.S, memcpy_vsx.S, memmove.S, memmove_vsx.S, ppc64_memcpy.c and ppc64_memmove.c, leaving only the functions bcopy_vsx and bcopy_plain for all cases (memcpy, memmove, bcopy). @jhibbits, was that your intention?
Please let me know if I'm missing something here, or if the other definitions would be used in the future.

lib/libc/powerpc64/string/ppc64_bcopy.c
82–87 ↗(On Diff #42309)

Yeah, that's more what I was thinking.

Bear in mind that memcpy() is intended to be fast while memmove() is intended to be "correct", so we might want to keep memcpy() as separate from bcopy()/memmove() (which perform the exact same function as each other, with swapped arguments).

I did just check the man page for memcpy(3), which shows that it's implemented in terms of bcopy(), so I guess it's not a major concern. However, it'd be interesting to see if there's a performance advantage to ignoring overlap.

Implement memcpy focused on performance

memcpy differs from memmove and bcopy in that it does not check whether the
source and destination buffers overlap, reducing code in order to focus on
performance. The buffers must not overlap.

lib/libc/powerpc64/string/ppc64_bcopy.c
82–87 ↗(On Diff #42309)

I have simplified the memmove path, as it is the same as bcopy's, but we still need three cases, since memcpy/memmove differ from bcopy in that they return a value.

Hm, this is close, but I think it might read nicer if the *_vsx.S contained the VSX implementations of "phase 2", while the plain files contained the fallback, something like:

#ifndef PHASE_2
#define PHASE_2 \
   sequence_of_ldx/stdx
#endif
...
phase1:
...
phase2:
  PHASE_2
phase3:
...

Then we can add other specializations for whatever other strange ppc64 systems we feel like optimizing for, too (who knows, maybe we'd eventually run on BlueGene/Q).

Please see previous comment. Though it looks fine as-is, I think it'd be better the alternate way. Thoughts?

This revision now requires changes to proceed. Apr 23 2019, 2:56 AM
luporl added a reviewer: lffpires_ruabrasil.org.
luporl added a subscriber: luporl.

Commandeering to address issues and use ifunc to decide whether VSX should be used.

[PPC64] Optimize bcopy/memcpy/memmove

  • Added ifunc support
  • Moved vsx code to _vsx.S files
  • Fixes and refactoring

@jhibbits, this last change addresses your last comment (moving VSX code to _vsx.S files) and also adds ifunc support.
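For context, the general shape of such a resolver, shown here with the generic compiler ifunc attribute purely for illustration (FreeBSD's libc has its own wrapper macros for ifuncs, and the symbol names below are hypothetical; the committed resolver may differ):

#include <sys/types.h>
#include <sys/elf_common.h>     /* AT_HWCAP */
#include <sys/auxv.h>           /* elf_aux_info() */
#include <machine/cpu.h>        /* PPC_FEATURE_HAS_VSX */
#include <stddef.h>

void    *__memcpy_vsx(void *, const void *, size_t);    /* hypothetical symbols */
void    *__memcpy_plain(void *, const void *, size_t);

typedef void *memcpy_fn(void *, const void *, size_t);

/* Runs once at load time; the returned implementation is bound to memcpy.
 * NB: real resolvers must be careful about what they call this early. */
static memcpy_fn *
memcpy_resolver(void)
{
    u_long hwcap = 0;

    if (elf_aux_info(AT_HWCAP, &hwcap, sizeof(hwcap)) == 0 &&
        (hwcap & PPC_FEATURE_HAS_VSX) != 0)
        return (__memcpy_vsx);
    return (__memcpy_plain);
}

void *memcpy(void *, const void *, size_t)
    __attribute__((ifunc("memcpy_resolver")));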

Preliminary tests indicate that a world built with these optimizations continues to work correctly and stably.

Looks fine overall, just the one question. I just want to be sure this is thoroughly tested before it goes in, on the corner cases.

lib/libc/powerpc64/string/bcopy.S
104

Since you're not accounting for alignment, have you tested misaligned addresses, and page boundary crossing? Some CPUs can't handle cross-page loads nicely.

lib/libc/powerpc64/string/bcopy.S
104

I've tested misaligned addresses and page boundary crossing on POWER8 VMs only.

It would be easy to add code to align the source buffer and avoid cross-page loads, but what about cross-page stores? Wouldn't they present issues on some CPUs too? It probably wouldn't be feasible to align both source and destination.

Fix ifunc resolver parameter name.

lib/libc/powerpc64/string/bcopy.S
104

Let me expand on my last comment to better explain the possible issues with this and the strcpy/strncpy optimizations.

First, an update on the tests performed so far. I've tested all optimizations on POWER8, both VM and bare metal. The tests included all combinations of misaligned source and destination (0 to 15 bytes off), with copies of 32 and 8K bytes, using 4K-aligned buffers (plus test offset), in order to trigger cross-page loads/stores and to exercise both the single-phase and multi-phase parts of the code.
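For anyone wanting to reproduce that, a simplified sketch of the test matrix for memcpy only (the actual testing described above covers more combinations):

#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define PAGE    4096

/*
 * Exercise every src/dst misalignment (0-15 bytes) with 32-byte and 8 KiB
 * copies, starting just before a page boundary so even the short copies
 * cross it.
 */
static void
test_copy_alignment(void)
{
    static const size_t sizes[] = { 32, 8192 };
    char *src_buf, *dst_buf;

    assert(posix_memalign((void **)&src_buf, PAGE, 2 * PAGE + 8192) == 0);
    assert(posix_memalign((void **)&dst_buf, PAGE, 2 * PAGE + 8192) == 0);

    for (size_t n = 0; n < sizeof(sizes) / sizeof(sizes[0]); n++)
        for (int soff = 0; soff < 16; soff++)
            for (int doff = 0; doff < 16; doff++) {
                size_t len = sizes[n];
                char *s = src_buf + PAGE - 8 + soff;
                char *d = dst_buf + PAGE - 8 + doff;

                for (size_t i = 0; i < len; i++)
                    s[i] = (char)(i ^ soff ^ doff);
                memcpy(d, s, len);
                assert(memcmp(d, s, len) == 0);
            }

    free(src_buf);
    free(dst_buf);
}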

But, for consistency with the other libc optimizations (memcpy, strcpy and strncpy), I think the source buffer should really be aligned here, which could improve performance on some CPUs.
In the strcpy case, not aligning the source buffer can also lead to reading a page that the process is not allowed to read. By dword-aligning it, we guarantee that we read no more than the dword containing the terminating '\0', so no page boundary is crossed.

However, there is a potential issue in all cases: it is not feasible to align both source and destination.
Considering that we always align the source, this means that misaligned stores may occur and may even cross a page boundary.
To avoid this, we would probably need to fall back to a byte copy after checking that source and destination can't both be aligned.
Note that performing this check every time can hurt performance, especially on small copies.
On newer CPUs, which handle misaligned/cross-page stores well, this extra code could be omitted by introducing another function or entry point that the resolver could choose in that case.

In summary, it seems the viable options for handling misaligned/cross page stores are:

  1. Simply use the non-optimized version on CPUs older than (maybe) POWER8.
  2. Introduce extra code to avoid misaligned/cross-page accesses, but skip it on newer CPUs.

In the case of the strcpy/strncpy optimizations, which already require the CPU to be >= ARCH_2.05, option 1 seems better.
For memcpy/memmove/bcopy, however, all 64-bit archs can benefit from the optimized dword copy loop, so option 2 may be better here, although for the VSX version we should probably go with option 1 as well.

What do you think?

Align src in bcopy/memmove

luporl added inline comments.
lib/libc/powerpc64/string/bcopy.S
104

As discussed on IRC, misaligned/cross-page stores may hurt performance on old CPUs, but won't cause crashes or wrong behavior.

The last diff now performs source alignment for single phase copies.

This revision is now accepted and ready to land. Jan 14 2020, 5:36 PM
This revision was automatically updated to reflect the committed changes.