
memcpy optimization for powerpc64
Needs Review · Public

Authored by luporl on Apr 17 2018, 6:29 PM.



This is an attempt at optimizing memcpy/memmove/bcopy for powerpc64.

For copies shorter than 512 bytes, it just copies the data using plain ld/std instructions.
For >=512 bytes, the copy is done in 3 phases:

  • Phase 1: copy from the src buffer until it's aligned at a 16-byte boundary
  • Phase 2: copy as many aligned 64-byte blocks from the src buffer as possible
  • Phase 3: copy the remaining data, if any

In phase 2, this code uses VSX instructions when available. Otherwise, it uses ldx/stdx.
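For illustration, a minimal C sketch of the three phases described above (the actual implementation is in assembly; `memcpy()` in phase 2 stands in for the VSX/ldx block copy, and all names here are hypothetical):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical C model of the 3-phase copy; the real code is asm. */
static void *
copy3(void *dst, const void *src, size_t len)
{
	uint8_t *d = dst;
	const uint8_t *s = src;

	/* Phase 1: copy byte-wise until src is 16-byte aligned. */
	while (len > 0 && ((uintptr_t)s & 15) != 0) {
		*d++ = *s++;
		len--;
	}
	/* Phase 2: copy aligned 64-byte blocks (VSX or ldx/stdx in asm). */
	while (len >= 64) {
		memcpy(d, s, 64);
		d += 64;
		s += 64;
		len -= 64;
	}
	/* Phase 3: copy the remaining tail, if any. */
	while (len-- > 0)
		*d++ = *s++;
	return (dst);
}
```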

Currently the check for VSX support is being done inside the implementation, but this should be done using ifunc's once they become available for powerpc64.

Some numbers comparing the base (plain C) implementation and this optimization (updated on 20/03/2019):


These numbers show the averages of the run time gain percentage, compared to the base C implementation, for several combinations of source/destination buffer alignments, copy directions (forward/backward), overlapping/non-overlapping buffers and buffer sizes.

For buffer sizes < 512 bytes, as expected, there's no significant difference between this implementation and the base one.

Diff Detail

Lint OK
No Unit Test Coverage
Build Status
Buildable 25746
Build 24326: arc lint + arc unit

Event Timeline
edited the summary of this revision. Apr 17 2018, 6:44 PM
edited the summary of this revision. Apr 17 2018, 6:50 PM

This looks great, with one nit about the way you determine whether VSX is available.


It might be easier/better to do this from C by either trying to execute an instruction and trapping, looking at AT_HWCAP, or checking the sysctl hw.cpu_features. This should be more robust and future-proof.
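A hedged sketch of the AT_HWCAP approach: the helper name is hypothetical, the HWCAP bit value is an assumption taken from the usual PowerPC hwcap definitions, and the FreeBSD path uses elf_aux_info(); getauxval() is shown only as the Linux analogue for illustration.

```c
#include <stdbool.h>
#include <sys/auxv.h>

/* Assumed value of the VSX hwcap bit; verify against <machine/cpu.h>. */
#ifndef PPC_FEATURE_HAS_VSX
#define PPC_FEATURE_HAS_VSX 0x00000080UL
#endif

/* Hypothetical helper: query the aux vector for VSX support. */
static bool
cpu_has_vsx(void)
{
	unsigned long hwcap = 0;

#ifdef __FreeBSD__
	if (elf_aux_info(AT_HWCAP, &hwcap, sizeof(hwcap)) != 0)
		return (false);
#else
	hwcap = getauxval(AT_HWCAP);	/* Linux analogue, illustration only */
#endif
	return ((hwcap & PPC_FEATURE_HAS_VSX) != 0);
}
```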

andreast added inline comments.

A nit: please avoid C99 comment style. Use /* comment */ instead; that would be consistent with the comments in the other asm files in libc.


Can %cr0 be omitted here and on the other cases?

PowerISA says at C.2.3 Branch Mnemonics Incorporating Conditions

If the CR field being tested is CR Field 0, this operand
need not be specified unless the resulting basic mnemonic is bclr[l] or bcctr[l] and the BH operand is specified.


What does it do exactly? Looking at the code, r6 contains a double word value of the destination, as it was loaded above:

ld        %r6, 0(%r4)

Are you ANDing this value with the length? I was not able to understand this part.


I finally understood it. You are masking out the number of bytes (up to 16) that need to be copied later, in the next loop. Makes sense.

Moved VSX detection to C code - now it should be simpler to migrate to ifunc's in the future.
Changed to /* */-style comments.
Removed unnecessary references to %cr0.

luporl marked 3 inline comments as done. Apr 18 2018, 6:43 PM
65 ↗(On Diff #41630)

@nwhitehorn, I just realized I can't call sysctlbyname() here, because it relies on bcopy/memcpy/memmove.
Do you have any pointers on how to safely get AT_HWCAP or on how to use the exec/trap option you mentioned?


jhibbits added inline comments.Apr 18 2018, 7:50 PM
65 ↗(On Diff #41630)

It looks like this needs to be added in the kernel. The kernel automatically exports AT_HWCAP if it exists in the sysentvec struct (see sys/powerpc/powerpc/elf64_machdep.c and sys/kern/imgact_elf.c), so if we simply add it to the structure, the way ARM does, that should be sufficient, and you can use elf_aux_info() to retrieve the capability.

nwhitehorn added inline comments.Apr 18 2018, 7:53 PM
65 ↗(On Diff #41630)

Can't you just temporarily set bcopy_has_vsx = 0 until this if block finishes?
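A minimal sketch of this suggestion, with probe_vsx() as a hypothetical stand-in for the real detection: the flag is forced to 0 (plain-copy path) while the probe runs, so any bcopy()/memcpy() the probe itself triggers cannot recurse into the VSX path.

```c
/* Hypothetical names; the real detection lives in libc. */
static int bcopy_has_vsx = 0;

static int
probe_vsx(void)
{
	/* Placeholder: real code would query the CPU / aux vector. */
	return (0);
}

static void
bcopy_init(void)
{
	bcopy_has_vsx = 0;		/* fallback path while detecting */
	bcopy_has_vsx = probe_vsx();	/* probe may call bcopy() safely */
}
```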

Use elf_aux_info() for VSX check and set bcopy_has_vsx during the check to avoid infinite loop
For this to work, I had to make changes to the stack guard setup code - I'll post those in a separate review.

luporl marked 3 inline comments as done and added inline comments. Apr 23 2018, 5:27 PM
65 ↗(On Diff #41630)

Great idea! After doing that, I was still getting an abort signal, but the root cause was in a completely different area - the stack protection initialization in libc. I'll post a candidate fix for that in a separate review.

65 ↗(On Diff #41630)

Thanks for adding that to the kernel! In addition to setting bcopy_has_vsx to 0, as @nwhitehorn suggested, and fixing the stack protection initialization, I'm now using elf_aux_info() to get AT_HWCAP.

luporl marked 2 inline comments as done. Apr 24 2018, 6:57 PM

The fix for the stack protection issue was merged to main (D15173), so now this diff is ready to be considered for merging as well.

luporl marked 2 inline comments as done. Apr 25 2018, 5:57 PM

Fixing diff in revision

  • memcpy optimization for powerpc64
  • moved vsx detection to C code (#3)
  • added license
  • set bcopy_has_vsx to 0 during vsx test and use elf_aux_info()

Overall I like the optimization here. I think memmove()/memcpy() are used often for relatively large block copies, so I really like the savings it offers. Just a few comments to address.

45–46 ↗(On Diff #42309)

These should be combined into '#elif defined()' instead; it reads a little cleaner. The same applies in the other places in this file and in the other file as well.
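For illustration, the suggested '#elif defined()' form could look like this (the macro names here are hypothetical, not taken from the diff):

```c
/* Suggested style: a single chain instead of nested #else + #if blocks. */
#if defined(COPY_USE_VSX)
#define PHASE2_IMPL "vsx"
#elif defined(COPY_USE_ALTIVEC)
#define PHASE2_IMPL "altivec"
#else
#define PHASE2_IMPL "plain"
#endif
```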

82–87 ↗(On Diff #42309)

Why not merge these into just a single call to memmove_vsx()? The only difference between bcopy() and memmove() is the order of src/dst.

82–87 ↗(On Diff #42309)

Makes sense. I suggest merging only memcpy+memmove (whose implementations are exactly the same), and still leaving bcopy separate, though. The reason for keeping bcopy separate is that, once we have ifunc support for powerpc64, we'll be ready to just replace the functions in this file with the resolver functions for memcpy, memmove and bcopy. What do you think?

jhibbits added inline comments.Jul 5 2018, 1:40 AM
82–87 ↗(On Diff #42309)

I meant merge memmove_vsx and bcopy_vsx into memmove_vsx, with bcopy simply calling memmove_vsx() with the swapped arguments. Same below with memmove_plain/bcopy_plain. If memcpy_vsx is identical to memmove_vsx, then they could be merged as well, to simplify the code path even further.
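A sketch of the suggested merge, using hypothetical my_memmove()/my_bcopy() names, with the real libc memmove() standing in for the memmove_vsx()/memmove_plain() workers:

```c
#include <stddef.h>
#include <string.h>

/* One memmove-style worker (stand-in for memmove_vsx/memmove_plain). */
static void *
my_memmove(void *dst, const void *src, size_t len)
{
	return (memmove(dst, src, len));
}

/* bcopy-style wrapper: identical work, just swapped argument order. */
static void
my_bcopy(const void *src, void *dst, size_t len)
{
	(void)my_memmove(dst, src, len);
}
```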

82–87 ↗(On Diff #42309)

I was analyzing this, and it's possible to simplify by removing memcpy.S, memcpy_vsx.S, memmove.S, memmove_vsx.S, ppc64_memcpy.c and ppc64_memmove.c, leaving only the functions bcopy_vsx and bcopy_plain for all cases (memcpy, memmove, bcopy). @jhibbits, was that your intention?
Please let me know if I'm missing something here and whether the other definitions would be used in the future.

jhibbits added inline comments.Mar 1 2019, 4:46 PM
82–87 ↗(On Diff #42309)

Yeah, that's more what I was thinking.

Bear in mind that memcpy() is intended to be fast while memmove() is intended to be "correct", so we might want to keep memcpy() as separate from bcopy()/memmove() (which perform the exact same function as each other, with swapped arguments).

I did just check the man page for memcpy(3), which shows that it's implemented in terms of bcopy(), so I guess it's not a major concern. However, it'd be interesting to see if there's a performance advantage to ignoring overlap.
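The distinction can be sketched as follows; this is a generic illustration of the overlap handling memmove() must do and memcpy() may skip, not the code under review:

```c
#include <stddef.h>

/* memmove() must pick a copy direction when buffers overlap;
 * memcpy() may assume no overlap and always copy forward. */
static void *
sketch_memmove(void *dst, const void *src, size_t n)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	if (d < s) {
		while (n--)
			*d++ = *s++;	/* forward copy is safe */
	} else {
		d += n;
		s += n;
		while (n--)
			*--d = *--s;	/* backward copy avoids clobbering */
	}
	return (dst);
}
```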

Implement memcpy focused on performance

memcpy differs from memmove and bcopy in that it does not check whether
the source and destination buffers overlap, reducing code in order to
focus on performance. The buffers passed to it must not overlap.

82–87 ↗(On Diff #42309)

I have simplified the path related to memmove, as it is the same as bcopy's, but we still need to have three cases, since memcpy and memmove differ from bcopy in that they return a value.

Can anyone review this, please?

Hm, this is close, but I think it might read nicer if the *_vsx.S files contained the VSX implementations of "phase 2", while the plain files contained the fallback, something like:

#ifndef PHASE_2
#define PHASE_2 \

Then we can add other specializations for whatever other strange ppc64 systems we feel like optimizing for, too (who knows, maybe we'd eventually run on BlueGene/Q).
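A hypothetical C-level expansion of that pattern (the names are illustrative and the real files are assembly): a specialized file, such as the VSX one, defines PHASE_2 before including the shared body; otherwise the plain fallback is selected.

```c
/* Shared body: a specialized includer may have defined PHASE_2 first. */
#ifndef PHASE_2
#define PHASE_2 phase2_plain
#endif

/* Plain fallback implementation of "phase 2". */
static void
phase2_plain(const char **impl)
{
	*impl = "plain";
}

static void
run_phase2(const char **impl)
{
	PHASE_2(impl);	/* resolves to the selected implementation */
}
```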

jhibbits requested changes to this revision.Apr 23 2019, 2:56 AM

Please see previous comment. Though it looks fine as-is, I think it'd be better the alternate way. Thoughts?

This revision now requires changes to proceed.Apr 23 2019, 2:56 AM
luporl commandeered this revision.Aug 5 2019, 5:57 PM
luporl added a reviewer:
luporl added a subscriber: luporl.

Commandeering to address issues and use ifunc to decide whether VSX should be used.

luporl marked 8 inline comments as done.Aug 5 2019, 6:10 PM
luporl updated this revision to Diff 60579.Aug 8 2019, 1:47 PM

[PPC64] Optimize bcopy/memcpy/memmove

  • Added ifunc support
  • Moved vsx code to _vsx.S files
  • Fixes and refactoring

@jhibbits, this last change addresses your last comment (moving VSX code to _vsx.S files) and also adds ifunc support.

Preliminary tests indicate that a world built with these optimizations continues to work correctly and stably.

Looks fine overall, just the one question. I just want to be sure this is thoroughly tested before it goes in, on the corner cases.


Since you're not accounting for alignment, have you tested misaligned addresses and page-boundary crossing? Some CPUs can't handle cross-page loads nicely.

luporl added inline comments.Aug 12 2019, 2:07 PM

I've tested misaligned addresses and page boundary crossing on POWER8 VMs only.

It would be easy to add code to align the source buffer and avoid cross-page loads, but what about cross-page stores? Wouldn't they present issues on some CPUs too? It probably wouldn't be feasible to align both source and destination.