This is an attempt at optimizing memcpy/memmove/bcopy for powerpc64.
For copies shorter than 512 bytes, it just copies the data using plain ld/std instructions.
For >=512 bytes, the copy is done in 3 phases:
- Phase 1: copy from the src buffer until it's aligned at a 16-byte boundary
- Phase 2: copy as many aligned 64-byte blocks from the src buffer as possible
- Phase 3: copy the remaining data, if any
In phase 2, this code uses VSX instructions when available. Otherwise, it uses ldx/stdx.
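The size threshold and the three phases can be sketched in portable C. This is only an illustration of the control flow, not the actual assembly: `sketch_copy` and the `SKETCH_*` constants are hypothetical names, the sketch handles only forward, non-overlapping copies, and the 64-byte `memcpy` call stands in for the VSX or ldx/stdx block copy.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_THRESHOLD 512 /* below this size, take the simple path */
#define SKETCH_ALIGN     16  /* phase 1 aligns src to this boundary */
#define SKETCH_BLOCK     64  /* phase 2 copies blocks of this size */

/* Hypothetical helper: forward-only copy of non-overlapping buffers. */
static void *sketch_copy(void *dst, const void *src, size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    if (len < SKETCH_THRESHOLD) {
        /* Short copy: the real code uses plain ld/std; bytes are used here. */
        for (; len > 0; len--)
            *d++ = *s++;
        return dst;
    }

    /* Phase 1: copy bytes until src sits on a 16-byte boundary. */
    size_t head = (SKETCH_ALIGN - ((uintptr_t)s & (SKETCH_ALIGN - 1)))
                  & (SKETCH_ALIGN - 1);
    len -= head; /* safe: head is at most 15 and len is at least 512 */
    for (; head > 0; head--)
        *d++ = *s++;

    /* Phase 2: copy as many whole 64-byte blocks as possible. */
    while (len >= SKETCH_BLOCK) {
        memcpy(d, s, SKETCH_BLOCK); /* stand-in for VSX or ldx/stdx */
        d += SKETCH_BLOCK;
        s += SKETCH_BLOCK;
        len -= SKETCH_BLOCK;
    }

    /* Phase 3: copy whatever is left (fewer than 64 bytes). */
    for (; len > 0; len--)
        *d++ = *s++;

    return dst;
}
```

Note that only the source pointer is aligned in phase 1, so the destination may still be unaligned during phase 2.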
Currently, the check for VSX support is done inside the implementation; it should be moved to ifuncs once they become available for powerpc64.
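To illustrate the difference between checking on every call and resolving the variant once, here is a rough sketch using a plain function pointer. It is not ifunc itself, only a portable stand-in for the idea. All names (`copy_vsx`, `copy_novsx`, `cpu_has_vsx`, `dispatch_copy`) are hypothetical, the feature probe is stubbed to return 0, and the sketch ignores thread safety.

```c
#include <stddef.h>
#include <string.h>

typedef void *(*copy_fn)(void *, const void *, size_t);

/* Hypothetical stand-ins for the VSX and non-VSX variants. */
static void *copy_vsx(void *dst, const void *src, size_t len)
{
    return memmove(dst, src, len);
}

static void *copy_novsx(void *dst, const void *src, size_t len)
{
    return memmove(dst, src, len);
}

/* Hypothetical probe; real code would query the CPU features.
 * Stubbed to 0 so the sketch runs anywhere. */
static int cpu_has_vsx(void)
{
    return 0;
}

static void *copy_resolve(void *dst, const void *src, size_t len);

/* Starts at the resolver and is replaced by the chosen variant. */
static copy_fn copy_impl = copy_resolve;

/* Runs on the first call only: picks a variant, then forwards the call. */
static void *copy_resolve(void *dst, const void *src, size_t len)
{
    copy_impl = cpu_has_vsx() ? copy_vsx : copy_novsx;
    return copy_impl(dst, src, len);
}

/* Entry point: after the first call, no feature check remains. */
static void *dispatch_copy(void *dst, const void *src, size_t len)
{
    return copy_impl(dst, src, len);
}
```

With ifuncs, the equivalent selection would be made by a resolver at symbol binding time instead of on the first call.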
Some numbers comparing the base (plain C) implementation and this optimization (updated on 20/03/2019):
Gain rate (VSX?) | MEMCPY 512B-64KB | MEMCPY 64KB-8MB | BCOPY 512B-64KB | BCOPY 64KB-8MB |
---|---|---|---|---|
Yes | 52% | 81% | 52% | 79% |
No | 51% | 79% | 47% | 70% |
These numbers are the average run-time gain, as a percentage relative to the base C implementation, across several combinations of source/destination buffer alignment, copy direction (forward/backward), overlapping/non-overlapping buffers, and buffer size.
For buffer sizes < 512 bytes, as expected, there's no significant difference between this implementation and the base one.