
memcpy optimization for powerpc64

Authored by luporl on Apr 17 2018, 6:29 PM.



This is an attempt at optimizing memcpy/memmove/bcopy for powerpc64.

For copies shorter than 512 bytes, it just copies the data using plain ld/std instructions.
For >=512 bytes, the copy is done in 3 phases:

  • Phase 1: copy from the src buffer until it's aligned at a 16-byte boundary
  • Phase 2: copy as many aligned 64-byte blocks from the src buffer as possible
  • Phase 3: copy the remaining data, if any

In phase 2, this code uses VSX instructions when available. Otherwise, it uses ldx/stdx.
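The three phases can be sketched in portable C (a simplified model only — the actual implementation is assembly, and phase 2 uses VSX `lxvd2x`/`stxvd2x` or `ldx`/`stdx` rather than `memcpy`; the helper name is hypothetical):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified C model of the 3-phase forward copy. */
static void *
copy_3phase(void *dst, const void *src, size_t len)
{
	uint8_t *d = dst;
	const uint8_t *s = src;

	/* Phase 1: copy bytes until src is aligned at a 16-byte boundary. */
	while (len > 0 && ((uintptr_t)s & 15) != 0) {
		*d++ = *s++;
		len--;
	}

	/* Phase 2: copy aligned 64-byte blocks (the assembly uses
	 * VSX or ldx/stdx here). */
	while (len >= 64) {
		memcpy(d, s, 64);	/* stands in for 4x lxvd2x/stxvd2x */
		d += 64;
		s += 64;
		len -= 64;
	}

	/* Phase 3: copy the remaining tail, if any. */
	while (len-- > 0)
		*d++ = *s++;

	return (dst);
}
```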

Currently the check for VSX support is being done inside the implementation, but this should be done using ifunc's once they become available for powerpc64.
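Until ifuncs are available, the in-implementation check amounts to a one-time runtime dispatch, which can be sketched as follows (names like `has_vsx` and `detect_vsx` are hypothetical; the real probe would inspect AT_HWCAP, and the stand-in bodies are placeholders):

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the two phase-2 strategies. */
static void *copy_vsx(void *d, const void *s, size_t n)   { return memcpy(d, s, n); }
static void *copy_plain(void *d, const void *s, size_t n) { return memcpy(d, s, n); }

/* -1 = unknown, 0 = no VSX, 1 = VSX present. */
static int has_vsx = -1;

static int
detect_vsx(void)
{
	/* The real code would query AT_HWCAP; return 0 here so the
	 * sketch stays portable. */
	return (0);
}

static void *
my_memcpy(void *dst, const void *src, size_t len)
{
	if (has_vsx < 0)
		has_vsx = detect_vsx();
	return (has_vsx ? copy_vsx(dst, src, len) : copy_plain(dst, src, len));
}
```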

Some numbers comparing the base (plain C) implementation and this optimization (updated on 20/03/2019):


These numbers show the averages of the run time gain percentage, compared to the base C implementation, for several combinations of source/destination buffer alignments, copy directions (forward/backward), overlapping/non-overlapping buffers and buffer sizes.

For buffer sizes < 512 bytes, as expected, there's no significant difference between this implementation and the base one.

Diff Detail

rS FreeBSD src repository - subversion

Event Timeline

This looks great, with one nit about the way you determine whether VSX is available.

227 ↗(On Diff #41587)

It might be easier/better to do this from C by either trying to execute an instruction and trapping, looking at AT_HWCAP, or checking the sysctl hw.cpu_features. This should be more robust and future-proof.

andreast added inline comments.
37 ↗(On Diff #41587)

A nit: please avoid C99 comment style; use /* comment */ instead. That would be consistent with the comments in the other asm files in libc.

69 ↗(On Diff #41587)

Can %cr0 be omitted here and on the other cases?

PowerISA says at C.2.3 Branch Mnemonics Incorporating Conditions

If the CR field being tested is CR Field 0, this operand need not be specified unless the resulting basic mnemonic is bclr[l] or bcctr[l] and the BH operand is specified.

125 ↗(On Diff #41587)

What does it do exactly? Looking at the code, r6 contains a double word value of the destination, as it was loaded above:

ld        %r6, 0(%r4)

You are ANDing this value with the len size? I was not able to understand this part.

125 ↗(On Diff #41587)

I finally understood it. You are masking the number of bytes (up to 16) that need to be copied later in the next loop. Makes sense.

Moved VSX detection to C code - now it should be simpler to migrate to ifunc's in the future.
Changed to /* */ - style comments.
Removed unnecessary references to %cr0.

65 ↗(On Diff #41630)

@nwhitehorn, I just realized I can't call sysctlbyname() here, because it relies on bcopy/memcpy/memmove.
Do you have any pointers on how to safely get AT_HWCAP or on how to use the exec/trap option you mentioned?


65 ↗(On Diff #41630)

It looks like this needs to be added in the kernel. The kernel automatically exports AT_HWCAP if it exists in the sysentvec struct (see sys/powerpc/powerpc/elf64_machdep.c and sys/kern/imgact_elf.c), so if we simply add it to the structure, the way ARM does, that should be sufficient, and you can use elf_aux_info() to retrieve the capability.

65 ↗(On Diff #41630)

Can't you just temporarily set bcopy_has_vsx = 0 until this if block finishes?

Use elf_aux_info() for the VSX check and set bcopy_has_vsx during the check to avoid an infinite loop.
For this to work, I had to make changes to the stack guard setup code - I'll post those in a separate review.
luporl added inline comments.
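The recursion hazard and the guard-flag fix can be sketched in C (names and the probe body are hypothetical; the point is that anything the probe calls may itself use bcopy/memcpy, so the flag is forced to the plain path while probing):

```c
#include <stddef.h>
#include <string.h>

static int bcopy_has_vsx = -1;	/* -1 = not probed yet */

static int
probe_vsx(void)
{
	/* Hypothetical probe; the real code calls elf_aux_info() to
	 * read AT_HWCAP. It may call back into bcopy/memcpy. */
	return (1);
}

static void *
my_bcopy_core(void *dst, const void *src, size_t len)
{
	if (bcopy_has_vsx < 0) {
		/* Force the plain path while probing: if the probe
		 * re-enters this function, it sees 0 and cannot
		 * recurse into the detection again. */
		bcopy_has_vsx = 0;
		bcopy_has_vsx = probe_vsx();
	}
	/* memcpy stands in for the vsx/plain dispatch. */
	return (memcpy(dst, src, len));
}
```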
65 ↗(On Diff #41630)

Great idea! After doing that, I was still getting an abort signal, but the root cause was in a completely different area - the stack protection initialization in libc. I'll post a candidate fix for that in a separate review.

65 ↗(On Diff #41630)

Thanks for adding that to the kernel! In addition to setting bcopy_has_vsx to 0, as @nwhitehorn suggested and fixing the stack protection initialization, I'm now using elf_aux_info() to get AT_HWCAP.

The fix for the stack protection issue was merged to main (D15173), so now this diff is ready to be considered for merging as well.

Fixing diff in revision

  • memcpy optimization for powerpc64
  • moved vsx detection to C code (#3)
  • added license
  • set bcopy_has_vsx to 0 during vsx test and use elf_aux_info()

Overall I like the optimization here. I think memmove()/memcpy() are used often for relatively large block copies, so I really like the savings it offers. Just a few comments to address.

45–46 ↗(On Diff #42309)

These should be combined into '#elif defined()' instead; it reads a little cleaner. Same for the other places in this file, and in the other file as well.

82–87 ↗(On Diff #42309)

Why not merge these into just a single call to memmove_vsx()? The only difference between bcopy() and memmove() is the order of src/dst.

82–87 ↗(On Diff #42309)

Makes sense. I suggest merging only memcpy+memmove (whose implementations are exactly the same), and still leaving bcopy separate, though. The reason for keeping bcopy separate is that, once we have ifunc support for powerpc64, we'll be ready to just replace the functions in this file with the resolver functions for memcpy, memmove and bcopy. What do you think?

82–87 ↗(On Diff #42309)

I meant merge memmove_vsx and bcopy_vsx into memmove_vsx, with bcopy simply calling memmove_vsx() with the arguments swapped. Same below with memmove_plain/bcopy_plain. If memcpy_vsx is identical to memmove_vsx, then they could be merged as well, to simplify the code path even further.

82–87 ↗(On Diff #42309)

I was analyzing this and it is possible to simplify by removing memcpy.S, memcpy_vsx.S, memmove.S, memmove_vsx.S, ppc64_memcpy.c and ppc64_memmove.c, leaving only the functions bcopy_vsx and bcopy_plain for all cases (memcpy, memmove, bcopy). @jhibbits, is that what you intended?
Please let me know if I'm missing something here, and if the other definitions would be used in the future.

82–87 ↗(On Diff #42309)

Yeah, that's more what I was thinking.

Bear in mind that memcpy() is intended to be fast while memmove() is intended to be "correct", so we might want to keep memcpy() as separate from bcopy()/memmove() (which perform the exact same function as each other, with swapped arguments).

I did just check the man page for memcpy(3), which shows that it's implemented in terms of bcopy(), so I guess it's not a major concern. However, it'd be interesting to see if there's a performance advantage to ignoring overlap.
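The swapped-argument merge discussed above can be sketched as follows (function names are hypothetical; the core would really be the shared memmove_vsx/memmove_plain implementation):

```c
#include <stddef.h>
#include <string.h>

/* Single core implementation; handles overlapping buffers.
 * memmove stands in for memmove_vsx/memmove_plain. */
static void *
my_memmove(void *dst, const void *src, size_t len)
{
	return (memmove(dst, src, len));
}

/* bcopy takes (src, dst) -- just swap the arguments and forward.
 * It also differs in returning void rather than dst. */
static void
my_bcopy(const void *src, void *dst, size_t len)
{
	my_memmove(dst, src, len);
}
```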

Implement memcpy focused on performance

memcpy differs from memmove and bcopy in that it does not check whether
the source and destination memory overlap, reducing code in order to
focus on performance. The memory areas must not overlap.

82–87 ↗(On Diff #42309)

I have simplified the path related to memmove, as it is the same as bcopy's, but we still need to have three cases, as the last two differ in whether they return a value.

Hm, this is close, but I think it might read nicer if the *_vsx.S files contained the VSX implementations of "phase 2", while the plain files contained the fallback, something like:

#ifndef PHASE_2
#define PHASE_2 \

Then we can add other specializations for whatever other strange ppc64 systems we feel like optimizing for, too (who knows, maybe we'd eventually run on BlueGene/Q).
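In C terms, the suggested override pattern looks like this (a hypothetical analogue — the real PHASE_2 would be an assembly macro in the .S files, with a *_vsx file defining it before including the generic body):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* A specialized file (e.g. a _vsx variant) would define PHASE_2
 * before this point; otherwise the plain fallback below is used. */
#ifndef PHASE_2
#define PHASE_2(d, s, len)				\
	do {						\
		while ((len) >= 64) {			\
			memcpy((d), (s), 64);		\
			(d) += 64;			\
			(s) += 64;			\
			(len) -= 64;			\
		}					\
	} while (0)
#endif

static void *
generic_copy(void *dst, const void *src, size_t len)
{
	uint8_t *d = dst;
	const uint8_t *s = src;

	PHASE_2(d, s, len);	/* bulk blocks: VSX or fallback */
	while (len-- > 0)	/* tail */
		*d++ = *s++;
	return (dst);
}
```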

Please see previous comment. Though it looks fine as-is, I think it'd be better the alternate way. Thoughts?

This revision now requires changes to proceed. Apr 23 2019, 2:56 AM
luporl added a reviewer:
luporl added a subscriber: luporl.

Commandeering to address issues and use ifunc to decide whether VSX should be used.

[PPC64] Optimize bcopy/memcpy/memmove

  • Added ifunc support
  • Moved vsx code to _vsx.S files
  • Fixes and refactoring

@jhibbits, this last change addresses your last comment (moving VSX code to _vsx.S files) and also adds ifunc support.
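An ifunc-based dispatch looks roughly like this (a sketch using the GNU ifunc attribute as supported by GCC/Clang on ELF platforms; the function names and the probe are hypothetical, and the real implementations are in the .S files):

```c
#include <stddef.h>
#include <string.h>

typedef void *(*memcpy_fn)(void *, const void *, size_t);

/* Hypothetical implementations; the real ones live in .S files. */
static void *memcpy_plain(void *d, const void *s, size_t n) { return memcpy(d, s, n); }
static void *memcpy_vsx(void *d, const void *s, size_t n)   { return memcpy(d, s, n); }

/* Hypothetical probe; the real resolver would check AT_HWCAP. */
static int cpu_has_vsx(void) { return (0); }

/* The resolver runs once, when the dynamic linker binds the symbol,
 * so later calls jump directly to the chosen implementation. */
static memcpy_fn
opt_memcpy_resolver(void)
{
	return (cpu_has_vsx() ? memcpy_vsx : memcpy_plain);
}

void *opt_memcpy(void *, const void *, size_t)
    __attribute__((ifunc("opt_memcpy_resolver")));
```

Compared with the earlier in-implementation check, this removes the per-call branch entirely: the decision is made exactly once, at symbol-binding time.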

Preliminary tests indicate that a world built with these optimizations continues to work correctly and stably.

Looks fine overall, just the one question. I just want to be sure this is thoroughly tested before it goes in, on the corner cases.

103 ↗(On Diff #60579)

Since you're not accounting for alignment, have you tested misaligned addresses, and page boundary crossing? Some CPUs can't handle cross-page loads nicely.

103 ↗(On Diff #60579)

I've tested misaligned addresses and page boundary crossing on POWER8 VMs only.

It would be easy to add code to align the source buffer and avoid cross-page loads, but what about cross-page stores? Wouldn't they present issues on some CPUs too? Because it probably wouldn't be feasible to align both source and destination.

Fix ifunc resolver parameter name.

103 ↗(On Diff #60579)

Let me expand on my last comment, to better explain the possible issues with this and the strcpy/strncpy optimizations.

First, an update on the tests performed so far. I've tested all optimizations on POWER8, both in a VM and on bare metal. The tests included all combinations of misaligned source and destination (0 to 15 bytes off), with copies of 32 and 8K bytes, using 4K-aligned buffers (plus the test offset), in order to trigger a cross-page load/store and exercise both the single- and multi-phase parts of the code.

But, for consistency with the other libc optimizations (memcpy, strcpy and strncpy), I think the source buffer should really be aligned here too, which could improve performance on some CPUs.
In the strcpy case, not aligning the source buffer can also lead to reading from a page that the process is not allowed to access. By dword-aligning it, we guarantee that we will read no further than the dword containing the terminating '\0', and thus no page boundary is crossed.
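The aligned dword-at-a-time scan can be sketched with the classic zero-byte bitmask trick (a portable illustration of the safety argument, not the powerpc64 assembly; the helper names are hypothetical):

```c
#include <stddef.h>
#include <stdint.h>

/* Nonzero iff some byte of x is zero (classic "haszero" test). */
static int
dword_has_zero(uint64_t x)
{
	return (((x - 0x0101010101010101ULL) & ~x &
	    0x8080808080808080ULL) != 0);
}

/* Scan a string a dword at a time after aligning the pointer.
 * Because every wide load is 8-byte aligned, no load crosses a page
 * boundary, so we never touch a page past the terminating '\0'. */
static size_t
my_strlen(const char *s)
{
	const char *p = s;
	const uint64_t *w;

	while (((uintptr_t)p & 7) != 0) {	/* align to 8 bytes */
		if (*p == '\0')
			return ((size_t)(p - s));
		p++;
	}

	w = (const uint64_t *)p;
	while (!dword_has_zero(*w))
		w++;

	p = (const char *)w;			/* locate the exact byte */
	while (*p != '\0')
		p++;
	return ((size_t)(p - s));
}
```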

However, there is a potential issue in all cases: it is not feasible to align both source and destination.
Considering that we always align the source, this means that misaligned stores may occur and may even cross a page boundary.
To avoid this, we would probably need to fall back to byte copy, after checking that source and destination can't be aligned.
Note that checking this every time can hurt performance, especially on small copies.
On newer CPUs, which handle misaligned/cross-page stores well, this extra code could be omitted by introducing another function or entry point that the resolver could choose in this case.

In summary, it seems the viable options for handling misaligned/cross page stores are:

  1. Simply use the non-optimized version for older (than POWER8 maybe) CPUs.
  2. Introduce extra code to avoid misaligned/cross page accesses, but skip those on newer CPUs.

In the case of strcpy/strncpy optimizations, that already require the CPU to be >= ARCH_2.05, option 1 seems better.
For memcpy/memmove/bcopy, however, all 64-bit archs can benefit from the optimized dword copy loop, so option 2 may be better here, although for the VSX version we should probably go with option 1 as well.

What do you think?

Align src in bcopy/memmove

luporl added inline comments.
103 ↗(On Diff #60579)

As discussed in IRC, misaligned/cross-page stores may hurt performance in old CPUs, but won't cause crashes or wrong behavior.

The last diff now performs source alignment for single phase copies.

This revision is now accepted and ready to land. Jan 14 2020, 5:36 PM
This revision was automatically updated to reflect the committed changes.