POWER architecture CPUs (Book-S) require natural alignment for cache-inhibited storage accesses. Since we can't know the caching model for a page ahead of time, always enforce natural alignment in bcopy. This fixes a SIGBUS when calling the function with misaligned pointers on POWER7
The SIGBUS can be triggered by using lffpires' funcperf without the fix, and it stops with this fix
.Lavoid_vsx leads to .Lsingle_copy, that end up copying double words, that require 8 byte alignment with cache-inhibited storage.
When memory is not aligned you probably want to branch to .Lsingle_1 instead (it may need some adaptations though), that copies bytes.
- improved formatting
- solution to the problem outlined by luporl
The alignment problem is solved, but this patch makes bcopy performance go down by a factor of 10. I think a third version of the code should be considered, restricting this patch only to POWER7 CPUs.
Does it produce any gain over original non-optimized code on Power7? If there's no gain I'd suggest restrict optimizations to ISA >= v.2.07 (POWER8). And in this case, remove unaligned access handling since newer CPUs can handle it transparently.
If there's still a gain I agree with having two optimized versions one handling VSX & ISA <=2.06 and the other for VSX & ISA >=2.07 (probably removing unaligned handling to make code simpler if no negative performance impact)
Before going further with this discussion, it is important to realize that there are actually 2 issues with bcopy to be solved.
The first one is the use of VSX on POWER7 and older CPUs, that causes SIGBUS with unaligned memory.
The second one is unaligned access on cache-inhibited storage, which AFAIK is not allowed on ANY PowerPC CPU.
So, unless we ignore cache-inhibited storage, commonly used for stuff like framebuffers, we just can't allow unaligned accesses to memory.
That's why my suggestion was to do the same as was done with memcpy, of using only byte copies on unaligned memory.
Besides that, source must always be aligned, to prevent issues such as reading across page boundaries, when the upper page memory may not be accessible.
@bruno.larsen_eldorado.org.br, overall your changes look good to me.
I have only one more suggestion and some nitpicks.
Style: add a space after /* and before */.
When src and dst are not aligned between themselves (%r6 == 0), you don't need to run most of the code between .Lavoid_vsx and here, especially the align loop, since data will be copied byte by byte.
IMHO, when detecting that src and dst are not aligned, at the beginning of bcopy, you could just branch to a new label, setup increment/decrement and ctr, advance src and dst to end (in the backward copy case) and jump to single_1_loop.
This may improve performance a bit.
I just checked memcpy and noticed that it doesn't use doubleword copying, just jumps straight into byte copying. I'm wondering if the gain of using dword are enough to justify the loss of time for decision making. If it is, why isn't it part of memcpy? If it isn't, why is it here?
And testing for this is harder than just using funcperf, since it depends heavily on the percentage of code that goes to bcopy unaligned, and that test explicitly only aligns 1/64th of the code it sends to test different performance on all possible configurations, without trying to get an average for real world usage
EDIT: I just checked again and I think I'm blind. It's there, but it is word, instead of dword. That's what got me confused, and I just realised I used the wrong alignment mask, so I'll be correcting that.
- added new execution path in case src and dst are relatively unaligned
- fixed the style issues
The change in execution path has marginal performance improvements, but it's easier to see what is going on.