Page MenuHomeFreeBSD

powerpc/powerpc64: Enforce natural alignment in bcopy
ClosedPublic

Authored by bruno.larsen_eldorado.org.br on Feb 18 2021, 5:15 PM.

Details

Summary

POWER architecture CPUs (Book-S) require natural alignment for cache-inhibited storage accesses. Since we can't know the caching model for a page ahead of time, always enforce natural alignment in bcopy. This fixes a SIGBUS when calling the function with misaligned pointers on POWER7

Test Plan

The SIGBUS can be triggered by using lffpires' funcperf without the fix, and it stops with this fix

Diff Detail

Repository
R10 FreeBSD src repository
Lint
Automatic diff as part of commit; lint not applicable.
Unit
Automatic diff as part of commit; unit tests not applicable.

Event Timeline

lib/libc/powerpc64/string/bcopy.S
104–106

.Lavoid_vsx leads to .Lsingle_copy, that end up copying double words, that require 8 byte alignment with cache-inhibited storage.

When memory is not aligned you probably want to branch to .Lsingle_1 instead (it may need some adaptations though), that copies bytes.

  • improved formatting
  • solution to the problem outlined by luporl

The alignment problem is solved, but this patch makes bcopy performance go down by a factor of 10. I think a third version of the code should be considered, restricting this patch only to POWER7 CPUs.

  • improved formatting
  • solution to the problem outlined by luporl

The alignment problem is solved, but this patch makes bcopy performance go down by a factor of 10. I think a third version of the code should be considered, restricting this patch only to POWER7 CPUs.

Does it produce any gain over original non-optimized code on Power7? If there's no gain I'd suggest restrict optimizations to ISA >= v.2.07 (POWER8). And in this case, remove unaligned access handling since newer CPUs can handle it transparently.

If there's still a gain I agree with having two optimized versions one handling VSX & ISA <=2.06 and the other for VSX & ISA >=2.07 (probably removing unaligned handling to make code simpler if no negative performance impact)

  • improved formatting
  • solution to the problem outlined by luporl

The alignment problem is solved, but this patch makes bcopy performance go down by a factor of 10. I think a third version of the code should be considered, restricting this patch only to POWER7 CPUs.

Does it produce any gain over original non-optimized code on Power7? If there's no gain I'd suggest restrict optimizations to ISA >= v.2.07 (POWER8). And in this case, remove unaligned access handling since newer CPUs can handle it transparently.

If there's still a gain I agree with having two optimized versions one handling VSX & ISA <=2.06 and the other for VSX & ISA >=2.07 (probably removing unaligned handling to make code simpler if no negative performance impact)

Before going further with this discussion, it is important to realize that there are actually 2 issues with bcopy to be solved.

The first one is the use of VSX on POWER7 and older CPUs, that causes SIGBUS with unaligned memory.

The second one is unaligned access on cache-inhibited storage, which AFAIK is not allowed on ANY PowerPC CPU.

So, unless we ignore cache-inhibited storage, commonly used for stuff like framebuffers, we just can't allow unaligned accesses to memory.
That's why my suggestion was to do the same as was done with memcpy, of using only byte copies on unaligned memory.

Besides that, source must always be aligned, to prevent issues such as reading across page boundaries, when the upper page memory may not be accessible.

@bruno.larsen_eldorado.org.br, overall your changes look good to me.
I have only one more suggestion and some nitpicks.

lib/libc/powerpc64/string/bcopy.S
142

Style: add a space after /* and before */.

143–144

When src and dst are not aligned between themselves (%r6 == 0), you don't need to run most of the code between .Lavoid_vsx and here, especially the align loop, since data will be copied byte by byte.

IMHO, when detecting that src and dst are not aligned, at the beginning of bcopy, you could just branch to a new label, setup increment/decrement and ctr, advance src and dst to end (in the backward copy case) and jump to single_1_loop.

This may improve performance a bit.

184

s/ever/every/

Before going further with this discussion, it is important to realize that there are actually 2 issues with bcopy to be solved.

The first one is the use of VSX on POWER7 and older CPUs, that causes SIGBUS with unaligned memory.

The second one is unaligned access on cache-inhibited storage, which AFAIK is not allowed on ANY PowerPC CPU.

So, unless we ignore cache-inhibited storage, commonly used for stuff like framebuffers, we just can't allow unaligned accesses to memory.
That's why my suggestion was to do the same as was done with memcpy, of using only byte copies on unaligned memory.

Besides that, source must always be aligned, to prevent issues such as reading across page boundaries, when the upper page memory may not be accessible.

I just checked memcpy and noticed that it doesn't use doubleword copying, just jumps straight into byte copying. I'm wondering if the gain of using dword are enough to justify the loss of time for decision making. If it is, why isn't it part of memcpy? If it isn't, why is it here?

And testing for this is harder than just using funcperf, since it depends heavily on the percentage of code that goes to bcopy unaligned, and that test explicitly only aligns 1/64th of the code it sends to test different performance on all possible configurations, without trying to get an average for real world usage

EDIT: I just checked again and I think I'm blind. It's there, but it is word, instead of dword. That's what got me confused, and I just realised I used the wrong alignment mask, so I'll be correcting that.

  • added new execution path in case src and dst are relatively unaligned
  • fixed the style issues

The change in execution path has marginal performance improvements, but it's easier to see what is going on.

It looks great now! Since this changes core libc functions, can you perform a buildworld on a system with the modified libc installed to confirm everything works fine?

This revision is now accepted and ready to land.Feb 24 2021, 12:36 PM

It looks great now! Since this changes core libc functions, can you perform a buildworld on a system with the modified libc installed to confirm everything works fine?

Just tried installing it on a fresh VM and performing a buildworld. Worked without a hitch