powerpc/powerpc64: Enforce natural alignment in bcopy
Closed · Public

Authored by bruno.larsen_eldorado.org.br on Feb 18 2021, 5:15 PM.

Details

Reviewers

jhibbits
luporl
alfredo

Commits

rG1805ce694542: powerpc64: enforce natural alignment in bcopy
rG2f561284033c: [PowerPC64] Enforce natural alignment in bcopy

Summary

POWER architecture CPUs (Book-S) require natural alignment for cache-inhibited storage accesses. Since we can't know the caching model of a page ahead of time, always enforce natural alignment in bcopy. This fixes a SIGBUS raised on POWER7 when the function is called with misaligned pointers.

Test Plan

Without the fix, the SIGBUS can be triggered with lffpires' funcperf; with the fix applied, it no longer occurs.

Diff Detail

Repository

rG FreeBSD src repository

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

bruno.larsen_eldorado.org.br created this revision. · Feb 18 2021, 5:15 PM

Herald added subscribers: Contributor Reviews (src), imp. · Feb 18 2021, 5:15 PM

bruno.larsen_eldorado.org.br requested review of this revision. · Feb 18 2021, 5:15 PM

Harbormaster completed remote builds in B37163: Diff 84209. · Feb 18 2021, 5:15 PM

bruno.larsen_eldorado.org.br added reviewers: jhibbits, luporl, alfredo. · Feb 18 2021, 6:13 PM

bruno.larsen_eldorado.org.br added a project: PowerPC.

Herald added a subscriber: bdragon. · Feb 18 2021, 6:13 PM

luporl added inline comments. · Feb 18 2021, 7:13 PM

lib/libc/powerpc64/string/bcopy.S
104–106	.Lavoid_vsx leads to .Lsingle_copy, which ends up copying doublewords, and those require 8-byte alignment on cache-inhibited storage. When memory is not aligned, you probably want to branch to .Lsingle_1 instead, which copies bytes (it may need some adaptations though).
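To make the requirement in this comment concrete, here is a minimal C sketch of a natural-alignment check. The helper name is hypothetical and does not appear in bcopy.S; it only illustrates that an access of a given power-of-two size is naturally aligned when the address is a multiple of that size.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of the patch: returns true when p is
 * aligned to `size` bytes. `size` is assumed to be a power of two
 * (1, 2, 4 or 8 for byte, halfword, word and doubleword accesses).
 */
static bool
is_naturally_aligned(const void *p, size_t size)
{
	return (((uintptr_t)p & (size - 1)) == 0);
}
```

Under this definition a doubleword copy would need `is_naturally_aligned(p, 8)` to hold for both the source and the destination, while a byte copy is always naturally aligned.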

improved formatting
solution to the problem outlined by luporl

The alignment problem is solved, but this patch makes bcopy performance go down by a factor of 10. I think a third version of the code should be considered, restricting this patch only to POWER7 CPUs.

Harbormaster completed remote builds in B37271: Diff 84457. · Feb 22 2021, 2:46 PM

In D28776#645964, @bruno.larsen_eldorado.org.br wrote:

improved formatting

solution to the problem outlined by luporl

The alignment problem is solved, but this patch makes bcopy performance go down by a factor of 10. I think a third version of the code should be considered, restricting this patch only to POWER7 CPUs.

Does it produce any gain over the original non-optimized code on POWER7? If there's no gain, I'd suggest restricting the optimizations to ISA >= v2.07 (POWER8) and, in that case, removing the unaligned access handling, since newer CPUs can handle it transparently.

If there's still a gain, I agree with having two optimized versions: one handling VSX & ISA <= 2.06 and the other for VSX & ISA >= 2.07 (probably removing the unaligned handling to make the code simpler, if there is no negative performance impact).

In D28776#646046, @alfredo wrote:

Does it produce any gain over the original non-optimized code on POWER7? If there's no gain, I'd suggest restricting the optimizations to ISA >= v2.07 (POWER8) and, in that case, removing the unaligned access handling, since newer CPUs can handle it transparently.

If there's still a gain, I agree with having two optimized versions: one handling VSX & ISA <= 2.06 and the other for VSX & ISA >= 2.07 (probably removing the unaligned handling to make the code simpler, if there is no negative performance impact).

Before going further with this discussion, it is important to realize that there are actually 2 issues with bcopy to be solved.

The first one is the use of VSX on POWER7 and older CPUs, that causes SIGBUS with unaligned memory.

The second one is unaligned access on cache-inhibited storage, which AFAIK is not allowed on ANY PowerPC CPU.

So, unless we ignore cache-inhibited storage, commonly used for stuff like framebuffers, we just can't allow unaligned accesses to memory.
That's why my suggestion was to do the same as was done with memcpy, of using only byte copies on unaligned memory.

Besides that, source must always be aligned, to prevent issues such as reading across page boundaries, when the upper page memory may not be accessible.
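As a rough illustration of the strategy described above, the following C sketch copies forward using doublewords only when src and dst share the same offset within 8 bytes, and falls back to byte copies otherwise. It is a sketch under assumptions, not the code in bcopy.S: the function name is hypothetical, only the forward, non-overlapping case is shown, and the fixed-size memcpy calls stand in for a single doubleword load and store.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical name; forward copy only, buffers assumed not to overlap. */
static void
copy_forward_sketch(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;
	uint64_t dword;

	/* Doublewords are usable only if both pointers share the same low 3 bits. */
	if ((((uintptr_t)d ^ (uintptr_t)s) & 7) == 0) {
		/* Copy bytes until src (and therefore dst) is 8-byte aligned. */
		while (len > 0 && ((uintptr_t)s & 7) != 0) {
			*d++ = *s++;
			len--;
		}
		/* Naturally aligned 8-byte copies; stands in for ld/std. */
		while (len >= 8) {
			memcpy(&dword, s, sizeof(dword));
			memcpy(d, &dword, sizeof(dword));
			s += 8;
			d += 8;
			len -= 8;
		}
	}
	/* Remaining tail, or the whole buffer when relatively unaligned. */
	while (len > 0) {
		*d++ = *s++;
		len--;
	}
}
```

Because every multi-byte access happens only at an address that is a multiple of its size, no access in this sketch is misaligned, whatever the caching model of the page.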

@bruno.larsen_eldorado.org.br, overall your changes look good to me.
I have only one more suggestion and some nitpicks.

lib/libc/powerpc64/string/bcopy.S
143	Style: add a space after /* and before */.
144–145	When src and dst are not aligned between themselves (%r6 == 0), you don't need to run most of the code between .Lavoid_vsx and here, especially the align loop, since data will be copied byte by byte. IMHO, when detecting at the beginning of bcopy that src and dst are not aligned, you could just branch to a new label, set up increment/decrement and ctr, advance src and dst to the end (in the backward copy case) and jump to single_1_loop. This may improve performance a bit.
185	s/ever/every/
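The early exit suggested in the comment on lines 144–145 can be sketched in C as follows. All names here are hypothetical and the code is only an approximation of the idea, not the assembly in the patch: a relative-alignment test decides whether the byte path is needed, and the byte path picks its direction once, moving both pointers to the end for the backward case.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: true when src and dst share the same offset within 8 bytes. */
static bool
relatively_aligned(const void *src, const void *dst)
{
	return ((((uintptr_t)src ^ (uintptr_t)dst) & 7) == 0);
}

/*
 * Hypothetical byte-only path with bcopy argument order (src, dst, len).
 * Copies backward when dst overlaps the upper part of src, so that
 * overlapping buffers are handled like memmove does.
 */
static void
copy_bytes_sketch(const void *src, void *dst, size_t len)
{
	const unsigned char *s = src;
	unsigned char *d = dst;

	if ((uintptr_t)d > (uintptr_t)s && (uintptr_t)d - (uintptr_t)s < len) {
		/* Backward: advance both pointers to the end first. */
		s += len;
		d += len;
		while (len-- > 0)
			*--d = *--s;
	} else {
		/* Forward: plain ascending byte loop. */
		while (len-- > 0)
			*d++ = *s++;
	}
}
```

A caller would test `relatively_aligned(src, dst)` once on entry and go straight to `copy_bytes_sketch` when it is false, skipping the alignment loop entirely.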

In D28776#646080, @luporl wrote:

Before going further with this discussion, it is important to realize that there are actually 2 issues with bcopy to be solved.

The first one is the use of VSX on POWER7 and older CPUs, that causes SIGBUS with unaligned memory.

The second one is unaligned access on cache-inhibited storage, which AFAIK is not allowed on ANY PowerPC CPU.

So, unless we ignore cache-inhibited storage, commonly used for stuff like framebuffers, we just can't allow unaligned accesses to memory.
That's why my suggestion was to do the same as was done with memcpy, of using only byte copies on unaligned memory.

Besides that, source must always be aligned, to prevent issues such as reading across page boundaries, when the upper page memory may not be accessible.

I just checked memcpy and noticed that it doesn't use doubleword copying; it jumps straight into byte copying. I'm wondering whether the gain from using dwords is enough to justify the time lost on decision making. If it is, why isn't it part of memcpy? If it isn't, why is it here?

Testing for this is also harder than just using funcperf, since the answer depends heavily on the percentage of code that calls bcopy with unaligned pointers. That test explicitly aligns only 1/64th of the code it sends, in order to measure performance on all possible configurations, without trying to get an average for real-world usage.

EDIT: I just checked again and I think I'm blind. It's there, but it is word, instead of dword. That's what got me confused, and I just realised I used the wrong alignment mask, so I'll be correcting that.

added new execution path in case src and dst are relatively unaligned
fixed the style issues

The change in execution path has marginal performance improvements, but it's easier to see what is going on.

Harbormaster completed remote builds in B37312: Diff 84550. · Feb 23 2021, 2:47 PM

It looks great now! Since this changes core libc functions, can you perform a buildworld on a system with the modified libc installed to confirm everything works fine?

This revision is now accepted and ready to land. · Feb 24 2021, 12:36 PM

In D28776#647140, @luporl wrote:

It looks great now! Since this changes core libc functions, can you perform a buildworld on a system with the modified libc installed to confirm everything works fine?

Just tried installing it on a fresh VM and performing a buildworld. It worked without a hitch.

Closed by commit rG2f561284033c: [PowerPC64] Enforce natural alignment in bcopy (authored by luporl). · Mar 25 2021, 4:11 PM

This revision was automatically updated to reflect the committed changes.

luporl added a commit: rG2f561284033c: [PowerPC64] Enforce natural alignment in bcopy.

luporl added a commit: rG1805ce694542: powerpc64: enforce natural alignment in bcopy. · Apr 6 2021, 11:30 AM