ARM64 copyinout improvements
ClosedPublic

Authored by wma on Mar 17 2016, 12:26 PM.

Details

Summary

The first of a set of patches.
Use wider loads/stores when an aligned buffer is being copied.
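
As an illustration of the idea only (this is not the committed copyinout.S code), a minimal C sketch of "use wider accesses when the buffers are aligned" could look like the following. The function name `copy_wide_if_aligned` and the 8-byte alignment check are assumptions made for the example; the real routine is written in assembly and also has to handle user-space faults.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper, not the kernel routine: copy len bytes, using
 * 64-bit accesses when both pointers are 8-byte aligned and falling
 * back to byte accesses otherwise.
 */
static void
copy_wide_if_aligned(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	if ((((uintptr_t)d | (uintptr_t)s) & 7) == 0) {
		/* Aligned case: move one 64-bit word per iteration. */
		while (len >= 8) {
			uint64_t w;

			memcpy(&w, s, sizeof(w));	/* typically one 64-bit load */
			memcpy(d, &w, sizeof(w));	/* typically one 64-bit store */
			d += 8;
			s += 8;
			len -= 8;
		}
	}
	/* Unaligned buffers and any remaining tail: byte by byte. */
	while (len-- > 0)
		*d++ = *s++;
}
```

The byte-by-byte fallback corresponds to the "better handling of unaligned buffers" item that is still listed as work in progress.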

In a simple test:

dd if=/dev/zero of=/dev/null bs=1M count=1024

the performance jumped from 410MB/s up to 3.6GB/s.

TODO:

  • better handling of unaligned buffers (WiP)
  • implement similar mechanism to bzero

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Not Applicable
Unit
Tests Not Applicable

Event Timeline

wma retitled this revision from to ARM64 copyinout improvements.
wma updated this object.
wma edited the test plan for this revision. (Show Details)
wma added reviewers: zbb, der_semihalf.com, andrew.
wma set the repository for this revision to rS FreeBSD src repository - subversion.

What benchmarks have you run, and on what hardware?

Also be aware that I expect to be rewriting this code in a few months.

The 'dd' test mentioned above, for example, was run on ThunderX. All changes were also tested on a generic Cortex-A57.
The code is optimized to minimize concurrent memory accesses because, unlike cores from other vendors, the ThunderX core is very sensitive to that aspect. The change seemed simple, but some hardware restrictions made it tricky. That's why the whole rework was split into 3 patches, of which this is the first. I can provide you with more details privately if you still have a valid NDA with Cavium.

Generally speaking, I'd like to have this change applied before the code freeze for rel-11. I can take over improvements in copyin/out and bzero for now taking ThunderX and A57 as reference platforms (that's what we have locally).

I'd like to have some hint as to the reason this is done in a comment in the source, so that someone reading this in the future doesn't change it without understanding why it's this way.

Can you test on a Cortex-A53? Both of the cores you tested on have out-of-order pipelines where the A53 has an in-order pipeline. I would expect it to also be faster, but it would be useful to check.

Can you also try smaller bs values in dd to show how the performance changes with buffer size? You can also use ministat to see the magnitude of the performance difference between two benchmarks.

sys/arm64/arm64/copyinout.S
160–163 ↗(On Diff #14389)

Why 4 loads and not, for example, 2 or 8?

I have experimented with different amounts of overlap here and thought I got some improvement with a little overlap of the load and store instructions, but may be misremembering the details.

183 ↗(On Diff #14389)

Only one p in copy.

Can you test on a Cortex-A53? Both of the cores you tested on have out-of-order pipelines where the A53 has an in-order pipeline. I would expect it to also be faster, but it would be useful to check.

This might be a useful case (later on) for ifunc-like kernel support if it turns out that we'd benefit from different implementations for A57/ThunderX and A53.

Sure thing, will try to run some tests on R-Pi3 also, but since this is not critical for this patch, that might take a while. I'm just busy with userspace crashes now.

sys/arm64/arm64/copyinout.S
160–163 ↗(On Diff #14389)

Eventually it's going to be 8, but for now the most generic value is 4. We'll change that in patch #2 or #3.

The overlap you've mentioned is a reasonable assumption. However, on ThunderX we need to avoid accessing different cachelines simultaneously. I've sent you an email describing why.

Can you also try with smaller bs values in dd to show how the performance changes for buffer sizes.

I have tested bs from 4096 bytes to above 1024kB. The 3.6GB/s figure was achieved with a 512kB block size; increasing the block size above that does not seem to improve the performance.

Log:
for ((i=4; i < 2048; i *= 2)); do echo "=== Block size is $(dc -e "$i 1024 * p") ==="; dd if=/dev/zero of=/dev/null bs=$(dc -e "$i 1024 * p") count=4096; done

Block size is 4096

4096+0 records in
4096+0 records out
16777216 bytes transferred in 0.010160 secs (1651279659 bytes/sec)

Block size is 8192

4096+0 records in
4096+0 records out
33554432 bytes transferred in 0.017451 secs (1922789814 bytes/sec)

Block size is 16384

4096+0 records in
4096+0 records out
67108864 bytes transferred in 0.026118 secs (2569427166 bytes/sec)

Block size is 32768

4096+0 records in
4096+0 records out
134217728 bytes transferred in 0.043755 secs (3067460780 bytes/sec)

Block size is 65536

4096+0 records in
4096+0 records out
268435456 bytes transferred in 0.080338 secs (3341319859 bytes/sec)

Block size is 131072

4096+0 records in
4096+0 records out
536870912 bytes transferred in 0.152331 secs (3524367685 bytes/sec)

Block size is 262144

4096+0 records in
4096+0 records out
1073741824 bytes transferred in 0.296268 secs (3624227828 bytes/sec)

Block size is 524288

4096+0 records in
4096+0 records out
2147483648 bytes transferred in 0.584010 secs (3677137204 bytes/sec)

Block size is 1048576

4096+0 records in
4096+0 records out
4294967296 bytes transferred in 1.173694 secs (3659358841 bytes/sec)

sys/arm64/arm64/copyinout.S
160–163 ↗(On Diff #14389)

I have tested 2, 4 and 8. There is a ~4% improvement when doing a 128-byte (8 pair) read followed by the write on the board I am using.
The approach of prefetching 64 bytes into registers and then interleaving writes and reads gave a slight (~2%) decrease in performance.

qword_by_qword does the 2 pair copy for buffers too small for "block" transfer.
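
To make the "read a block, then write it" structure discussed in this thread concrete, here is a rough C sketch. It is not the assembly from the diff: the name `copy_blocks`, the 64-byte block (four pairs of 64-bit registers) and the 16-byte step for smaller buffers are assumptions chosen for illustration, and a C compiler is free to schedule the accesses differently from hand-written assembly.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_WORDS 8	/* 8 x 64-bit words = 64 bytes, i.e. four register pairs */

/* Hypothetical sketch; assumes dst and src do not overlap. */
static void
copy_blocks(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;
	uint64_t regs[BLOCK_WORDS];

	/* "Block" path: read a whole 64-byte block first, then write it out. */
	while (len >= sizeof(regs)) {
		memcpy(regs, s, sizeof(regs));
		memcpy(d, regs, sizeof(regs));
		s += sizeof(regs);
		d += sizeof(regs);
		len -= sizeof(regs);
	}
	/* Buffers (or remainders) too small for a block: 16 bytes at a time. */
	while (len >= 16) {
		uint64_t pair[2];

		memcpy(pair, s, sizeof(pair));
		memcpy(d, pair, sizeof(pair));
		s += 16;
		d += 16;
		len -= 16;
	}
	/* Final tail: byte by byte. */
	while (len-- > 0)
		*d++ = *s++;
}
```

Changing `BLOCK_WORDS` to 4 or 16 gives the 2-pair and 8-pair variants compared in the comments above.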

kib edited edge metadata.

About ifunc, I have some work in progress which basically emulates ifunc syntactically on x86; internally the implementation uses pointers instead of doing the relocations in place. It is easy to adapt that to other arches, but I believe that right now arm64 is the only arch whose toolchain actually supports ifuncs (except that the kernel relocator does not).

This revision is now accepted and ready to land. Mar 18 2016, 1:58 AM
wma edited edge metadata.
This revision now requires review to proceed. Mar 22 2016, 12:35 PM
This revision was automatically updated to reflect the committed changes.

Just for the record, the speedup on Andrew's image for RPi3 is marginal.

Old copyinout:

root@rpi3:~ # dd if=/dev/zero of=/dev/null bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes transferred in 6.358872 secs (168857285 bytes/sec)
root@rpi3:~ #

New copyinout:

root@rpi3:~ # dd if=/dev/zero of=/dev/null bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes transferred in 5.564260 secs (192971177 bytes/sec)
root@rpi3:~ #

On A57 it is similar to ThunderX, i.e. 10x faster than the old one.

Ech.. sorry for the confusion, I ran the wrong image. As expected, the speedup is similar on RPi3 also.

root@rpi3:~ # dd if=/dev/zero of=/dev/null bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes transferred in 0.712963 secs (1506026600 bytes/sec)
root@rpi3:~ #
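
For comparing the dd figures quoted above, the speedup is simply the ratio of the two bytes/sec values. The helper below is a hypothetical illustration (the name `speedup` is not from the review). With the RPi3 numbers reported here, 168857285 to 192971177 bytes/sec is roughly 1.14x for the wrong image, and 168857285 to 1506026600 bytes/sec is roughly 8.9x for the correct one.

```c
/* Hypothetical helper: ratio of two dd throughput figures in bytes/sec. */
static double
speedup(double old_bytes_per_sec, double new_bytes_per_sec)
{
	return new_bytes_per_sec / old_bytes_per_sec;
}
```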
head/sys/arm64/arm64/copyinout.S
136

Why 4? This will align to 2^4 bytes.
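
The arithmetic behind this remark: an alignment argument of n that is interpreted as a power of two means rounding the address up to a multiple of 2^n bytes, so 4 gives 16-byte alignment. A small C sketch of that rounding follows; the helper name `align_up_pow2` is an assumption for the example.

```c
#include <stdint.h>

/* Hypothetical helper: round addr up to a multiple of 2^log2 bytes. */
static uintptr_t
align_up_pow2(uintptr_t addr, unsigned log2)
{
	uintptr_t mask = ((uintptr_t)1 << log2) - 1;

	return (addr + mask) & ~mask;
}
```

For example, `align_up_pow2(0x1001, 4)` yields `0x1010`, while an address that is already a multiple of 16 is returned unchanged.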