Page MenuHomeFreeBSD

arm64: bzero has been optimized
ClosedPublic

Authored by der_semihalf.com on Mar 24 2016, 11:29 AM.
Tags
Referenced Files
Unknown Object (File)
Tue, Nov 26, 4:52 PM
Unknown Object (File)
Tue, Nov 26, 2:14 PM
Unknown Object (File)
Mon, Nov 25, 6:11 AM
Unknown Object (File)
Nov 6 2024, 6:32 AM
Unknown Object (File)
Oct 6 2024, 4:33 AM
Unknown Object (File)
Sep 29 2024, 3:14 PM
Unknown Object (File)
Sep 29 2024, 8:00 AM
Unknown Object (File)
Sep 28 2024, 12:45 AM
Subscribers

Details

Summary

This optimization attempts to utylize as wide as possible register store instructions to zero large buffers.
The implementation, if possible, will use 'dc zva' to zero buffer by cache lines.

Test results from on Thunder:
bzero_old offset: 1, size 1, time 0.015625
bzero_old offset: 1, size 2, time 0.015625
bzero_old offset: 1, size 4, time 0.03125
bzero_old offset: 1, size 8, time 0.03125
bzero_old offset: 1, size 16, time 0.0546875
bzero_old offset: 1, size 32, time 0.09375
bzero_old offset: 1, size 64, time 0.164062
bzero_old offset: 1, size 128, time 0.3125
bzero_old offset: 1, size 256, time 0.617188
bzero_old offset: 1, size 512, time 1.21094
bzero_old offset: 1, size 1024, time 2.40625
bzero_old offset: 1, size 2048, time 4.80469
bzero_old offset: 1, size 4096, time 9.58594
bzero_c offset: 1, size 1, time 0.0546875
bzero_c offset: 1, size 2, time 0.0703125
bzero_c offset: 1, size 4, time 0.09375
bzero_c offset: 1, size 8, time 0.148438
bzero_c offset: 1, size 16, time 0.171875
bzero_c offset: 1, size 32, time 0.179688
bzero_c offset: 1, size 64, time 0.21875
bzero_c offset: 1, size 128, time 0.25
bzero_c offset: 1, size 256, time 0.296875
bzero_c offset: 1, size 512, time 0.40625
bzero_c offset: 1, size 1024, time 0.609375
bzero_c offset: 1, size 2048, time 1.01562
bzero_c offset: 1, size 4096, time 1.84375
bzero_new offset: 1, size 1, time 0.0234375
bzero_new offset: 1, size 2, time 0.03125
bzero_new offset: 1, size 4, time 0.03125
bzero_new offset: 1, size 8, time 0.0390625
bzero_new offset: 1, size 16, time 0.03125
bzero_new offset: 1, size 32, time 0.0390625
bzero_new offset: 1, size 64, time 0.0390625
bzero_new offset: 1, size 128, time 0.078125
bzero_new offset: 1, size 256, time 0.0859375
bzero_new offset: 1, size 512, time 0.09375
bzero_new offset: 1, size 1024, time 0.09375
bzero_new offset: 1, size 2048, time 0.140625
bzero_new offset: 1, size 4096, time 0.164062

For 1024^2 bzero calls on buffer with various sizes. Buffer pointer has been cache line size aligned and moved by offset (here 1 which is worst case scenario ). bzero_old is previous implementation, bzero_c is C implementation as taken from PowerPC and bzero_new is new implementation.

Diff Detail

Repository
rS FreeBSD src repository - subversion
Lint
Lint Skipped
Unit
Tests Skipped

Event Timeline

der_semihalf.com retitled this revision from to arm64: bzero has been optimized.
der_semihalf.com updated this object.
der_semihalf.com edited the test plan for this revision. (Show Details)
der_semihalf.com added reviewers: kib, wma, zbb, andrew.
der_semihalf.com set the repository for this revision to rS FreeBSD src repository - subversion.
der_semihalf.com added a project: arm64.
sys/arm64/arm64/bzero.S
44

Suggest: if size is <= 8 goto lead_out and skip the rest.

46

-> zero

67

Can we simplify this part? I don't like using mul and div in performance critical code.

Let's assume

mrs x7, DCZID_EL0
and x7, x7, 0x0f
add x7, #2 /* x7 is log2 number of bytes in cache line */
lsl x2, #1, x7
sub x2, x2, #1 /* cacheline bitmask */

sub x4, x7, x0
and x4, x4, x2

sub x5, x1, x4
lsr x5, x5, x7

add x6, x2, #1 /* cacheline size */
lsl x6, x6, x5
sub x6, x1, x6
sub x6, x6, x4

<lead_in>
<dc zero>
<lead_out>

162

We can move cacheline_zero to a separate function and use it directly in page_zero instead of calling bzero there. That should give us performance increase because we omit all alignment checking in place where it is guaranteed.

sys/conf/files.arm64
39

Alphabetical order.

sys/arm64/arm64/bzero.S
43

Do we check if the pointer is NULL on other architectures?

50–57

You should read the size once in cache_setup(), then use the size read there in bzero.

sys/arm64/arm64/bzero.S
43

No, and such check is IMO wrong in principle.

sys/arm64/arm64/bzero.S
3

Extra '*'

Review comments have been applied with following differences:

  • will jump directly to normal for buffers with size <= 16 rather than 8 as suggested by wma
  • function has not been split because optimized pagezero is in work; which is much more simplier than bzero

Do you have any more comments regarding the latest changes?
I would like to move forward with this and need to know if you are OK with the soultion or need some changes to it.

I would fix comments before committing.
For example

/* Buffer must be larger than cache line for using cache zeroing
         * (and cache line aligned but this is checked after jump) */

to

/*
 * Buffer must be larger than cache line for using cache zeroing
 * (and cache line aligned but this is checked after jump)
 */

Do you have any more comments regarding the latest changes?
I would like to move forward with this and need to know if you are OK with the soultion or need some changes to it.

Initial NULL check is still there.

In D5726#123719, @kib wrote:

Initial NULL check is still there.

I am only checking size to be 0 - to awoid any work. I have to do it either here or before "normal" otherwise normal will attempt to perform at least one loop run, since checking is at the end of loop, and then will overflow and keep running.

I am only checking size to be 0 - to awoid any work. I have to do it either here or before "normal" otherwise normal will attempt to perform at least one loop run, since checking is at the end of loop, and then will overflow and keep running.

I misread the code, sorry.

sys/arm64/arm64/bzero.S
44–45
/*
 * Multi-line comments look like this.  Make them real sentences.  Fill
 * them so they look like real paragraphs.
 */

(from style(9))

48

Why?

63–64

I'm not sure how to parse this sentence.

der_semihalf.com edited edge metadata.

Reviewers comments have been applied.

Something more to fix?

sys/arm64/arm64/bzero.S
48

"normal" is used for: buffers shorter than 16bytes, to align address before using dc zva (stp) and to fill in what have left after dc zva (or last possible stp). x5 is calculated in 85 for buffers that are longer than 16 bytes and of course may be 0 after calculations (if size is not enough to use dc zva); in 175 we are checking if x5 is 0 - this is after execution of "normal" has ended - if we would be executing normal for buffers that are shorter than 16 bytes and x5 would not be set to 0 at 48 (before executing 52) then, after completion of execution for buffers with size shorter than 16 bytes, x5 value is random and we may attempt to perform "cache_line_zero".

sys/arm64/arm64/bzero.S
48

Please convert this to a comment and add to the source code.

Explanation for x5 zeroing before jumping to "normal" for buffers of size <= 16 has been added.

kib edited edge metadata.
This revision is now accepted and ready to land.Apr 1 2016, 11:54 AM
This revision was automatically updated to reflect the committed changes.