Details

Reviewers

kib
zbb
wma
andrew

Commits

rS297536: arm64: bzero optimization

Summary

This optimization uses the widest available register store instructions to zero large buffers.
Where possible, the implementation uses 'dc zva' to zero the buffer one cache line at a time.
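
The overall shape of this approach can be sketched in C. This is only an illustration under stated assumptions, not the committed assembly: `sketch_bzero`, `sketch_zero_line` and the fixed 64-byte `SKETCH_LINE` are hypothetical names and values, and `memset` on one aligned block stands in for a single 'dc zva'.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_LINE 64u /* assumed block size; the real one is read from DCZID_EL0 */

/* Stand-in for one 'dc zva': zero exactly one aligned block. */
static void sketch_zero_line(unsigned char *p)
{
    assert(((uintptr_t)p & (SKETCH_LINE - 1)) == 0);
    memset(p, 0, SKETCH_LINE);
}

static void sketch_bzero(void *buf, size_t len)
{
    unsigned char *p = buf;

    /* Lead-in: narrow stores until p is block aligned or len runs out. */
    while (len > 0 && ((uintptr_t)p & (SKETCH_LINE - 1)) != 0) {
        *p++ = 0;
        len--;
    }
    /* Body: whole aligned blocks, one "dc zva" each. */
    while (len >= SKETCH_LINE) {
        sketch_zero_line(p);
        p += SKETCH_LINE;
        len -= SKETCH_LINE;
    }
    /* Lead-out: 8-byte stores first, then single bytes. */
    while (len >= 8) {
        uint64_t zero = 0;
        memcpy(p, &zero, sizeof(zero));
        p += 8;
        len -= 8;
    }
    while (len > 0) {
        *p++ = 0;
        len--;
    }
}
```

The point of the split is that the expensive alignment handling happens once, at the edges, while the bulk of a large buffer is cleared a whole block at a time.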

Test results on Thunder:
bzero_old offset: 1, size 1, time 0.015625
bzero_old offset: 1, size 2, time 0.015625
bzero_old offset: 1, size 4, time 0.03125
bzero_old offset: 1, size 8, time 0.03125
bzero_old offset: 1, size 16, time 0.0546875
bzero_old offset: 1, size 32, time 0.09375
bzero_old offset: 1, size 64, time 0.164062
bzero_old offset: 1, size 128, time 0.3125
bzero_old offset: 1, size 256, time 0.617188
bzero_old offset: 1, size 512, time 1.21094
bzero_old offset: 1, size 1024, time 2.40625
bzero_old offset: 1, size 2048, time 4.80469
bzero_old offset: 1, size 4096, time 9.58594
bzero_c offset: 1, size 1, time 0.0546875
bzero_c offset: 1, size 2, time 0.0703125
bzero_c offset: 1, size 4, time 0.09375
bzero_c offset: 1, size 8, time 0.148438
bzero_c offset: 1, size 16, time 0.171875
bzero_c offset: 1, size 32, time 0.179688
bzero_c offset: 1, size 64, time 0.21875
bzero_c offset: 1, size 128, time 0.25
bzero_c offset: 1, size 256, time 0.296875
bzero_c offset: 1, size 512, time 0.40625
bzero_c offset: 1, size 1024, time 0.609375
bzero_c offset: 1, size 2048, time 1.01562
bzero_c offset: 1, size 4096, time 1.84375
bzero_new offset: 1, size 1, time 0.0234375
bzero_new offset: 1, size 2, time 0.03125
bzero_new offset: 1, size 4, time 0.03125
bzero_new offset: 1, size 8, time 0.0390625
bzero_new offset: 1, size 16, time 0.03125
bzero_new offset: 1, size 32, time 0.0390625
bzero_new offset: 1, size 64, time 0.0390625
bzero_new offset: 1, size 128, time 0.078125
bzero_new offset: 1, size 256, time 0.0859375
bzero_new offset: 1, size 512, time 0.09375
bzero_new offset: 1, size 1024, time 0.09375
bzero_new offset: 1, size 2048, time 0.140625
bzero_new offset: 1, size 4096, time 0.164062

The results are for 1024^2 bzero calls on buffers of various sizes. The buffer pointer was aligned to the cache line size and then moved by an offset (here 1, which is the worst-case scenario). bzero_old is the previous implementation, bzero_c is the C implementation taken from PowerPC, and bzero_new is the new implementation.
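
A measurement of this kind could be set up roughly as follows. This is a sketch, not the harness used for the numbers above: `bench_run`, `bench_pointer`, the 64-byte `BENCH_LINE` and the use of `clock()` are all assumptions, and the unit of the reported times is not stated in this revision.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

#define BENCH_LINE 64u   /* assumed cache line size */
#define BENCH_MAX  4096u /* largest size in the table */

typedef void (*zero_fn)(void *, size_t);

/* Room for the largest buffer plus alignment slack and the offset. */
static unsigned char bench_storage[BENCH_MAX + 2 * BENCH_LINE];

/* Cache-line-aligned address inside bench_storage, moved by 'offset' bytes. */
static unsigned char *bench_pointer(size_t offset)
{
    uintptr_t base = ((uintptr_t)bench_storage + BENCH_LINE - 1) &
        ~(uintptr_t)(BENCH_LINE - 1);

    return (unsigned char *)base + offset;
}

/* Call fn(p, size) 'calls' times and return the CPU time spent, in seconds. */
static double bench_run(zero_fn fn, size_t offset, size_t size,
    unsigned long calls)
{
    unsigned char *p;
    clock_t start;

    assert(offset < BENCH_LINE && size <= BENCH_MAX);
    p = bench_pointer(offset);
    start = clock();
    for (unsigned long i = 0; i < calls; i++)
        fn(p, size);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

/* Reference routine to time; a real run would pass each bzero variant. */
static void bench_memset_zero(void *p, size_t n)
{
    memset(p, 0, n);
}
```

A call such as `bench_run(bench_memset_zero, 1, 4096, 1024UL * 1024UL)` would correspond to one row of the table, with offset 1 forcing the misaligned start.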

Diff Detail

Repository

rS FreeBSD src repository - subversion

Lint

Lint Not Applicable

Unit

Tests Not Applicable

Event Timeline

der_semihalf.com updated this revision to Diff 14570. Mar 24 2016, 11:29 AM

der_semihalf.com retitled this revision from to arm64: bzero has been optimized.

der_semihalf.com updated this object.

der_semihalf.com edited the test plan for this revision.

der_semihalf.com added reviewers: kib, wma, zbb, andrew.

der_semihalf.com set the repository for this revision to rS FreeBSD src repository - subversion.

der_semihalf.com added a project: arm64.

Herald added subscribers: emaste, imp. Mar 24 2016, 11:29 AM

wma added inline comments. Mar 25 2016, 6:48 AM

sys/arm64/arm64/bzero.S
43 ↗	(On Diff #14570)	Suggest: if size is <= 8 goto lead_out and skip the rest.
45 ↗	(On Diff #14570)	-> zero
66 ↗	(On Diff #14570)	Can we simplify this part? I don't like using mul and div in performance critical code. Let's assume:
	mrs x7, DCZID_EL0
	and x7, x7, 0x0f
	add x7, #2 /* x7 is log2 number of bytes in cache line */
	lsl x2, #1, x7
	sub x2, x2, #1 /* cacheline bitmask */
	sub x4, x7, x0
	and x4, x4, x2
	sub x5, x1, x4
	lsr x5, x5, x7
	add x6, x2, #1 /* cacheline size */
	lsl x6, x6, x5
	sub x6, x1, x6
	sub x6, x6, x4
	<lead_in>
	<dc zero>
	<lead_out>
161 ↗	(On Diff #14570)	We can move cacheline_zero to a separate function and use it directly in page_zero instead of calling bzero there. That should give us a performance increase because we skip all alignment checks in a place where alignment is guaranteed.
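
The shift-and-mask computation suggested for line 66 above can be illustrated in C. This is a sketch of the idea rather than a transcription of the proposed assembly: `zva_log2_bytes`, `zva_plan` and `struct zva_split` are hypothetical names, and the DCZID_EL0 value is passed in as a plain integer so the arithmetic can be checked on any machine. The low four bits of DCZID_EL0 give log2 of the block size in 4-byte words, which is why the suggestion adds 2 to get log2 of the size in bytes.

```c
#include <stddef.h>
#include <stdint.h>

struct zva_split {
    size_t lead_in;  /* bytes before the first aligned block */
    size_t blocks;   /* whole blocks that 'dc zva' can clear */
    size_t lead_out; /* bytes after the last whole block */
};

/* DCZID_EL0[3:0] is log2(block size in words); +2 converts to bytes. */
static unsigned zva_log2_bytes(uint64_t dczid)
{
    return (unsigned)(dczid & 0xf) + 2;
}

/* Split [addr, addr + len) using only shifts, masks and subtraction. */
static struct zva_split zva_plan(uintptr_t addr, size_t len, uint64_t dczid)
{
    unsigned shift = zva_log2_bytes(dczid);
    size_t mask = ((size_t)1 << shift) - 1; /* block bitmask */
    struct zva_split s;

    s.lead_in = (size_t)(0 - addr) & mask;  /* distance to next boundary */
    if (s.lead_in > len)
        s.lead_in = len;
    s.blocks = (len - s.lead_in) >> shift;
    s.lead_out = len - s.lead_in - (s.blocks << shift);
    return s;
}
```

For a 64-byte block (a register value of 4) and a 4096-byte buffer starting one byte past a boundary, this yields 63 lead-in bytes, 63 whole blocks and 1 lead-out byte.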

wma added inline comments. Mar 25 2016, 7:11 AM

sys/conf/files.arm64
38 ↗	(On Diff #14570)	Alphabetical order.

andrew added inline comments. Mar 25 2016, 8:32 AM

sys/arm64/arm64/bzero.S
42 ↗	(On Diff #14570)	Do we check if the pointer is NULL on other architectures?
49–56 ↗	(On Diff #14570)	You should read the size once in cache_setup(), then use the size read there in bzero.

kib added inline comments. Mar 25 2016, 9:48 AM

sys/arm64/arm64/bzero.S
42 ↗	(On Diff #14570)	No, and such check is IMO wrong in principle.

emaste added inline comments. Mar 25 2016, 11:46 AM

sys/arm64/arm64/bzero.S
2 ↗	(On Diff #14570)	Extra '*'

Review comments have been applied, with the following differences:

the code jumps directly to "normal" for buffers with size <= 16 rather than the 8 suggested by wma
the function has not been split because an optimized pagezero is in progress; it is much simpler than bzero

der_semihalf.com added a child revision: D5741: arm64: pagezero improvement. Mar 27 2016, 6:13 PM

Do you have any more comments regarding the latest changes?
I would like to move forward with this and need to know whether you are OK with the solution or want some changes to it.

I would fix the comment style before committing.
For example, this:

/* Buffer must be larger than cache line for using cache zeroing
         * (and cache line aligned but this is checked after jump) */

should become:

/*
 * Buffer must be larger than cache line for using cache zeroing
 * (and cache line aligned but this is checked after jump)
 */

In D5726#123491, @der_semihalf.com wrote:

Do you have any more comments regarding the latest changes?
I would like to move forward with this and need to know whether you are OK with the solution or want some changes to it.

Initial NULL check is still there.

In D5726#123719, @kib wrote:

Initial NULL check is still there.

I am only checking whether the size is 0, to avoid any work. I have to do it either here or before "normal"; otherwise "normal" will attempt at least one loop iteration, since the check is at the end of the loop, and the counter will then overflow and keep running.

In D5726#123730, @der_semihalf.com wrote:

I am only checking whether the size is 0, to avoid any work. I have to do it either here or before "normal"; otherwise "normal" will attempt at least one loop iteration, since the check is at the end of the loop, and the counter will then overflow and keep running.

I misread the code, sorry.
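
The zero-size guard discussed in this exchange can be illustrated with a small C sketch. It assumes, as the author describes, a loop whose exit test sits at the bottom; `guarded_zero` is a hypothetical name and the function only models the control flow, not the actual assembly.

```c
#include <stddef.h>

/*
 * Bottom-tested byte loop: the body always runs once before the
 * remaining count is checked.  Returns the number of stores made.
 */
static size_t guarded_zero(unsigned char *p, size_t len)
{
    size_t stores = 0;

    if (len == 0) /* the guard under discussion: size, not pointer */
        return 0;
    do {
        *p++ = 0;
        stores++;
        len--; /* without the guard, 0 would wrap to SIZE_MAX here */
    } while (len != 0);
    return stores;
}
```

Without the early return, a call with `len == 0` would store one byte, decrement the count past zero, and keep going until the wrapped counter finally reached zero again.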

andrew added inline comments. Mar 31 2016, 8:07 AM

sys/arm64/arm64/bzero.S
44–45 ↗	(On Diff #14607)	From `style(9)`:
	/*
	 * Multi-line comments look like this.  Make them real sentences.  Fill
	 * them so they look like real paragraphs.
	 */
48 ↗	(On Diff #14607)	Why?
63–64 ↗	(On Diff #14607)	I'm not sure how to parse this sentence.

Reviewers' comments have been applied.

Is there anything more to fix?

sys/arm64/arm64/bzero.S
48 ↗	(On Diff #14607)	"normal" is used for three things: buffers shorter than 16 bytes, aligning the address before using dc zva (stp), and filling in what is left after dc zva (or the last possible stp). x5 is calculated at line 85 for buffers longer than 16 bytes and may of course be 0 after the calculation (if the size is not enough to use dc zva). At line 175, after "normal" has finished, we check whether x5 is 0. If "normal" ran for a buffer shorter than 16 bytes and x5 had not been set to 0 at line 48 (before executing line 52), then x5 would hold an arbitrary value once the short buffer was done, and we might attempt to perform "cache_line_zero".

kib added inline comments. Apr 1 2016, 10:34 AM

sys/arm64/arm64/bzero.S
48 ↗	(On Diff #14607)	Please convert this to a comment and add to the source code.

An explanation of why x5 is zeroed before jumping to "normal" for buffers of size <= 16 has been added.
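
The reason for the zeroing can be shown with a toy model in C. This is a simplified illustration only: `model_enters_line_zero` is a hypothetical name, the 64-byte `MODEL_LINE` is an assumed cache line size, and the computation of x5 ignores the alignment lead-in that the real code accounts for.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_LINE 64u /* assumed cache line size */

/*
 * Returns true when the check that follows "normal" would branch to
 * cache-line zeroing.  stale_x5 models whatever value happened to be
 * in x5 when bzero was entered.
 */
static bool model_enters_line_zero(size_t len, uint64_t stale_x5,
    bool clear_x5)
{
    uint64_t x5 = stale_x5;

    if (len <= 16) {
        if (clear_x5)
            x5 = 0;            /* the store being documented */
    } else {
        x5 = len / MODEL_LINE; /* simplified line count */
    }
    /* "normal" runs here and leaves x5 untouched. */
    return x5 != 0;
}
```

With the clearing store in place a short buffer never reaches the cache-line path; without it, the outcome depends on the leftover register value.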

kib accepted this revision. Apr 1 2016, 11:54 AM

kib edited edge metadata.

This revision is now accepted and ready to land. Apr 1 2016, 11:54 AM

Closed by commit rS297536: arm64: bzero optimization (authored by wma). Apr 4 2016, 7:06 AM

This revision was automatically updated to reflect the committed changes.

Diff 14853

head/sys/arm64/arm64/machdep.c

head/sys/arm64/include/armreg.h

head/sys/conf/files.arm64
