This optimization attempts to use the widest possible register store instructions to zero large buffers.
Where possible, the implementation uses 'dc zva' to zero the buffer by cache lines.
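
As a rough illustration of the 'dc zva' approach, here is a minimal C sketch. It is not the code from this change (which is written in assembly); the names zva_block_size and sketch_bzero are hypothetical. It assumes DCZID_EL0 can be read from user space, and it falls back to memset for the unaligned head, the tail, small buffers, and non-AArch64 builds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Return the DC ZVA block size in bytes, or 0 if DC ZVA cannot be used. */
static size_t zva_block_size(void)
{
#if defined(__aarch64__)
    uint64_t dczid;
    __asm__ volatile("mrs %0, dczid_el0" : "=r"(dczid));
    if (dczid & 0x10)                   /* DZP bit set: DC ZVA is prohibited */
        return 0;
    return (size_t)4 << (dczid & 0xf);  /* BS field is log2 of the size in words */
#else
    return 0;                           /* not AArch64: no DC ZVA */
#endif
}

/* Hypothetical bzero-like routine: zero whole blocks with DC ZVA when possible. */
static void sketch_bzero(void *buf, size_t len)
{
    unsigned char *p = buf;
    size_t block = zva_block_size();

    if (block != 0 && len >= 2 * block) {
        assert((block & (block - 1)) == 0);     /* block size is a power of two */

        /* Zero the unaligned head up to the next block boundary. */
        size_t head = (size_t)(-(uintptr_t)p & (block - 1));
        memset(p, 0, head);
        p += head;
        len -= head;
#if defined(__aarch64__)
        /* Zero one full, block-aligned chunk per instruction. */
        while (len >= block) {
            __asm__ volatile("dc zva, %0" : : "r"(p) : "memory");
            p += block;
            len -= block;
        }
#endif
    }
    memset(p, 0, len);                  /* tail, or the whole buffer on fallback */
}
```

The 2 * block threshold is an arbitrary choice for this sketch; it only guarantees that at least one full aligned block exists after the head has been zeroed.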
Test results on Thunder:
bzero_old offset: 1, size 1, time 0.015625
bzero_old offset: 1, size 2, time 0.015625
bzero_old offset: 1, size 4, time 0.03125
bzero_old offset: 1, size 8, time 0.03125
bzero_old offset: 1, size 16, time 0.0546875
bzero_old offset: 1, size 32, time 0.09375
bzero_old offset: 1, size 64, time 0.164062
bzero_old offset: 1, size 128, time 0.3125
bzero_old offset: 1, size 256, time 0.617188
bzero_old offset: 1, size 512, time 1.21094
bzero_old offset: 1, size 1024, time 2.40625
bzero_old offset: 1, size 2048, time 4.80469
bzero_old offset: 1, size 4096, time 9.58594
bzero_c offset: 1, size 1, time 0.0546875
bzero_c offset: 1, size 2, time 0.0703125
bzero_c offset: 1, size 4, time 0.09375
bzero_c offset: 1, size 8, time 0.148438
bzero_c offset: 1, size 16, time 0.171875
bzero_c offset: 1, size 32, time 0.179688
bzero_c offset: 1, size 64, time 0.21875
bzero_c offset: 1, size 128, time 0.25
bzero_c offset: 1, size 256, time 0.296875
bzero_c offset: 1, size 512, time 0.40625
bzero_c offset: 1, size 1024, time 0.609375
bzero_c offset: 1, size 2048, time 1.01562
bzero_c offset: 1, size 4096, time 1.84375
bzero_new offset: 1, size 1, time 0.0234375
bzero_new offset: 1, size 2, time 0.03125
bzero_new offset: 1, size 4, time 0.03125
bzero_new offset: 1, size 8, time 0.0390625
bzero_new offset: 1, size 16, time 0.03125
bzero_new offset: 1, size 32, time 0.0390625
bzero_new offset: 1, size 64, time 0.0390625
bzero_new offset: 1, size 128, time 0.078125
bzero_new offset: 1, size 256, time 0.0859375
bzero_new offset: 1, size 512, time 0.09375
bzero_new offset: 1, size 1024, time 0.09375
bzero_new offset: 1, size 2048, time 0.140625
bzero_new offset: 1, size 4096, time 0.164062
The times above are for 1024^2 bzero calls on buffers of various sizes. The buffer pointer was aligned to the cache line size and then moved by the given offset (here 1, which is the worst-case scenario). bzero_old is the previous implementation, bzero_c is the C implementation taken from PowerPC, and bzero_new is the new implementation.
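
A measurement loop along these lines could be sketched as below. This is not the harness used to produce the numbers above. The 64-byte CACHE_LINE value, the use of clock(), and the names time_bzero and memset_bzero are all assumptions made for illustration.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define CACHE_LINE 64   /* assumed cache line size; not taken from the test setup */

typedef void (*bzero_fn)(void *, size_t);

/* Stand-in implementation so the sketch can be run anywhere. */
static void memset_bzero(void *p, size_t n)
{
    memset(p, 0, n);
}

/*
 * Time `calls` invocations of `fn` on a buffer that is cache-line aligned
 * and then shifted by `offset` bytes. Returns CPU seconds, or -1.0 on
 * allocation failure.
 */
static double time_bzero(bzero_fn fn, size_t offset, size_t size, long calls)
{
    /* aligned_alloc wants a size that is a multiple of the alignment. */
    size_t total = (size + offset + CACHE_LINE - 1) / CACHE_LINE * CACHE_LINE;
    if (total == 0)
        total = CACHE_LINE;

    unsigned char *base = aligned_alloc(CACHE_LINE, total);
    if (base == NULL)
        return -1.0;
    memset(base, 0xff, total);          /* touch the pages before timing */

    clock_t start = clock();
    for (long i = 0; i < calls; i++)
        fn(base + offset, size);        /* indirect call, so it is not elided */
    clock_t end = clock();

    free(base);
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

For example, time_bzero(memset_bzero, 1, 4096, 1024L * 1024L) would mirror the "offset: 1, size 4096" case, with any candidate implementation passed in place of memset_bzero.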