Optimized assembly implementation of memcpy() for the RISC-V architecture.
The implementation has two paths:
- An aligned path, taken when (dst - src) % 8 == 0; this is the faster path
- An unaligned path, taken when (dst - src) % 8 != 0; this is the slower path
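
As a rough illustration of the path selection described above, here is a minimal C model. It is a sketch only, not the assembly in this change: the names memcpy_path() and memcpy_model() are hypothetical, and the byte-wise loop used for the unaligned case is an assumption made for brevity rather than a description of what the real unaligned path does.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum copy_path { PATH_ALIGNED, PATH_UNALIGNED };

/*
 * Hypothetical helper: classify a copy by the condition from the
 * commit message, (dst - src) % 8 == 0.
 */
static enum copy_path
memcpy_path(const void *dst, const void *src)
{
	uintptr_t diff = (uintptr_t)dst - (uintptr_t)src;

	return ((diff & 7) == 0 ? PATH_ALIGNED : PATH_UNALIGNED);
}

/* Hypothetical C model of the two paths; assumes non-overlapping buffers. */
static void *
memcpy_model(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;
	uint64_t word;

	if (memcpy_path(dst, src) == PATH_ALIGNED) {
		/*
		 * Copy head bytes until d is 8-byte aligned.  Because
		 * (dst - src) is a multiple of 8, s is aligned then too.
		 */
		while (len > 0 && ((uintptr_t)d & 7) != 0) {
			*d++ = *s++;
			len--;
		}
		/* Move one 8-byte word per iteration. */
		while (len >= 8) {
			memcpy(&word, s, sizeof(word));
			memcpy(d, &word, sizeof(word));
			d += 8;
			s += 8;
			len -= 8;
		}
	}
	/* Tail bytes, and the whole copy in the unaligned case (assumption). */
	while (len > 0) {
		*d++ = *s++;
		len--;
	}
	return (dst);
}
```

Note that the condition depends only on the distance between the two pointers, so a copy from buf + 3 to buf + 67 still takes the aligned path even though neither pointer is itself 8-byte aligned.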

os: FreeBSD
arch: riscv
            │ memcpy_baseline │            memcpy_scalar            │
            │     sec/op      │   sec/op     vs base                │
64Align8          851.6µ ± 1%   488.9µ ± 1%  -42.59% (p=0.000 n=12)
4kAlign8          681.5µ ± 1%   255.1µ ± 2%  -62.57% (p=0.000 n=12)
256kAlign8        273.0µ ± 2%   230.7µ ± 2%  -15.50% (p=0.000 n=12)
16mAlign8         98.07m ± 0%   95.29m ± 0%   -2.84% (p=0.000 n=12)
64UAlign          887.5µ ± 1%   531.6µ ± 1%  -40.10% (p=0.000 n=12)
4kUAlign          725.6µ ± 1%   262.2µ ± 1%  -63.87% (p=0.000 n=12)
256kUAlign        844.1µ ± 2%   322.8µ ± 0%  -61.76% (p=0.000 n=12)
16mUAlign         134.9m ± 0%   101.2m ± 0%  -24.97% (p=0.000 n=20)
geomean           2.410m        1.371m       -43.12%

            │ memcpy_baseline │             memcpy_scalar             │
            │      MiB/s      │    MiB/s     vs base                  │
64Align8           293.6 ± 1%    511.3 ± 1%   +74.18% (p=0.000 n=12)
4kAlign8           366.8 ± 1%    980.0 ± 2%  +167.15% (p=0.000 n=12)
256kAlign8         915.8 ± 2%   1083.7 ± 2%   +18.34% (p=0.000 n=12)
16mAlign8          163.1 ± 0%    167.9 ± 0%    +2.92% (p=0.000 n=12)
64UAlign           281.7 ± 1%    470.3 ± 1%   +66.94% (p=0.000 n=12)
4kUAlign           344.5 ± 1%    953.6 ± 1%  +176.77% (p=0.000 n=12)
256kUAlign         296.2 ± 2%    774.5 ± 0%  +161.49% (p=0.000 n=12)
16mUAlign          118.6 ± 0%    158.1 ± 0%   +33.28% (p=0.000 n=20)
geomean            293.4         515.8        +75.81%