Currently mergesort() uses ICOPY_*() to copy data in four-byte blocks
instead of one byte at a time. However, this is only possible when both
the size and base arguments are aligned to four bytes.
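
For illustration, the alignment condition described above could be
checked roughly as follows. This is a minimal sketch, not the code in
mergesort(): the helper name can_copy_by_int() is hypothetical, and it
assumes the block size is sizeof(int).

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of mergesort()): returns nonzero when
 * both the element size and the base address are multiples of
 * sizeof(int), i.e. when copying int-sized blocks would be safe.
 */
static int
can_copy_by_int(const void *base, size_t size)
{
	return (size % sizeof(int) == 0 &&
	    (uintptr_t)base % sizeof(int) == 0);
}
```

If either check fails, a block-wise copy of this kind has to fall back
to copying one byte at a time.
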
Use of memcpy() is preferable as 1) it is cleaner and 2) the library will
use SIMD for copying when the hardware supports it. Compared to
ICOPY_*(), SIMD can copy up to 64 bytes per operation. When a SIMD-backed
memcpy() finds that the address is unaligned, it can first copy data up
to the nearest aligned address and then use SIMD operations for faster
transfer. Thus memcpy() can give better performance than mergesort()'s
own implementation.
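
As a rough sketch of the idea, a merge step that relies only on memcpy()
might look like the following. The names merge_runs() and cmp3() are
hypothetical and are not taken from the mergesort() source. The point is
that memcpy() places no alignment requirement on the element size or on
either pointer.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical sketch: merge sorted runs a[0..na) and b[0..nb) of
 * size-byte elements into out, copying with memcpy() only. Ties take
 * the element from a, which keeps the merge stable.
 */
static void
merge_runs(const char *a, size_t na, const char *b, size_t nb,
    char *out, size_t size, int (*cmp)(const void *, const void *))
{
	while (na > 0 && nb > 0) {
		if (cmp(b, a) < 0) {
			memcpy(out, b, size);
			b += size;
			nb--;
		} else {
			memcpy(out, a, size);
			a += size;
			na--;
		}
		out += size;
	}
	/* One run is exhausted; copy what is left of the other in bulk. */
	memcpy(out, a, na * size);
	out += na * size;
	memcpy(out, b, nb * size);
}

/* Example comparator for 3-byte elements, compared bytewise. */
static int
cmp3(const void *x, const void *y)
{
	return (memcmp(x, y, 3));
}
```

With 3-byte elements, as in cmp3(), neither the size nor the addresses
are multiples of four, yet the same copy path is used.
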
This is benchmarked on amd64, where there isn't a SIMD-backed
implementation yet. However, the baseline implementation in assembly
already delivers better performance in unaligned cases, although there
are some performance drops in aligned cases. The benchmark results and
script are available in the Phabricator review. Ideally, more
performance improvements will come when amd64 gets a SIMD implementation
of memcpy().
Signed-off-by: Minsoo Choo <minsoochoo0122@proton.me>