The first of a set of patches.
Use wider loads/stores when an aligned buffer is being copied.
In a simple test:
dd if=/dev/zero of=/dev/null bs=1M count=1024
the throughput jumped from 410MB/s to 3.6GB/s.
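For illustration only, the aligned fast path can be sketched roughly as below.
This is not the code from this patch: the names copy_sketch and word_t are
hypothetical, a 64-bit word is assumed as the "wider" access, and the
may_alias attribute assumes GCC or Clang. Buffers that are not word-aligned
simply fall back to the byte loop.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical word type; may_alias (GCC/Clang) permits accessing
 * arbitrary buffers through it without breaking aliasing rules. */
typedef uint64_t __attribute__((__may_alias__)) word_t;

/* Hypothetical helper: copy len bytes from src to dst (no overlap). */
static void *copy_sketch(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	/* Fast path: both pointers are word-aligned, so move whole words. */
	if ((((uintptr_t)d | (uintptr_t)s) & (sizeof(word_t) - 1)) == 0) {
		word_t *dw = (word_t *)d;
		const word_t *sw = (const word_t *)s;

		for (; len >= sizeof(word_t); len -= sizeof(word_t))
			*dw++ = *sw++;

		d = (unsigned char *)dw;
		s = (const unsigned char *)sw;
	}

	/* Remaining tail bytes, or the whole buffer when unaligned. */
	while (len--)
		*d++ = *s++;

	return dst;
}
```

The byte loop keeps the sketch correct for any alignment; only the aligned
case benefits from the wider accesses.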
TODO:
- better handling of unaligned buffers (WiP)
- implement a similar mechanism for bzero