The existing copyin(9) and copyout(9) routines on RISC-V perform only a
simple byte-by-byte copy. Improve their performance by performing
word-sized copies where possible.

Overall approach: for best performance, all loads and stores must occur
on their natural boundaries, i.e. a 64-bit load must occur on a 64-bit
aligned address. Misaligned loads and stores are permitted, but on many
implementations they trap into the SBI to be emulated, which introduces
an even bigger overhead. Therefore, we will perform word-sized loads and
stores only where the addresses allow it.
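
As a small illustration of the "native boundary" rule above, the
following sketch tests whether an address is naturally aligned for an
access of a given size. The helper name is hypothetical and not part of
the kernel; it assumes the access size is a power of two.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a kernel API): returns true if addr lies on
 * the natural boundary for a `size`-byte access. size is assumed to be
 * a power of two (1, 2, 4 or 8), so size - 1 is a mask of the low bits
 * that must be zero.
 */
static bool
is_naturally_aligned(uintptr_t addr, size_t size)
{
	return (addr & (size - 1)) == 0;
}
```

For example, 0x1004 is aligned for a 32-bit access but not for a 64-bit
one, so a 64-bit load from that address would be misaligned.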

If the source and destination addresses are not aligned relative to each
other, we have no choice but to do a byte-by-byte copy of the entire
buffer. Otherwise, we can perform word-sized copies for some or all of
the buffer. In some cases this requires byte copies at the beginning, to
reach word alignment, or at the end, to copy any remainder left over by
the buffer length.
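
The strategy above can be sketched in C as follows. This is a
user-space illustration under stated assumptions, not the actual
assembly routines: copy_aligned_words() and word_t are hypothetical
names, a 64-bit word (RV64) is assumed, and the user-address checks and
fault handling that copyin(9)/copyout(9) need are omitted.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical word type: assumes 64-bit words, as on RV64. */
typedef uint64_t word_t;
#define WORD_MASK (sizeof(word_t) - 1)

/*
 * Hypothetical sketch: copy len bytes from src to dst, using word-sized
 * copies only when src and dst have the same offset within a word.
 */
static void
copy_aligned_words(void *dst, const void *src, size_t len)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	/* Mutually aligned iff the low address bits of src and dst match. */
	if ((((uintptr_t)d ^ (uintptr_t)s) & WORD_MASK) == 0) {
		/* Head: byte copy until s (and therefore d) is word-aligned. */
		while (len > 0 && ((uintptr_t)s & WORD_MASK) != 0) {
			*d++ = *s++;
			len--;
		}
		/*
		 * Body: one word per iteration. The fixed-size memcpy stands
		 * in for a single aligned load/store pair in the assembly
		 * version and avoids C strict-aliasing problems.
		 */
		while (len >= sizeof(word_t)) {
			memcpy(d, s, sizeof(word_t));
			d += sizeof(word_t);
			s += sizeof(word_t);
			len -= sizeof(word_t);
		}
	}
	/* Tail remainder, or the whole buffer if mutually misaligned. */
	while (len > 0) {
		*d++ = *s++;
		len--;
	}
}
```

With src + 3 and dst + 3 on word-aligned buffers, the sketch copies 5
head bytes, then whole words, then any tail bytes; with src + 2 and
dst + 1 the low address bits differ, so it falls back to a byte copy
of the entire buffer.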