The gcm_*_aesni() routines are used when the AVX512 implementation is not
available.
Fix two bugs which manifest when handling operations spanning multiple
segments:
- Avoid an integer underflow when the length of the input is smaller than
  the residual.
- In gcm_decrypt_aesni(), ensure that we begin the operation at the right
  offset into the input and output buffers.
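
For illustration only, here is a minimal sketch of the two conditions
described above. It is not the code changed by this commit. It assumes
the residual is the number of bytes still needed to complete a partially
filled block, and struct seg_state and split_segment() are hypothetical
names invented for the example.

```c
#include <stddef.h>

/*
 * Hypothetical per-operation state: 'res' is assumed to be the number of
 * bytes still needed to complete a partially filled block left over from
 * the previous segment.
 */
struct seg_state {
	size_t res;
};

/*
 * Split a segment of 'len' bytes into a head that completes the pending
 * partial block and a remainder for bulk processing.
 *
 * Returns the offset at which bulk processing must begin in both the
 * input and the output buffer, and stores the bulk length in *bulk_len.
 */
static size_t
split_segment(struct seg_state *st, size_t len, size_t *bulk_len)
{
	/*
	 * Clamp the head to the segment length.  Computing 'len - st->res'
	 * directly would wrap around to a huge size_t value whenever
	 * len < st->res.
	 */
	size_t head = len < st->res ? len : st->res;

	st->res -= head;
	*bulk_len = len - head;	/* safe: head <= len */

	/* The bulk pass starts at in + head and out + head, not at 0. */
	return (head);
}
```

With a residual of 12 and a 5-byte segment, the sketch consumes all 5
bytes in the head, leaves a residual of 7, and reports a bulk length of 0
instead of a wrapped-around value.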