Differential D5146
kcrypto_aes: Use separate sessions for AES and SHA1
Authored by cem on Jan 30 2016, 10:59 PM.

Summary:
Some hardware supports AES acceleration but not SHA1, e.g., AES-NI
extensions. It is useful to have accelerated AES even if SHA1 must be
done in software.

Suggested by: asomers
Sponsored by: EMC / Isilon Storage Division
Event Timeline
Comment: Looks good to me.
Comment: I tested this with a dual Xeon E5-2620 system as my server and a dual Xeon E5-2643 as my client. Both systems support AES-NI. My test was a simple dd of a 3.6 GB file full of zeros. I set vfs.nfsd.async=1 and used a 5-disk RAIDZ1 pool of 7200 RPM SAS drives.
In conclusion, the patch gave a serious speed improvement for krb5p, the only security level that uses AES on the whole payload. However, krb5i, which does a SHA1 of the entire RPC payload, is still slow. "openssl speed -evp sha1" gives 363 MBps on the server, suggesting that there's still room for improvement. But that's a problem for another day.

Comment: I get widely varying openssl SHA1 results on my Ivy Bridge laptop depending on block size. Any idea what read/write block size your NFS mount is using? So, krb5i isn't going to be any better than krb5. In your results, writes are a little worse than vanilla krb5, but reads are a *lot* worse. I wonder why that is. We seem to use opencrypto/cryptosoft.c -> opencrypto/xform_sha1.c -> crypto/sha1.c. I wonder if openssl has a better SHA1 implementation. It's also possible that openssl (in userspace) is free to use e.g. SSE instructions that mutate floating-point state, while the kernel must make trade-offs around using the floating-point registers (and errs on the side of not using them).

Comment: Dunno, I'm using the defaults. The man page for mount_nfs suggests that rsize and wsize only matter for UDP mounts, which mine is not. But I don't think it will make a big difference. Even at 1 kB blocks, my server gets 321 MBps with openssl.
Comment: SHA1 has two phases: the message scheduling phase and the compressor phase. The scheduling phase can be accelerated with SIMD, and openssl does so. The compressor phase, however, cannot be vectorized, and it takes longer than the scheduling phase. Openssl implements it in assembly, but I doubt their implementation of the compressor phase is much faster than a C implementation. I know from experience that openssl's assembly MD5 function is the same speed as a decent C version, and MD5's compressor is similar to SHA1's. So we could probably get some speedup by using SIMD for our SHA1 message scheduler, but probably no more than a 10% boost. Skylake has instruction extensions for SHA1, but I don't have any Skylake hardware to play with.

Comment: I'll leave that work for another day. :)
Comment: Yeah, me either.