"Daniel P. Berrange" berrange@redhat.com writes:
I wrote a crude/simple test program to compare the performance of AES-128-CBC across openssl, gcrypt, nettle and gnutls, and was surprised to find that nettle is consistently ~25% slower than the other libraries for its AESNI implementation.
I've now pushed new aesni code to the master-updates branch. It reads all subkeys into registers upfront, and unrolls the round loop. This brings a great speedup when calling the aes functions with many blocks at a time, but makes little difference when doing only one block at a time. Results for aes128, when benchmarking on my machine (Intel Broadwell):
ECB encrypt and decrypt: About 90% speedup, from 1.25 cycles/byte to 0.65, about the same as openssl, or even *slightly* faster.
CBC encrypt: No significant change, about 5.7 cycles/byte. CBC decrypt: About 60% speedup, from 1.5 cycles/byte down to 0.93.
CTR mode: No significant change, about 2.5 cycles/byte.
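For reference, the percentages above are throughput gains, i.e. old cost divided by new cost, minus one. A minimal sketch of that arithmetic (the function name is made up for illustration; the inputs are the cycles/byte figures quoted above):

```python
def speedup_percent(old_cpb: float, new_cpb: float) -> float:
    """Throughput gain in percent when the cost drops from
    old_cpb to new_cpb cycles per byte."""
    return (old_cpb / new_cpb - 1.0) * 100.0


# aes128 figures from the measurements above.
ecb = speedup_percent(1.25, 0.65)      # a bit over 92, i.e. "about 90%"
cbc_dec = speedup_percent(1.5, 0.93)   # a bit over 61, i.e. "about 60%"
print(f"ECB: {ecb:.0f}%  CBC decrypt: {cbc_dec:.0f}%")
```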
I think it's reasonable to speed up CTR mode by passing more blocks per call to the encryption function (currently it does 4 blocks at a time), and maybe by some more efficient routine to generate the counter input.
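To illustrate the idea of handing more counter blocks to the cipher per call, here is a rough sketch in Python, not the nettle code. It assumes the counter is the whole 16-byte block treated as a big-endian integer, and `encrypt_blocks` is a hypothetical callable that ECB-encrypts any whole number of blocks in one call; `batch_blocks` is likewise an illustrative parameter.

```python
BLOCK_SIZE = 16
MASK = (1 << (8 * BLOCK_SIZE)) - 1


def counter_blocks(start: int, nblocks: int) -> bytes:
    """Return nblocks consecutive big-endian counter blocks, wrapping at 2^128."""
    return b"".join(
        ((start + i) & MASK).to_bytes(BLOCK_SIZE, "big") for i in range(nblocks)
    )


def ctr_crypt(encrypt_blocks, counter: bytes, data: bytes, batch_blocks: int = 16) -> bytes:
    """CTR mode that generates batch_blocks counter blocks per cipher call.

    encrypt_blocks is an assumed interface: bytes in, same number of bytes out,
    each 16-byte block encrypted independently.
    """
    ctr = int.from_bytes(counter, "big")
    out = bytearray()
    step = batch_blocks * BLOCK_SIZE
    for off in range(0, len(data), step):
        chunk = data[off:off + step]
        n = -(-len(chunk) // BLOCK_SIZE)          # blocks needed, rounded up
        keystream = encrypt_blocks(counter_blocks(ctr, n))
        out += bytes(a ^ b for a, b in zip(chunk, keystream))
        ctr = (ctr + n) & MASK
    return bytes(out)
```

The output does not depend on `batch_blocks`; a larger batch only means fewer, bigger calls to the block function, which is where the unrolled code with subkeys held in registers pays off.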
Improving CBC would need some structural and possibly ugly changes.
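The gap between CBC encrypt and CBC decrypt is consistent with the data dependency in the mode: encrypting block i needs ciphertext block i-1, while decryption only needs ciphertext blocks that are all available upfront. A small sketch of that structure, using a toy invertible block function as a stand-in (it is not AES, and all names here are illustrative):

```python
BLOCK_SIZE = 16


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def toy_encrypt(block: bytes) -> bytes:
    """Toy invertible block function standing in for the cipher (NOT AES)."""
    return bytes((b + 1) % 256 for b in block)[::-1]


def toy_decrypt(block: bytes) -> bytes:
    return bytes((b - 1) % 256 for b in block[::-1])


def cbc_encrypt(iv: bytes, plaintext: bytes) -> bytes:
    prev, out = iv, []
    for i in range(0, len(plaintext), BLOCK_SIZE):
        # Serial chain: this call cannot start until prev is known.
        prev = toy_encrypt(xor(plaintext[i:i + BLOCK_SIZE], prev))
        out.append(prev)
    return b"".join(out)


def cbc_decrypt(iv: bytes, ciphertext: bytes) -> bytes:
    blocks = [ciphertext[i:i + BLOCK_SIZE]
              for i in range(0, len(ciphertext), BLOCK_SIZE)]
    # Independent calls: these could be issued many blocks at a time.
    decrypted = [toy_decrypt(b) for b in blocks]
    return b"".join(xor(d, p) for d, p in zip(decrypted, [iv] + blocks[:-1]))
```

Within a single message, the encrypt side always presents one block at a time to the cipher, so it sees little benefit from code tuned for many blocks per call.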
For now, I don't have separate assembly functions for aes128, aes192 and aes256, and I've tried to organize it so that aes128 gets the least penalty for this generality. See https://git.lysator.liu.se/nettle/nettle/blob/master-updates/x86_64/aesni/ae...
I wonder if there are any chips that can execute two independent aesenc instructions in parallel? If so, it would be pretty straightforward to do two blocks at a time in parallel, doubling the speed for aes128 and aes192 (for aes256, we don't have enough registers for all 15 subkeys and two blocks of data).
Regards, /Niels