nisse@lysator.liu.se (Niels Möller) writes:
I agree CTR seems more important. I'm guessing that the loop

  for (p = dst, left = length;
       left >= block_size;
       left -= block_size, p += block_size)
    {
      memcpy (p, ctr, block_size);
      INCREMENT(block_size, ctr);
    }

in ctr_crypt contributes quite a few cycles per byte. It would be faster to use an always word-aligned area, and do the copying and incrementing using word operations (with a final byteswap when running on a little-endian platform), and with no intermediate stores.
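To make the idea concrete, here is a minimal sketch of such a word-based counter fill for block size 16. It is an illustration only, not the code on any Nettle branch: the names load_be64, store_be64 and ctr_fill16 are hypothetical, and it assumes GCC or Clang (for __builtin_bswap64 and the __BYTE_ORDER__ macros). The counter is kept in two native 64-bit words for the whole loop and is byteswapped only when written out.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helpers, not part of Nettle's API.  Assumes GCC or
   Clang, which define __BYTE_ORDER__ and __builtin_bswap64. */
static uint64_t
load_be64 (const uint8_t *p)
{
  uint64_t w;
  memcpy (&w, p, sizeof (w));   /* safe for unaligned p */
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
  w = __builtin_bswap64 (w);
#endif
  return w;
}

static void
store_be64 (uint8_t *p, uint64_t w)
{
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
  w = __builtin_bswap64 (w);
#endif
  memcpy (p, &w, sizeof (w));
}

/* Write `blocks` consecutive 16-byte big-endian counter values to dst,
   starting from the value in ctr, and leave ctr advanced by `blocks`. */
static void
ctr_fill16 (uint8_t *ctr, size_t blocks, uint8_t *dst)
{
  uint64_t hi = load_be64 (ctr);
  uint64_t lo = load_be64 (ctr + 8);
  size_t i;

  for (i = 0; i < blocks; i++, dst += 16)
    {
      store_be64 (dst, hi);
      store_be64 (dst + 8, lo);
      /* 128-bit increment: carry into the high word when the low
         word wraps around to zero. */
      if (++lo == 0)
        hi++;
    }

  /* The counter itself is written back only once, after the loop. */
  store_be64 (ctr, hi);
  store_be64 (ctr + 8, lo);
}
```

The buffer filled this way would then be passed through the block cipher and XORed with the input, as in the existing ctr_crypt.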
I've tried this, with special code for block size 16. (Without any assembly, but using __builtin_bswap64). Pushed to the ctr-opt branch. Gives a nice speedup. On my machine:
Nettle-3.4:
Algorithm  mode         Mbyte/s  cycles/byte  cycles/block
aes128     ECB encrypt  1589.75         1.26         20.16
aes128     ECB decrypt  1642.91         1.22         19.50
aes128     CBC encrypt   354.43         5.65         90.41
aes128     CBC decrypt  1519.10         1.32         21.09
aes128     (in-place)   1338.70         1.50         23.94
aes128     CTR           727.24         2.75         44.06
aes128     (in-place)    774.78         2.58         41.36

master branch:
Algorithm  mode         Mbyte/s  cycles/byte  cycles/block
aes128     ECB encrypt  3143.18         0.64         10.19
aes128     ECB decrypt  3159.88         0.63         10.14
aes128     CBC encrypt   351.37         5.70         91.20
aes128     CBC decrypt  2726.47         0.73         11.75
aes128     (in-place)   2131.99         0.94         15.03
aes128     CTR           970.08         2.06         33.03
aes128     (in-place)    796.31         2.51         40.24

ctr-opt branch:
Algorithm  mode         Mbyte/s  cycles/byte  cycles/block
aes128     ECB encrypt  3159.18         0.63         10.14
aes128     ECB decrypt  3159.82         0.63         10.14
aes128     CBC encrypt   351.80         5.69         91.08
aes128     CBC decrypt  2723.80         0.74         11.76
aes128     (in-place)   2156.27         0.93         14.86
aes128     CTR          1778.84         1.13         18.01
aes128     (in-place)   1550.39         1.29         20.67

Which means that aes128-ctr is roughly 2.4 times as fast as in 3.4 (1778.84 vs 727.24 Mbyte/s), and about twice as fast in the in-place case.
If anyone has a big-endian machine handy, additional testing for both correctness and performance would be welcome. (I have access to a few virtual machines with non-x86 architectures, where I can test this before merging to the master branch, but they are not very useful for benchmarking.)
Regards,
/Niels