On Tue, 2018-01-09 at 08:29 +0100, Niels Möller wrote:
> nisse@lysator.liu.se (Niels Möller) writes:
> 
> > I agree CTR seems more important. I'm guessing that the loop
> > 
> >   for (p = dst, left = length;
> >        left >= block_size;
> >        left -= block_size, p += block_size)
> >     {
> >       memcpy (p, ctr, block_size);
> >       INCREMENT(block_size, ctr);
> >     }
> > 
> > in ctr_crypt contributes quite a few cycles per byte. It would be
> > faster to use an always word-aligned area, and do the copying and
> > incrementing using word operations (and a final byteswap when running
> > on a little-endian platform), and with no intermediate stores.
> I've tried this, with special code for block size 16. (Without any
> assembly, but using __builtin_bswap64). Pushed to the ctr-opt branch.
> Gives a nice speedup. On my machine:
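For anyone curious, here is a rough sketch of the word-wise counter fill
described above (only my illustration of the quoted idea, not the actual
code on the ctr-opt branch; ctr_fill16_sketch and be64 are made-up names,
and it assumes GCC/clang for __builtin_bswap64 and the byte-order macros):

#include <stdint.h>
#include <string.h>

/* Convert between host order and big-endian; a no-op on big-endian
   hosts (relies on GCC/clang's predefined byte-order macros). */
static uint64_t
be64 (uint64_t x)
{
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
  return __builtin_bswap64 (x);
#else
  return x;
#endif
}

/* Illustrative only: fill dst with `blocks` consecutive 16-byte counter
   values, keeping the counter in two 64-bit words instead of doing a
   per-block memcpy plus byte-wise INCREMENT. */
static void
ctr_fill16_sketch (size_t blocks, uint8_t *ctr, uint8_t *dst)
{
  uint64_t hi, lo;
  size_t i;

  memcpy (&hi, ctr, 8);      /* most significant half (big-endian) */
  memcpy (&lo, ctr + 8, 8);  /* least significant half */
  hi = be64 (hi);
  lo = be64 (lo);

  for (i = 0; i < blocks; i++, dst += 16)
    {
      uint64_t whi = be64 (hi), wlo = be64 (lo);
      memcpy (dst, &whi, 8);
      memcpy (dst + 8, &wlo, 8);

      /* 128-bit increment, carrying from the low word into the high. */
      if (++lo == 0)
        hi++;
    }

  /* Store the updated counter back for the next call. */
  hi = be64 (hi);
  lo = be64 (lo);
  memcpy (ctr, &hi, 8);
  memcpy (ctr + 8, &lo, 8);
}

The point is that the counter stays in registers across the loop, and the
byteswap only happens around the word loads and stores.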
I see quite a large speedup in CTR on my x86_64 too. Note, however, that GCM performance is not affected.
regards,
Nikos