On Thu, Jan 4, 2018 at 2:15 PM, Niels Möller nisse@lysator.liu.se wrote:
Nikos Mavrogiannopoulos n.mavrogiannopoulos@gmail.com writes:
If I had to choose between optimizing one of the two, I'd say CTR.
I agree CTR seems more important. I'm guessing that the loop
  for (p = dst, left = length; left >= block_size;
       left -= block_size, p += block_size)
    {
      memcpy (p, ctr, block_size);
      INCREMENT(block_size, ctr);
    }
in ctr_crypt contributes quite a few cycles per byte. It would be faster to use an always word-aligned area, and do the copying and incrementing using word operations (with a final byteswap when running on a little-endian platform), and with no intermediate stores.
It would be a pretty simple routine (maybe we don't even need to go to assembly) if we require that the block size is a multiple of sizeof(unsigned long), and even simpler if we restrict to block size 16. But it gets uglier and less efficient if it needs to support the general case.
Maybe we could have a special case for blocksize 16, and accept that unusual blocksizes will be much slower. Or could we drop support for all but the most relevant block sizes here?
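To make the block-size-16 special case concrete, here is a minimal sketch of a counter-fill routine that keeps the counter in two 64-bit words and only converts to big-endian bytes when storing. The name ctr_fill16 and the helpers read_be64/write_be64 are hypothetical and not existing Nettle functions. The sketch uses uint64_t rather than unsigned long, and relies on the compiler turning the byte-wise helpers into a word load/store plus byteswap, which is an assumption about the toolchain rather than a guarantee.

```c
#include <stddef.h>
#include <stdint.h>

/* Read a 64-bit big-endian value from possibly unaligned memory. */
static uint64_t
read_be64 (const uint8_t *p)
{
  uint64_t x = 0;
  for (int i = 0; i < 8; i++)
    x = (x << 8) | p[i];
  return x;
}

/* Write a 64-bit value as big-endian bytes to possibly unaligned memory. */
static void
write_be64 (uint8_t *p, uint64_t x)
{
  for (int i = 7; i >= 0; i--)
    {
      p[i] = (uint8_t) (x & 0xff);
      x >>= 8;
    }
}

/* Hypothetical helper (not part of Nettle): write `blocks` consecutive
   16-byte big-endian counter values to dst, starting from ctr, and
   leave ctr advanced by `blocks`. */
static void
ctr_fill16 (uint8_t *ctr, size_t blocks, uint8_t *dst)
{
  /* Keep the counter in two native words for the whole loop. */
  uint64_t hi = read_be64 (ctr);
  uint64_t lo = read_be64 (ctr + 8);

  for (size_t i = 0; i < blocks; i++, dst += 16)
    {
      write_be64 (dst, hi);
      write_be64 (dst + 8, lo);
      /* 128-bit increment: carry into the high word when the low
         word wraps around to zero. */
      if (++lo == 0)
        hi++;
    }

  /* Store the updated counter back once, after the loop. */
  write_be64 (ctr, hi);
  write_be64 (ctr + 8, lo);
}
```

A caller would fill dst with counter blocks this way, encrypt them in place with the block cipher, and XOR the result with the input, as the existing loop does. Whether this beats the memcpy/INCREMENT loop in practice would have to be measured.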
I wouldn't expect anyone to use 3DES in CTR mode, but I wouldn't be surprised by it either. What about introducing ctr_crypt128() and having it used by CCM and EAX? (It seems GCM is not using it anyway.)
regards, Nikos