On Tue, 2018-01-09 at 09:17 +0100, Nikos Mavrogiannopoulos wrote:
in ctr_crypt contributes quite a few cycles per byte. It would be faster to use an always word-aligned area, and do the copying and incrementing using word operations (with a final byteswap when running on a little-endian platform), and with no intermediate stores.
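For illustration only, the word-based approach for block size 16 might look roughly like the sketch below. It is not the code on the ctr-opt branch; the names ctr_fill16 and be64_swap are made up here, and it assumes GCC or Clang (for __builtin_bswap64 and the __BYTE_ORDER__ macros).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: convert between big-endian and host order.
   The same function works in both directions. Assumes GCC/Clang. */
static uint64_t
be64_swap(uint64_t x)
{
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
  return __builtin_bswap64(x);
#else
  return x;
#endif
}

/* Hypothetical sketch: write nblocks consecutive 16-byte big-endian
   counter values into the word-aligned buffer dst, starting from the
   value in ctr, and leave ctr advanced by nblocks. The counter is kept
   in two host-order words inside the loop, so there are no per-byte
   increments and no intermediate stores back to ctr. */
static void
ctr_fill16(uint8_t *ctr, size_t nblocks, uint64_t *dst)
{
  uint64_t hi, lo;
  size_t i;

  assert(ctr != NULL && dst != NULL);

  /* Load the (possibly unaligned) counter once. */
  memcpy(&hi, ctr, 8);
  memcpy(&lo, ctr + 8, 8);
  hi = be64_swap(hi);
  lo = be64_swap(lo);

  for (i = 0; i < nblocks; i++)
    {
      dst[2*i] = be64_swap(hi);
      dst[2*i + 1] = be64_swap(lo);
      lo++;
      hi += (lo == 0);  /* propagate carry into the high word */
    }

  /* Store the updated counter back once. */
  hi = be64_swap(hi);
  lo = be64_swap(lo);
  memcpy(ctr, &hi, 8);
  memcpy(ctr + 8, &lo, 8);
}
```

The filled buffer would then be encrypted in place and XORed into the data; that part is left out since it does not change with this approach.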
I've tried this, with special code for block size 16. (Without any assembly, but using __builtin_bswap64). Pushed to the ctr-opt branch. Gives a nice speedup. On my machine:
I also see quite a large speedup for CTR on my x86_64. Note, however, that GCM performance is not affected.
To follow up on this, gcm would get an 8% speedup (on my system) by replacing gcm_crypt() with ctr_crypt(). With that change as is, however, the 32-bit counter is replaced with an "unlimited" counter. Wouldn't introducing an assert on the encrypt and decrypt length be sufficient to share that code?
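As a rough sketch of what such an assert could check: a full-width counter increment and a 32-bit increment give the same result as long as the low 32 bits of the counter block never wrap. The helper name ctr32_fits below is hypothetical, and checking the current counter value (rather than only the length) is my assumption about what the condition would need to be.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 16

/* Hypothetical helper: returns 1 if processing `length` bytes, starting
   from the 16-byte counter block `ctr`, never wraps the big-endian
   32-bit counter held in its last four bytes (including the final
   increment after the last block), and 0 otherwise. */
static int
ctr32_fits(const uint8_t *ctr, size_t length)
{
  uint32_t c = ((uint32_t) ctr[12] << 24)
    | ((uint32_t) ctr[13] << 16)
    | ((uint32_t) ctr[14] << 8)
    | (uint32_t) ctr[15];

  /* Number of blocks, rounding up, written to avoid overflow. */
  uint64_t blocks = (uint64_t) (length / BLOCK_SIZE)
    + (length % BLOCK_SIZE != 0);

  return blocks <= (uint64_t) 0xffffffff - c;
}
```

The shared code path would then be guarded by something like assert(ctr32_fits(counter, length)) before calling ctr_crypt(), where counter stands for whatever buffer holds the current GCM counter block.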
regards, Nikos