James Cloos cloos@jhcloos.com writes:
> To give an example of how high, openssh's C implementation of chacha20 with poly1305 is faster than openssl's non-aesni amd64 assembly for aes128-gcm, and both significantly outperform ssh's use of openssl's aes128-ctr or -cbc assembly with openssh's umac-64.
Benchmarking nettle's implementations on my office machine (Core i5), I get:
  algorithm  cycles/byte
  salsa20      5.3
  aes128      11
  aes128      22     (openssl)
  arcfour      7.5
  arcfour      3.75  (openssl)
(For aes, I'm surprised by the big difference from openssl. Nettle's aes assembly is pretty basic, and on this machine it seems to give only a very marginal improvement over the C implementation, which runs at 12 cycles/byte. Maybe something is fishy with the ubuntu openssl package, or there's some problem with my benchmarking.)
Anyway, getting back to chacha, it will be interesting to see how much faster chacha is than salsa20.
If I remember the chacha changes correctly, one gets rid of a permutation of the matrix, and I think some of the rotations in the round function (done as movaps, pslld, psrld, pxor) can be replaced by a single byte shuffle (pshufb, which needs SSSE3; pshufd only moves whole 32-bit words). I think that can reduce the instruction count for the round function by 25-50%, depending on how many of the rotations can be replaced (there ought to be at least one rotation left with a rotation count which isn't a multiple of 8).
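To make the rotation argument concrete, here is a minimal C sketch of the chacha quarter round. The rotation counts 16, 12, 8, 7 come from the chacha specification, not from anything stated in this thread, and the function names are made up for illustration (they are not nettle's). The rotations by 16 and 8 move whole bytes, so those are the candidates for a shuffle instruction; rotl8_bytes spells the rotation by 8 out as a plain byte permutation to show that. The rotations by 12 and 7 still need the shift, shift, combine sequence.

```c
#include <stdint.h>

/* Rotate a 32-bit word left by n bits (0 < n < 32). */
static uint32_t rotl32(uint32_t x, unsigned n)
{
  return (x << n) | (x >> (32 - n));
}

/* Same result as rotl32(x, 8), written as a pure byte permutation.
   Illustrative only: this is the per-word effect a SIMD byte
   shuffle would have on each 32-bit lane. */
static uint32_t rotl8_bytes(uint32_t x)
{
  uint32_t b0 = x & 0xff;
  uint32_t b1 = (x >> 8) & 0xff;
  uint32_t b2 = (x >> 16) & 0xff;
  uint32_t b3 = x >> 24;
  return (b2 << 24) | (b1 << 16) | (b0 << 8) | b3;
}

/* One chacha quarter round on four state words (hypothetical name). */
static void chacha_quarter_round(uint32_t *a, uint32_t *b,
                                 uint32_t *c, uint32_t *d)
{
  *a += *b; *d ^= *a; *d = rotl32(*d, 16);  /* multiple of 8 */
  *c += *d; *b ^= *c; *b = rotl32(*b, 12);  /* needs shifts */
  *a += *b; *d ^= *a; *d = rotl8_bytes(*d); /* multiple of 8 */
  *c += *d; *b ^= *c; *b = rotl32(*b, 7);   /* needs shifts */
}
```

For comparison, salsa20 rotates by 7, 9, 13 and 18, none of which is a multiple of 8, so no such shortcut is available there.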
> like gcm, safer than most current usage of separate macs.
Are you saying that chacha + poly1305 is not used in the obvious way as a stream cipher + a separate mac? Care to elaborate?
Regards, /Niels