Aloha!
Niels Möller wrote:
Benchmarking nettle's implementation on my office machine (core i5),

  algorithm   cycles/byte
  salsa20      5.3
  aes128      11
  aes128      22    (openssl)
  arcfour      7.5
  arcfour      3.75 (openssl)

Side issue: Pretty big difference in performance also for arcfour.
Anyway, getting back to chacha, it will be interesting to see how much faster chacha is than salsa20.
Benchmarks by DJB and some others show anything from zero to 30% better performance for ChaCha. The ChaCha paper presents some ideas about the difference in parallelizability.
If I remember the chacha changes correctly, one gets rid of a permutation of the matrix, and I think some of the rotations in the round function (done as movaps, pslld, psrld, pxor) can be replaced by a pshufd. I think that can reduce the instruction count for the round function by 25-50%, depending on how many of the rotations can be replaced (there ought to be at least one rotation left with a rotation count which isn't a multiple of 8).
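As a small illustration of the point about rotation counts: the ChaCha quarter round rotates by 16, 12, 8 and 7 bits, so two of the four counts are multiples of 8 and amount to a pure byte permutation within each 32-bit word, which is what would make a shuffle-type instruction applicable. The sketch below only checks that equivalence in Python. The helper names rotl32 and rotl32_bytes are made up for this example, and it says nothing about actual instruction counts or timings.

```python
import struct

# Rotation counts used in the ChaCha quarter round.
CHACHA_ROTATIONS = (16, 12, 8, 7)

def rotl32(x, n):
    """Rotate the 32-bit word x left by n bits using shifts and OR."""
    x &= 0xFFFFFFFF
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF

def rotl32_bytes(x, n):
    """Rotate left by a multiple of 8 bits by permuting bytes only.

    Hypothetical helper for this example; it stands in for a shuffle
    instruction that moves whole bytes without any shifting.
    """
    if n % 8 != 0:
        raise ValueError("byte permutation only works for multiples of 8")
    k = (n // 8) % 4
    b = struct.pack(">I", x & 0xFFFFFFFF)  # big-endian: b[0] is the top byte
    return struct.unpack(">I", b[k:] + b[:k])[0]

# Counts that could be handled as a byte shuffle ...
byte_aligned = [n for n in CHACHA_ROTATIONS if n % 8 == 0]
# ... and counts that still need the shift/shift/combine sequence.
shift_based = [n for n in CHACHA_ROTATIONS if n % 8 != 0]
```

With these definitions byte_aligned holds 16 and 8, and shift_based holds 12 and 7, so at least one shift-based rotation remains, as suspected above.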
The big difference is that each variable in a quarter round (QR) is updated twice during the QR processing, but the QR is more regular and can more easily be scheduled with fewer registers active in a given cycle.
The double round (DR) processing is also more regular, which allows easier parallelism. The tight spot is between QR3 and QR4, where x15 is used in both. Otherwise it is really the four separate QRs in each half of the DR that provide the parallelism.
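To make the word indices concrete, here is a minimal Python sketch of the QR and the DR, assuming the usual 16-word state x0..x15 and numbering the quarter rounds QR0..QR7 in the order they are normally listed (which matches the QR3/QR4 remark). All function and table names are made up for this example, and it is a plain reference-style model, not an optimized or hardware-oriented one.

```python
MASK = 0xFFFFFFFF

def rotl32(x, n):
    """Rotate the 32-bit word x left by n bits."""
    return ((x << n) | (x >> (32 - n))) & MASK

def quarter_round(x, a, b, c, d):
    """ChaCha quarter round on state words x[a], x[b], x[c], x[d], in place.

    Each of the four words is written twice, and every step depends on
    the result of the previous one.
    """
    x[a] = (x[a] + x[b]) & MASK
    x[d] = rotl32(x[d] ^ x[a], 16)
    x[c] = (x[c] + x[d]) & MASK
    x[b] = rotl32(x[b] ^ x[c], 12)
    x[a] = (x[a] + x[b]) & MASK
    x[d] = rotl32(x[d] ^ x[a], 8)
    x[c] = (x[c] + x[d]) & MASK
    x[b] = rotl32(x[b] ^ x[c], 7)

# Word indices per quarter round: QR0..QR3 work on the columns,
# QR4..QR7 on the diagonals of the 4x4 state.
QR_INDICES = [
    (0, 4, 8, 12), (1, 5, 9, 13), (2, 6, 10, 14), (3, 7, 11, 15),
    (0, 5, 10, 15), (1, 6, 11, 12), (2, 7, 8, 13), (3, 4, 9, 14),
]

def double_round(x):
    """One DR: four column QRs followed by four diagonal QRs."""
    for a, b, c, d in QR_INDICES:
        quarter_round(x, a, b, c, d)

def permute(x, rounds):
    """Apply an even number of rounds as rounds // 2 double rounds."""
    if rounds % 2 != 0:
        raise ValueError("this sketch only handles an even number of rounds")
    for _ in range(rounds // 2):
        double_round(x)

def shared_words(i, j):
    """State word indices touched by both QRi and QRj."""
    return sorted(set(QR_INDICES[i]) & set(QR_INDICES[j]))
```

In this model shared_words(3, 4) gives [15], while any two QRs within the same half of the DR share no words at all, so those four can run side by side. The round count only sets how many times the DR loop body runs.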
This is why I got a bit curious when you, Niels, stated: "And the particular change from 12 to 14 might add significant complexity to an optimized implementations with 4-way unrolling"
If we constrain ourselves to an even number of rounds, I have a hard time seeing how that would add significant complexity, since we will still be doing the DR processing the same way. I guess I'm missing something, but I have spent some time doodling and thinking about the dependency constraints in ChaCha since I've done a HW implementation:
https://github.com/secworks/swchacha
The current implementation contains only a single QR, but will be extended with support for 2 and 4 parallel QRs. There is a good paper [0] on HW implementation of Salsa20 and ChaCha that shows the dependencies within the QR. Looking at the clock frequency achieved, one can clearly see when the dependency between QR3 and QR4 kicks in.
Oh, and in that paper Salsa20 is actually neck and neck with or slightly faster than ChaCha. ;-)
[0] L. Henzen, F. Carbognani, N. Felber, and W. Fichtner. VLSI Hardware Evaluation of the Stream Ciphers Salsa20 and ChaCha, and the Compression Function Rumba.
-- 
Med vänlig hälsning, Yours

Joachim Strömbergson - Alltid i harmonisk svängning.
========================================================================
 Joachim Strömbergson          Secworks AB          joachim@secworks.se
========================================================================