Re: ChaCha stream cipher for Nettle available

13 Dec 2013


Aloha!
Niels Möller wrote:
...
Benchmarking nettle's implementation on my office machine (core i5),
algorithm       cycles/byte
salsa20         5.3
aes128          11
aes128          22 (openssl)
arcfour         7.5
arcfour         3.75 (openssl)
Side issue: Pretty big difference in performance also for arcfour.
...
Anyway, getting back to chacha, it will be interesting to see how
much faster chacha is than salsa20.
DJB's and some other benchmarks show anything from zero to 30% better
performance. The ChaCha paper states some ideas about the difference in
parallelizability.
...
If I remember the chacha changes correctly, one gets rid of a 
permutation of the matrix, and I think some of the rotations in the 
round function (done as movaps, pslld, psrld, pxor) can be replaced
by a pshufd. I think that can reduce the instruction count for the
round function by 25-50%, depending on how many of the rotations can
be replaced (there ought to be at least one rotation left with a
rotation count which isn't a multiple of 8).
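To illustrate the point about rotation counts: ChaCha's quarter round rotates by 16, 12, 8 and 7 bits, so the 16- and 8-bit rotations are pure byte permutations within each 32-bit word, which is the kind of thing a byte-level shuffle can do in one step. A minimal Python sketch of that idea follows. The helper names are made up for illustration, and this only models the arithmetic; it says nothing about which instruction Nettle's assembly actually uses.

```python
MASK32 = 0xFFFFFFFF

def rotl32(x, n):
    """Generic 32-bit left rotation: two shifts and a combine."""
    return ((x << n) | (x >> (32 - n))) & MASK32

def rotl32_by_bytes(x, n):
    """Left rotation by a multiple of 8 bits, done only by permuting bytes.

    Hypothetical helper: it stands in for a single byte-shuffle
    instruction applied to one 32-bit lane.
    """
    if n % 8 != 0:
        raise ValueError("only rotations by whole bytes are byte permutations")
    k = (n // 8) % 4
    b = x.to_bytes(4, "little")
    # Rotating left by k bytes moves input byte i to position (i + k) % 4.
    return int.from_bytes(bytes(b[(i - k) % 4] for i in range(4)), "little")

# ChaCha rotation counts; only 16 and 8 are whole-byte rotations.
CHACHA_ROTATIONS = (16, 12, 8, 7)
BYTE_ROTATIONS = [n for n in CHACHA_ROTATIONS if n % 8 == 0]
```

The 12- and 7-bit rotations still need the shift/shift/combine sequence, which fits the expectation that at least one rotation is left over.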
The big difference is that each variable in a quarter round (QR) is
updated twice during the QR processing, but the QR is more regular and
can more easily be scheduled with fewer registers active in a given cycle.
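For reference, a sketch of the ChaCha quarter round in Python, written so that the two updates of each variable are visible. The function names are illustrative only; the rotation counts 16, 12, 8, 7 are the ones from the ChaCha specification.

```python
MASK32 = 0xFFFFFFFF

def rotl32(x, n):
    """32-bit left rotation."""
    return ((x << n) | (x >> (32 - n))) & MASK32

def quarter_round(a, b, c, d):
    """One ChaCha quarter round on four 32-bit words.

    Every step is add, xor, rotate, and each of a, b, c, d
    is written exactly twice.
    """
    # First update of a, d, c, b.
    a = (a + b) & MASK32
    d = rotl32(d ^ a, 16)
    c = (c + d) & MASK32
    b = rotl32(b ^ c, 12)
    # Second update of a, d, c, b.
    a = (a + b) & MASK32
    d = rotl32(d ^ a, 8)
    c = (c + d) & MASK32
    b = rotl32(b ^ c, 7)
    return a, b, c, d
```

With the quarter-round test vector from RFC 7539 (section 2.1.1), quarter_round(0x11111111, 0x01020304, 0x9b8d6f43, 0x01234567) gives (0xea2a92f4, 0xcb1cf8ce, 0x4581472e, 0x5881c4bb). Each step depends on the one before it, so a single QR is a serial chain; the parallelism has to come from running several QRs side by side.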
The double round (DR) processing is more regular, which allows easier
parallelism. The tight spot is between QR3 and QR4, where x15 is used in
both. Otherwise it is really the four separate QRs in each half of the DR
that provide the parallelism.
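The dependency structure can be checked with a few lines of Python. This is only a sketch: the numbering QR0..QR3 for the column half and QR4..QR7 for the diagonal half is my assumption (it is the numbering under which the x15 remark holds), and the helper names are made up.

```python
# State word indices touched by each quarter round of a ChaCha double round.
COLUMN_QRS = [(0, 4, 8, 12), (1, 5, 9, 13), (2, 6, 10, 14), (3, 7, 11, 15)]
DIAGONAL_QRS = [(0, 5, 10, 15), (1, 6, 11, 12), (2, 7, 8, 13), (3, 4, 9, 14)]
ALL_QRS = COLUMN_QRS + DIAGONAL_QRS  # assumed numbering: QR0 .. QR7

def shared_words(i, j):
    """State words used by both QRi and QRj."""
    return sorted(set(ALL_QRS[i]) & set(ALL_QRS[j]))

def independent(qrs):
    """True if no state word is touched by more than one QR in the group."""
    seen = set()
    for qr in qrs:
        if seen & set(qr):
            return False
        seen |= set(qr)
    return True

if __name__ == "__main__":
    print("column half independent:  ", independent(COLUMN_QRS))
    print("diagonal half independent:", independent(DIAGONAL_QRS))
    print("words shared by QR3, QR4: ", shared_words(3, 4))
```

Within each half the four QRs touch disjoint words, so they can run in parallel. When the QRs are issued in order QR0..QR7, QR3 is the last one in the column half and QR4 the first in the diagonal half, and they share x15, so QR4 cannot start until QR3 has finished with that word.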
This is why I got a bit curious when you, Niels, stated: "And the
particular change from 12 to 14 might add significant complexity
to an optimized implementations with 4-way unrolling"
If we constrain ourselves to an even number of rounds, I have a hard
time seeing how that would add significant complexity, since we will
still be doing the DR processing the same way. I guess I'm missing
something, but I have spent some time doodling and thinking about the
dependency constraints in ChaCha since I've done a HW implementation:
https://github.com/secworks/swchacha
The current implementation contains only a single QR, but will be
extended with support for 2 and 4 parallel QRs. There is a good paper
[0] on HW implementation of Salsa20 and ChaCha that shows the dependencies
within the QR. Looking at the clock frequency achieved, one can clearly
see where the dependency between QR3 and QR4 kicks in.
Oh, and in that paper Salsa20 is actually neck and neck with or slightly
faster than ChaCha. ;-)
[0] L. Henzen, F. Carbognani, N. Felber, and W. Fichtner. VLSI Hardware
Evaluation of the Stream Ciphers Salsa20 and ChaCha, and the Compression
Function Rumba.
-- 
Med vänlig hälsning, Yours
Joachim Strömbergson - Alltid i harmonisk svängning.
========================================================================
 Joachim Strömbergson          Secworks AB          joachim@secworks.se
========================================================================
