-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
Aloha!
I've taken a shot at implementing the ChaCha stream cipher for Nettle. Nettle is a modified version of Salsa20 done by DJB in order to improve both performance (esp on CPUs with support for data parallelism) and somewhat improved performance. ChaCha has been suggested as replacement for RC4 both by Adam Langley from Google and separately by Nikos and Me.
The code in this implementation is heavily based on the Salsa20 implementation in Nettle. The major changes beside name changes are the quarterround schedules, the different state init and the quarterround. This implementation also supports different number of rounds.
There is a pretty simple test program that verifies the functionality for 128 and 256 keys as well as 8, 12 and 20 rounds using the testt vectors in the chacha test vectors draft:
http://tools.ietf.org/html/draft-strombergson-chacha-test-vectors-00
The code for the chacha implementation is available at:
https://github.com/secworks/nettle
The following files comprises the implementation: chacha-core-internal.c chacha-crypt.c chacha-init.c chacha.h
And the test program testsuite/chacha-test.c
(The other files are clones from Nettle to be able to build.)
ChaCha _should_ be a bit faster than Salsa20 and should esp be easier to optimize in asm for modern CPUs. I have however not done any benchmarks nor asm implementation (yet).
Since I'm new as a contributor I don't know how you Niels want to have patches. Please let me know if this looks good and something you want to integrate and if so how.
- -- Med vänlig hälsning, Yours
Joachim Strömbergson - Alltid i harmonisk svängning. ======================================================================== Joachim Strömbergson Secworks AB joachim@secworks.se ========================================================================
Joachim Strömbergson joachim@secworks.se writes:
I've taken a shot at implementing the ChaCha stream cipher for Nettle. Nettle is a modified version of Salsa20 done by DJB in order to improve both performance (esp on CPUs with support for data parallelism) and somewhat improved performance.
Cool! I looked briefly at ChaCha last time I worked on salsa20. I understand it has potential to be a bit faster than salsa20, so it will be interesting to see how that turns out.
This implementation also supports different number of rounds.
Which variants are recommended, or in real use? 20 and 12, just like for salsa20?
ChaCha _should_ be a bit faster than Salsa20 and should esp be easier to optimize in asm for modern CPUs. I have however not done any benchmarks nor asm implementation (yet).
Adding chacha benchmarking in examples/nettle-benchmark should be easy.
And if you like playing with either x86_64 sse2 (ugly) or arm neon (nicer), I think it's a not too difficult exercise to implement chacha based on the salsa20 assembly files (in the x86_64 and arm/neon directories).
Since I'm new as a contributor I don't know how you Niels want to have patches. Please let me know if this looks good and something you want to integrate and if so how.
I'm used to patches on the mailing list (I still feel a bit like a git newbie. I could also pull changes from a repository of yours, but I'd prefer a mailed patch unless I'm confident I want to integrate the work directly with no changes). An ideal patch set for chacha would include
* The implementation, more or less what you have now,
* A const struct nettle_cipher defined in chacha-meta.c, for each important variant (number of rounds and key size)
* A testcase following the conventions of the testsuite/*-c files.
* Integration in examples/nettle-benchmark.c (should be trivial). (Both benchmark and testcode would use the chacha-meta glue).
* Documentation for nettle.texinfo (but maybe that should wait until interface has settled).
* GNU-style ChangeLog entries for each change.
Preferably arranged so that independent changes (C implementation, docs, assembly implementation) can be applied one at a time. This is a wish list, to make integration quick and easy, but you don't have to get everything in order for the contribution to be useful.
I've only had a quick look at the actual code now, but my first impression is that it looks pretty good. I think I'd prefer to not have the number of rounds in the context, though, and instead have separate functions for different variants, possibly calling a common function taking the number of rounds as argument.
Regards, /Niels
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
Aloha!
Niels Möller wrote:
Which variants are recommended, or in real use? 20 and 12, just like for salsa20?
DJB does not state anything about recommended number of rounds but 8 i bare minimum. Both Adam Langley and we use 20 rounds:
http://tools.ietf.org/html/draft-agl-tls-chacha20poly1305-00
http://tools.ietf.org/html/draft-mavrogiannopoulos-chacha-tls-00
I assume that 12 is today the lower limit. But being able to use something in between 12 and 20, or more than 20 is good (imgho)
Adding chacha benchmarking in examples/nettle-benchmark should be easy.
Ok, I'll look at that.
I'm used to patches on the mailing list (I still feel a bit like a git newbie. I could also pull changes from a repository of yours, but I'd prefer a mailed patch unless I'm confident I want to integrate the work directly with no changes). An ideal patch set for chacha would include
I haven't worked that much with Gitorios and esp not the one at Lysator. But Git and es Github supports good mechanism for an owner/integrator such as you to receive merge requests from downstream clones, test them as branches and then decided if to include them as a whole or cherry pick.
I've used Gitorious before but really think Github is a much better service. This might be a stupid question but have you considered to move Nettle to Github? It would I assume get more exposure and probably more contributions. If that is what you want. ;-)
[list of deliverables]
Preferably arranged so that independent changes (C implementation, docs, assembly implementation) can be applied one at a time. This is a wish list, to make integration quick and easy, but you don't have to get everything in order for the contribution to be useful.
I'll try and get them done for you hopefully this week. A lot of peeking at previous paytches in the maillist archive and stealint/borrowing from other code (salsa20) should help me out. If not I'll post on the list.
I've only had a quick look at the actual code now, but my first impression is that it looks pretty good. I think I'd prefer to not have the number of rounds in the context, though, and instead have separate functions for different variants, possibly calling a common function taking the number of rounds as argument.
I can agree that the number of rounds is not really something to keep in the context. But I think that the solution of having the number of rounds fixed and having separate functions for the different versions is pretty ugly and inflexible.
I can see the need for applications to easily and dynamically change from 12 to 14 rounds by simply updating a variable, possibly even a loadable value instead of changing the code forcing a recompile.
The salsa20 implementation does not come with different function calls for 128 and 256 bit keys. Instead the length is given as a parameter. I don't see the number of rounds being that much different.
I suggest that the chacha_crypt function is accepting number of rounds as a parameter. You can then if you want add specific named functions that wrap this function. But the user can always call the base function and change the number of rounds for each message (if needed). Sounds ok?
- -- Med vänlig hälsning, Yours
Joachim Strömbergson - Alltid i harmonisk svängning. ======================================================================== Joachim Strömbergson Secworks AB joachim@secworks.se ========================================================================
Joachim Strömbergson joachim@secworks.se writes:
I haven't worked that much with Gitorios and esp not the one at Lysator. But Git and es Github supports good mechanism for an owner/integrator such as you to receive merge requests from downstream clones, test them as branches and then decided if to include them as a whole or cherry pick.
I've never really used the web stuff of either gitorious or github. If you put up a repo I can pull from, I'll most likely add it as a remote and fetch from it using the git command line tools.
I've used Gitorious before but really think Github is a much better service. This might be a stupid question but have you considered to move Nettle to Github? It would I assume get more exposure and probably more contributions. If that is what you want. ;-)
As a GNU maintainer, I'd like to avoid depending on things which are proprietary software, or software-as-a-service. But as long as github does plain git, I can fetch changes from github repos if the need arises.
I'll try and get them done for you hopefully this week. A lot of peeking at previous paytches in the maillist archive and stealint/borrowing from other code (salsa20) should help me out. If not I'll post on the list.
I look forward to that. I can't promise I'll be as quick with integration, but hopefully I'll get some hacking time during the Christmas break.
I can agree that the number of rounds is not really something to keep in the context. But I think that the solution of having the number of rounds fixed and having separate functions for the different versions is pretty ugly and inflexible.
That's a judgement, of course. In nettle, I don't think it makes sense to add flexibility for the sole purpose of convenient support for obscure algorithms or settings. If a wider range of rounds gets used in practice, we can reconsider this. With salsa 20, with only two variants (12 or 20 rounds) in use, I don't think separate functions gets too ugly.
I can see the need for applications to easily and dynamically change from 12 to 14 rounds by simply updating a variable, possibly even a loadable value instead of changing the code forcing a recompile.
That's not my experience. The typical crypto application has some configuration or protocol handshake to select between aes128-ctr, aes256-ctr, salsa20, salsa20r12, etc. For each choice, it needs some logic to allocate the right context struct and call the right functions (struct nettle_cipher is an example of a minimalistic framework to collect that info).
Then, from the application's point of view, salsa20r12 is as different from salsa20 as it is from aes128-ctr, none can be be substituted for the other except via the algorithm selection procedures, and it really doesn't matter which of the different algorithms are unified below that algorithm-selection framework.
And the particular change from 12 to 14 might add significant complexity to an optimized implementations with 4-way unrolling, so flexibility isn't always as cheap as it looks.
The salsa20 implementation does not come with different function calls for 128 and 256 bit keys. Instead the length is given as a parameter. I don't see the number of rounds being that much different.
The difference is that the key size matters *only* for the set_key function, there's no need to store it in the context struct.
And I think functions for specific would make a lot of sense also for salsa20. In Nettle, passing a key size as argument to the set_key functions used to be the norm, with a few exceptions for single key size algorithms like DES. But I'm now reconsidering that design. Today, I think a set_key function with a key_size parameter is appropriate only for algorithms which support more or less arbitrary keysizes. When only two or three discrete key sizes are specified, I prefer to view them as distict algorithms. In particular, I prefer that view if the keysetup really uses different logic for the different key sizes.
I suggest that the chacha_crypt function is accepting number of rounds as a parameter. You can then if you want add specific named functions that wrap this function.
That's a perfectly valid implementation choice to me. A similar function for sala20 has been discussed earlier. We might need to think a bit about naming. And the wrapper functions are necessary, for nettle_cipher if nothing else.
Regards, /Niels
You wrote:
I can see the need for applications to easily and dynamically change from 12 to 14 rounds by simply updating a variable, possibly even a loadable value instead of changing the code forcing a recompile.
That's not my experience. The typical crypto application has some configuration or protocol handshake to select between aes128-ctr, aes256-ctr, salsa20, salsa20r12, etc. For each choice, it needs some logic to allocate the right context struct and call the right functions (struct nettle_cipher is an example of a minimalistic framework to collect that info).
Then, from the application's point of view, salsa20r12 is as different from salsa20 as it is from aes128-ctr, none can be be substituted for the other except via the algorithm selection procedures, and it really doesn't matter which of the different algorithms are unified below that algorithm-selection framework.
And the particular change from 12 to 14 might add significant complexity to an optimized implementations with 4-way unrolling, so flexibility isn't always as cheap as it looks.
I agree -- please just add 20 rounds ChaCha initially.
I am beginning to think that the introducing the 12 rounds Salsa20 was a really bad/harmful outcome of eSTREAM, as it leads to all these discussions around number of rounds to support, which slows adoption. The speed of 20 rounds Salsa20 and ChaCha is high. If speed is more important than security for someone, I've heard of this nice cipher called ROT13 that we could probably implement very efficiently.
/Simon
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
Aloha!
Simon Josefsson wrote:
I agree -- please just add 20 rounds ChaCha initially.
I've added specific functions for 12 and 20 rounds.
I've heard of this nice cipher called ROT13 that we could probably implement very efficiently.
Now, now, lets try and stay focused on security here - everybody knows that you always should two rounds of ROT13. Please add rot13r2 patch.
- -- Med vänlig hälsning, Yours
Joachim Strömbergson - Alltid i harmonisk svängning. ======================================================================== Joachim Strömbergson Secworks AB joachim@secworks.se ========================================================================
"SJ" == Simon Josefsson simon@josefsson.org writes:
SJ> The speed of 20 rounds Salsa20 and ChaCha is high.
To give an example of how high, openssh's C implementation of chacha20 with poly1305 is faster than openssl's non-aesni amd64 assembly for aes128-gcm, and both significantly outperform ssh's use of openssl's aes128-ctr or -ccb assembly with openssh's umac-64.
On top of that, the sse2 assembly code for chacha20 at:
https://github.com/floodyberry/chacha-opt
is 3-4 times as fast as pure C, and the avx and avx2 assembly is about 50% faster still.
All for a cipher which is inherently easier securely to code than gcm and, like gcm, safer than most current usage of separate macs. (Given the various known attacks on TLS and the widely repeated statements that gcm is hard to code w/o timing leaks, et alia.)
-JimC -- James Cloos cloos@jhcloos.com OpenPGP: 1024D/ED7DAEA6
James Cloos cloos@jhcloos.com writes:
To give an example of how high, openssh's C implementation of chacha20 with poly1305 is faster than openssl's non-aesni amd64 assembly for aes128-gcm, and both significantly outperform ssh's use of openssl's aes128-ctr or -ccb assembly with openssh's umac-64.
Benchmarking nettle's implementation on my office machine (core i5),
algorithm cycles/byte salsa20 5.3 aes128 11 aes128 22 (openssl) arcfour 7.5 arcfour 3.75 (openssl)
(For aes, I'm surprised by the big difference to openssl. Nettle's aes assembly is pretty basic, and on this machine it seems to give a very marginal improvement over the C implementation, which runs at 12 cycles/byte. Maybe something is fishy with the ubuntu openssl package, or there's some problem with my benchmarking).
Anyway, getting back to chacha, it will be interesting to see how much faster chacha is than salsa20.
If I remember the chacha changes correctly, one gets rid of a permutation of the matrix, and I think some of the rotations in the round function (done as movaps, pslld, psrld, pxor) can be replaced by a pshufd. I think that can reduce the instruction count for the round function by 25-50%, depending on how many of the rotations can be replaced (there ought to be at least one rotation left with a rotation count which isn't a multiple of 8).
like gcm, safer than most current usage of separate macs.
Are you saying that chacha + poly1305 is not used in the obvious way as a stream cipher + a separate mac? Care to elaborate?
Regards, /Niels
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
Aloha!
Niels Möller wrote:
Benchmarking nettle's implementation on my office machine (core i5),
algorithm cycles/byte salsa20 5.3 aes128 11 aes128 22 (openssl) arcfour 7.5 arcfour 3.75 (openssl)
Side issue: Pretty big difference in performance also for arcfour.
Anyway, getting back to chacha, it will be interesting to see how much faster chacha is than salsa20.
DJB and some other benchmarks shows anything from zero to 30% better performance. The chacha paper states some ideas about the difference in parallelability.
If I remember the chacha changes correctly, one gets rid of a permutation of the matrix, and I think some of the rotations in the round function (done as movaps, pslld, psrld, pxor) can be replaced by a pshufd. I think that can reduce the instruction count for the round function by 25-50%, depending on how many of the rotations can be replaced (there ought to be at least one rotation left with a rotation count which isn't a multiple of 8).
The big difference is that you update the variables in a QR twice during the QR processing, but the QR is more regular and can easily (easier) be scheduled with fewer register active in a given cycle.
The DR processing is more regular to allow easier parallelism. The tight spot is between QR3 and QR4 where x15 is used in both. Otherwise it is really the 4 separate QRs in each half of the DR that provides parallelism.
This is why I got a bit curious when you Niels stated: "And the particular change from 12 to 14 might add significant complexity to an optimized implementations with 4-way unrolling"
If we constrain ourselves to an even number of rounds I have a bit of a problem to see how that would add significant complexity since we still will be doing the DR processing the same way. I guess I'm missing something, but I have spent some time doodling and thinking on the dependency constraints in ChaCha since I've done a HW implementation:
https://github.com/secworks/swchacha
The current implementation does only contain a single QR, but will be extended with support for 2 and 4 parallel QRs. There is a good paper [0] on HW implementation of Salsa20 and ChaCha that shows depencency within the QR. Looking at the clock frequency achieved one can clearly see when the dependency between QR3 and QR4 happens.
Oh, and in that paper Salsa20 is actually neck and neck with or slightly faster than ChaCha. ;-)
[0] L. Henzen, F. Carbognani, N. Felber, and W. Fichtner. VLSI Hardware Evaluation of the Stream Ciphers Salsa20 and ChaCha, and the Compression Function Rumba.
- -- Med vänlig hälsning, Yours
Joachim Strömbergson - Alltid i harmonisk svängning. ======================================================================== Joachim Strömbergson Secworks AB joachim@secworks.se ========================================================================
Joachim Strömbergson joachim@secworks.se writes:
The big difference is that you update the variables in a QR twice during the QR processing, but the QR is more regular and can easily (easier) be scheduled with fewer register active in a given cycle.
Sounds like I have to look closer at the chacha spec to understand the details.
This is why I got a bit curious when you Niels stated: "And the particular change from 12 to 14 might add significant complexity to an optimized implementations with 4-way unrolling"
There was no deep thought behind that comment. It's just that if an assembly loop is unrolled 4 times, it simplifies the code if you can assume that that the number of rounds you need is always divisible by 4.
Now, current salsa20 implementation don't do that, _salsa20_core seems to support any even and non-zero number, for both C, x86_64 and arm neon. And there's no obvious gain in doing more unrolling. Could possibly make more sense for chacha, if each round is shorter in terms of number of instructions, cycles, and dependencies.
Regards, /Niels
Joachim Strömbergson joachim@secworks.se writes:
Niels Möller wrote:
Benchmarking nettle's implementation on my office machine (core i5),
algorithm cycles/byte arcfour 7.5 arcfour 3.75 (openssl)
Side issue: Pretty big difference in performance also for arcfour.
Right, and this time in openssl's favour. I think that speed is quite impressive. I haven't written any arcfour assembly for x86_64, but I have tried earlied for x86 and sparc. It's a very serial loop doing one byte at a time. It's tempting to try to do two bytes at a time, but the easy way gives incorrect results when the i and j indices happen to collide.
One approach I played a bit with was to nevertheless do two bytes at a time, and then add some unlikely condition to detect collisions and fix them. But I couldn't manage to make that fast.
An easier trick is to generate 4 or eight bytes of the keystream at a time, collecting result in a register, so the xoring of the data can be done a word at a time. The sparc implementation does something along those lines, and at least does the data writes as aligned words.
But, I'd rather spend time on making salsa20 (and/or chacha) fast, than optimizing arcfour.
Regards, /Niels
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
Aloha!
Niels Möller wrote:
Right, and this time in openssl's favour. I think that speed is quite impressive. I haven't written any arcfour assembly for x86_64, but I have tried earlied for x86 and sparc. It's a very serial loop doing one byte at a time. It's tempting to try to do two bytes at a time, but the easy way gives incorrect results when the i and j indices happen to collide.
Yes, that is a problem. I've implemented RC4 in HW running at 2 cycles/byte but you end up dealing with collisions (and deep logic in combination with mem/reg lookup for state. And overlapping scheduling. And many memory ports.)
An easier trick is to generate 4 or eight bytes of the keystream at a time, collecting result in a register, so the xoring of the data can be done a word at a time. The sparc implementation does something along those lines, and at least does the data writes as aligned words.
Sounds like the best strategy, there really isn't much parallelism in RC4 and initialization is costly esp if one want to removed bias by throwing awat 256, 512, 768, 1024 etc bytes (depending in which suggested recommendation you want to follow.)
I tried looking at the OpenSSL ASM-code to see if one could to a simple fix to Nettle. Naive I admit.
- -- Med vänlig hälsning, Yours
Joachim Strömbergson - Alltid i harmonisk svängning. ======================================================================== Joachim Strömbergson Secworks AB joachim@secworks.se ========================================================================
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
Aloha!
James Cloos wrote:
Look Niels, it has number of rounds as part of the context! ;-)
(The usage and need is somewhat different though.)
- -- Med vänlig hälsning, Yours
Joachim Strömbergson - Alltid i harmonisk svängning. ======================================================================== Joachim Strömbergson Secworks AB joachim@secworks.se ========================================================================
-----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1
Aloha!
Niels Möller wrote:
- A const struct nettle_cipher defined in chacha-meta.c, for each
important variant (number of rounds and key size)
Can't seem to find a similar meta file for salsa20. I'm looking at the the arcfour-meta and the twofish-meta where the latter seems easier to grokk.
- -- Med vänlig hälsning, Yours
Joachim Strömbergson - Alltid i harmonisk svängning. ======================================================================== Joachim Strömbergson Secworks AB joachim@secworks.se ========================================================================
Joachim Strömbergson joachim@secworks.se writes:
Can't seem to find a similar meta file for salsa20. I'm looking at the the arcfour-meta and the twofish-meta where the latter seems easier to grokk.
Ah, it's hidden in nettle-internal.c (which is used by testcases and benchmark only). That's because the nettle_cipher abstraction lacks a mechanism to set the iv, which is essential for salsa20.
I think we ought to introduce something slightly different than nettle_cipher for stream ciphers.
And if you find the _NETTLE_CIPHER* macros confusing, you don't need to use them.
Regards, /Niels
nettle-bugs@lists.lysator.liu.se