I realized that I didn't have any use for the round-reduced Salsa20 stream functions. So I'll focus on what I needed instead, which resulted in a much smaller patch. What do you think? Open issues:
* I'm not particular happy about the name SALSA20_CORE_INOUT_SIZE.
* Where could this be documented in the manual? "Miscellaneous functions"? We could also create a new section "Non-cryptographic Hash functions". There is the FNV and CRC hashes that could be implemented under that umbrella.
/Simon
Simon Josefsson simon@josefsson.org writes:
I realized that I didn't have any use for the round-reduced Salsa20 stream functions. So I'll focus on what I needed instead, which resulted in a much smaller patch. What do you think?
Looks reasonable to me.
Open issues:
- I'm not particular happy about the name SALSA20_CORE_INOUT_SIZE.
Just use SALSA20_BLOCK_SIZE. It's the same thing, right?
- Where could this be documented in the manual? "Miscellaneous functions"? We could also create a new section "Non-cryptographic Hash functions".
I think it would make sense to put it in the salsa section, possibly with a reference from the section on hash functions and any other place where it may be relevant. As I understand it, it *is* like a cryptographic function, but with a fixed input size.
What about assembly implementation? Do you think that is desirable? I think it should be fairly easy. Start with x86_64/salsa20-crypt.asm, move the QROUND macro to some other file (and maybe define some additional macros for shared code). And write a new salsa20-core.asm, which can be pretty simple, in particular, it doesn't need the complex code for handling partial blocks.
+void +salsa20_core (unsigned rounds,
uint8_t dst[SALSA20_CORE_INOUT_SIZE],
const uint8_t src[SALSA20_CORE_INOUT_SIZE])
You settled on uint8_t input, rather than uint32_t? IIRC, the corresponding function in the salsa20 paper is defined with uint32_t array as input and an uint8_t array as output.
I guess little endian byte order of the input (matching the byte order for the output) is the only sensible choice? This ought to be clearly documented somewhere.
In nettle I currently don't use array types for function arguments. If you write them as arrays, like you do here, do current compilers make any use that for bounds checking, or do you just think it's clearer for humans?
Regards, /Niels
nisse@lysator.liu.se (Niels Möller) writes:
- I'm not particular happy about the name SALSA20_CORE_INOUT_SIZE.
Just use SALSA20_BLOCK_SIZE. It's the same thing, right?
It is the same value at least. I'll use it.
- Where could this be documented in the manual? "Miscellaneous functions"? We could also create a new section "Non-cryptographic Hash functions".
I think it would make sense to put it in the salsa section, possibly with a reference from the section on hash functions and any other place where it may be relevant. As I understand it, it *is* like a cryptographic function, but with a fixed input size.
No, it is not a cryptographic hash function since it is not collision-resistant. Think of it as CRC or FNV.
What about assembly implementation? Do you think that is desirable? I think it should be fairly easy. Start with x86_64/salsa20-crypt.asm, move the QROUND macro to some other file (and maybe define some additional macros for shared code). And write a new salsa20-core.asm, which can be pretty simple, in particular, it doesn't need the complex code for handling partial blocks.
Right. I considered it, as a way to learn to assembler stuff in Nettle. However, I'm not sure there is a lot to gain, since there is no loop unrolling to speak off. But it could be considered.
+void +salsa20_core (unsigned rounds,
uint8_t dst[SALSA20_CORE_INOUT_SIZE],
const uint8_t src[SALSA20_CORE_INOUT_SIZE])
You settled on uint8_t input, rather than uint32_t? IIRC, the corresponding function in the salsa20 paper is defined with uint32_t array as input and an uint8_t array as output.
I guess little endian byte order of the input (matching the byte order for the output) is the only sensible choice? This ought to be clearly documented somewhere.
Yes. I'm struggling with this aspect. The only canonical resource I have been able to find on the Salsa20 core function is this page:
It says clearly:
The Salsa20 core is a function from 64-byte strings to 64-byte strings: the Salsa20 core reads a 64-byte string x and produces a 64-byte string Salsa20(x).
and
The 64-byte input x to Salsa20 is viewed in little-endian form as 16 words x0, x1, x2, ..., x15 in {0,1,...,2^32-1}. ... producing, in little-endian form, the 64-byte output Salsa20(x).
So little endian and byte input/output is clearly intended.
The page goes on to show code that takes and return uint32_t and does no endian conversion. It says that this is the callers responsibility, so we can assume this is intended to be included in the definition.
However, the Salsa20 stream cipher seems to use the core function with uint32_t inputs, and scrypt does too. So perhaps we should expose a uint32_t->uint32_t interface as well that does no endian conversion. It seems scrypt doesn't little endian convert inputs but expects the output uint32_t to be little endian converted though... I'll bring this up with Colin.
In nettle I currently don't use array types for function arguments. If you write them as arrays, like you do here, do current compilers make any use that for bounds checking, or do you just think it's clearer for humans?
I've always regarded it as something intended for humans. I'm not sure what a compiler could do with the information. Perhaps it could be useful for optimizations though.
/Simon
Simon Josefsson simon@josefsson.org writes:
However, the Salsa20 stream cipher seems to use the core function with uint32_t inputs, and scrypt does too. So perhaps we should expose a uint32_t->uint32_t interface as well that does no endian conversion. It seems scrypt doesn't little endian convert inputs but expects the output uint32_t to be little endian converted though... I'll bring this up with Colin.
Nope, I was mistaken.
Still it seems that an uint32_t->uint32_t interface without endian conversion may be beneficial for speed. Then scrypt won't have to convert from uint32_t to uint8_t just to have salsa convert from uint8_t to uint32_t back again, do the operation and convert it from uint32_t to uint8_t and have scrypt convert it from uint8_t to uint32_t. I haven't measured how much overhead endian conversion really consumes though.
Do you think a uint32_t->uint32_t interface would be acceptable? Externally the uint32 interface could be seen as a convenience function, although internally it would be the core and the uint8_t interface would just do endian conversion.
/Simon
Simon Josefsson simon@josefsson.org writes:
Do you think a uint32_t->uint32_t interface would be acceptable?
If that's what is really needed for scrypt (the only application so far), and it lets you avoid converting back and forth between bytes and uint32_t, then it seems like the right thing to do.
We may add a function with uint8_t (and little endian convention, I guess) later if the need arises. So we should take that possibility into account when naming the function.
Regards, /Niels
nisse@lysator.liu.se (Niels Möller) writes:
Simon Josefsson simon@josefsson.org writes:
Do you think a uint32_t->uint32_t interface would be acceptable?
If that's what is really needed for scrypt (the only application so far), and it lets you avoid converting back and forth between bytes and uint32_t, then it seems like the right thing to do.
We may add a function with uint8_t (and little endian convention, I guess) later if the need arises. So we should take that possibility into account when naming the function.
So you don't think we should implement the uint8_t interface now? I think there is little cost in doing that directly, and allows us to directly use the only test vectors that I'm aware of.
What's a good name? Is there any precedent for something like this in Nettle?
void salsa20_core (unsigned rounds, uint8_t dst[SALSA20_BLOCK_SIZE], const uint8_t src[SALSA20_BLOCK_SIZE])
void salsa20_core32 (unsigned rounds, uint32_t dst[SALSA20_INPUT_LENGTH], const uint32_t src[SALSA20_INPUT_LENGTH])
/Simon
Simon Josefsson simon@josefsson.org writes:
So you don't think we should implement the uint8_t interface now? I think there is little cost in doing that directly, and allows us to directly use the only test vectors that I'm aware of.
I don't have a strong opinion. If it helps testing, and we are confident that it should be little-endian, let's add it.
What's a good name? Is there any precedent for something like this in Nettle?
If we look at internal building blocks, there are a couple of examples: gcm_hash, and sha1_compress (and similar compression functions for other hashes). Neither of those are advertised/documented.
I think it would make sense to call the uint32_t function salsa_core, and the uint8_t function salsa_hash. We'll see if we can sort out what properties it really has, but it definitely has important similarities to a hash function.
Regards, /Niels
nisse@lysator.liu.se (Niels Möller) writes:
Simon Josefsson simon@josefsson.org writes:
So you don't think we should implement the uint8_t interface now? I think there is little cost in doing that directly, and allows us to directly use the only test vectors that I'm aware of.
I don't have a strong opinion. If it helps testing, and we are confident that it should be little-endian, let's add it.
What's a good name? Is there any precedent for something like this in Nettle?
If we look at internal building blocks, there are a couple of examples: gcm_hash, and sha1_compress (and similar compression functions for other hashes). Neither of those are advertised/documented.
I think it would make sense to call the uint32_t function salsa_core, and the uint8_t function salsa_hash. We'll see if we can sort out what properties it really has, but it definitely has important similarities to a hash function.
It is a hash function, just not a cryptographic hash function.
Below is an updated patch. What do you think?
/Simon
Simon Josefsson simon@josefsson.org writes:
It is a hash function, just not a cryptographic hash function.
Below is an updated patch. What do you think?
Looks pretty good. I'm not sure about naming.
Besides that, a few minor comments. Some of them boil downto simply inlining current _salsa20 into salsa20_core, and have salsa20_hash call salsa20_core rather than _salsa20. Do you see any drawbacks with that?
+static void +_salsa20 (unsigned rounds,
uint32_t *x)
No need to use _ in the name of a static-declared function.
- unsigned i;
- for (i = 0; i < rounds; i += 2)
- {
QROUND(x[0], x[4], x[8], x[12]);
QROUND(x[5], x[9], x[13], x[1]);
QROUND(x[10], x[14], x[2], x[6]);
QROUND(x[15], x[3], x[7], x[11]);
QROUND(x[0], x[1], x[2], x[3]);
QROUND(x[5], x[6], x[7], x[4]);
QROUND(x[10], x[11], x[8], x[9]);
QROUND(x[15], x[12], x[13], x[14]);
- }
Any reason not to do the final addition here? Would need separate input and output arguments, and be organized as
x = input (memcpy) do the rounds dst = input + x (loop)
We need a single temporary array somewhere if we want to allow in-place operation, and this may be that right place. And then we have implemented all of salsa20_core.
+void +salsa20_hash (unsigned rounds,
uint8_t *dst,
const uint8_t *src)
+{
- uint32_t x[SALSA20_INPUT_LENGTH];
- unsigned i;
- for (i = 0; i < SALSA20_INPUT_LENGTH; i++)
x[i] = LE_READ_UINT32(&src[i * 4]);
- _salsa20 (rounds, x);
- for (i = 0; i < SALSA20_INPUT_LENGTH; i++)
- {
uint32_t t = x[i] + LE_READ_UINT32(&src[i * 4]);
Seems unnecessary to convert (using LE_READ_UINT32) the same data twice.
+void +salsa20_core (unsigned rounds,
uint32_t *dst,
const uint32_t *src)
+{
- uint32_t x[SALSA20_INPUT_LENGTH];
- unsigned i;
- for (i = 0; i < SALSA20_INPUT_LENGTH; i++)
- x[i] = src[i];
This should be plain memcpy.
Regards, /Niels
Simon Josefsson simon@josefsson.org writes:
No, it is not a cryptographic hash function since it is not collision-resistant. Think of it as CRC or FNV.
I'd like to understand better the properties of Salsa. As I see it, we have a complicated non-linear, but easily invertible, permutation on blocks of 64 bytes (or 512 bits). Call this function P(X). Then we construct a hash function
F(X) = X xor P(X)
The XOR is intended to destroy invertibility both in theory (it's no longer a one-to-one map, or a permutation) and practice (given F(X), it's hard to find a preimage). Am I right so far, or is there some working trick or attack which finds a preimage?
But then I have no idea on the number and structure of the collisions, and on the difficulty of finding a pair of colliding inputs X != Y, F(X) = F(Y).
Right. I considered it, as a way to learn to assembler stuff in Nettle. However, I'm not sure there is a lot to gain, since there is no loop unrolling to speak off.
For salsa20 and x86_64, the main gain comes from using sse2 instructions to exploit the parallelism in the QROUND function.
However, the Salsa20 stream cipher seems to use the core function with uint32_t inputs,
Well, it does use little endian explicitly, when copying key material into the salsa core input.
and scrypt does too.
Reading the internet draft, it looks like it treats input and output as 64 bytes. I haven't read any other scrypt materials.
It seems scrypt doesn't little endian convert inputs but expects the output uint32_t to be little endian converted though... I'll bring this up with Colin.
Please do, this needs to be sorted out.
In nettle I currently don't use array types for function arguments.
I've always regarded it as something intended for humans. I'm not sure what a compiler could do with the information. Perhaps it could be useful for optimizations though.
I tend to use pointer notation, because (1) it's what really happens, (2) it's shorter to write, and (3) because arrays may give a casual reader the impression that something more complicated happens. Neither is a very good reason, but for now I think we should stick to pointers for consistency within Nettle. If arrays notation is better, we should change that in all places where we pass arrays of constant size.
Regards, /Niels
nisse@lysator.liu.se (Niels Möller) writes:
Simon Josefsson simon@josefsson.org writes:
No, it is not a cryptographic hash function since it is not collision-resistant. Think of it as CRC or FNV.
I'd like to understand better the properties of Salsa. As I see it, we have a complicated non-linear, but easily invertible, permutation on blocks of 64 bytes (or 512 bits). Call this function P(X). Then we construct a hash function
F(X) = X xor P(X)
There is no XOR in the Salsa20 core?
and scrypt does too.
Reading the internet draft, it looks like it treats input and output as 64 bytes. I haven't read any other scrypt materials.
Yes, the draft treat it that way because the function appears to be defined that way. The scrypt paper does too (http://www.tarsnap.com/scrypt/scrypt.pdf).
It seems scrypt doesn't little endian convert inputs but expects the output uint32_t to be little endian converted though... I'll bring this up with Colin.
Please do, this needs to be sorted out.
I was mistaken, scrypt uses the uint32_t interface directly and (as epxected) without any endian conversion.
In nettle I currently don't use array types for function arguments.
I've always regarded it as something intended for humans. I'm not sure what a compiler could do with the information. Perhaps it could be useful for optimizations though.
I tend to use pointer notation, because (1) it's what really happens, (2) it's shorter to write, and (3) because arrays may give a casual reader the impression that something more complicated happens. Neither is a very good reason, but for now I think we should stick to pointers for consistency within Nettle. If arrays notation is better, we should change that in all places where we pass arrays of constant size.
No problem, I've made this change locally. Once we sort out the naming issue I'll submit another patch.
/Simon
Simon Josefsson simon@josefsson.org writes:
I'd like to understand better the properties of Salsa. As I see it, we have a complicated non-linear, but easily invertible, permutation on blocks of 64 bytes (or 512 bits). Call this function P(X). Then we construct a hash function
F(X) = X xor P(X)
There is no XOR in the Salsa20 core?
Sorry, the final operation is addition (applied independently to 32 bit pieces), not xor. But structure is the same, P(X) represents the QROUND loop, where each step is invertible, and then at the end we add together the output of the QROUNDs and the original input. Collisions are introduced in the final addition, in the sense that we have
P(x) == P(y) if and only if x = y
but F(x) = F(y) is possible for x != y. A collision means that
P(x) xor x == P(y) xor y
which can be rearrange, since P is invertible, as
x = P^{-1}(x xor y xor P(y)
Examples of such colliding pairs seem non-trivial to find (difficulty increasing wth the number of rounds).
You write that the function is not collision resistant. I'd like to know in which way it fails (ideally with either a simple argument, or a reference).
Regards, /Niels
nisse@lysator.liu.se (Niels Möller) writes:
Simon Josefsson simon@josefsson.org writes:
I'd like to understand better the properties of Salsa. As I see it, we have a complicated non-linear, but easily invertible, permutation on blocks of 64 bytes (or 512 bits). Call this function P(X). Then we construct a hash function
F(X) = X xor P(X)
There is no XOR in the Salsa20 core?
Sorry, the final operation is addition (applied independently to 32 bit pieces), not xor. But structure is the same, P(X) represents the QROUND loop, where each step is invertible, and then at the end we add together the output of the QROUNDs and the original input. Collisions are introduced in the final addition, in the sense that we have
P(x) == P(y) if and only if x = y
but F(x) = F(y) is possible for x != y. A collision means that
P(x) xor x == P(y) xor y
which can be rearrange, since P is invertible, as
x = P^{-1}(x xor y xor P(y)
Examples of such colliding pairs seem non-trivial to find (difficulty increasing wth the number of rounds).
You write that the function is not collision resistant. I'd like to know in which way it fails (ideally with either a simple argument, or a reference).
Try for example:
salsa20_core (8, dst, H("00000000000000000000000000000000" "00000000000000000000000000000000" "00000000000000000000000000000000" "00000000000000000000000000000000")); print_hex(64, dst);
salsa20_core (8, dst, H("00000080000000800000008000000080" "00000080000000800000008000000080" "00000080000000800000008000000080" "00000080000000800000008000000080")); print_hex(64, dst);
both will print the all-zero string, thus illustrating a collision. In fact, Salsa20core(x) = Salsa20core(x + c) for c = "0000000800000008...".
See also:
https://groups.google.com/forum/?fromgroups=#!msg/sci.crypt/AkQnSoO40BA/o4eG... http://cr.yp.to/snuffle/reoncore-20080224.pdf
/Simon
Simon Josefsson simon@josefsson.org writes:
See also:
https://groups.google.com/forum/?fromgroups=#!msg/sci.crypt/AkQnSoO40BA/o4eG... http://cr.yp.to/snuffle/reoncore-20080224.pdf
Thanks for the references. I'm now convinced we should avoid using the word "hash" here. I should revise the corresponding section of the Nettle manual as well.
Which leaves us with an unsolved naming problem... I don't quite like salsa_core and salsa_core32, but I have no better suggestions right now.
Regards, /Niels
nisse@lysator.liu.se (Niels Möller) writes:
Simon Josefsson simon@josefsson.org writes:
See also:
https://groups.google.com/forum/?fromgroups=#!msg/sci.crypt/AkQnSoO40BA/o4eG... http://cr.yp.to/snuffle/reoncore-20080224.pdf
Thanks for the references. I'm now convinced we should avoid using the word "hash" here. I should revise the corresponding section of the Nettle manual as well.
Thanks -- I didn't notice that the text in the manual was wrong before. Fixing that would be good.
Which leaves us with an unsolved naming problem... I don't quite like salsa_core and salsa_core32, but I have no better suggestions right now.
I don't like it a lot either... I believe the uint8_t version should be called salsa20_core. The tricky name is the uint32_t variant. Ideas:
salsa_core32 salsa_core_32 salsa_core4 salsa_core_4 salsa_core_4byte salsa_core_word salsa_core_uint32 salsa_core_uint32
I don't either one of them is particulary good, so the choice is arbitrary. Updated patch below.
/Simon
Simon Josefsson simon@josefsson.org writes:
I don't like it a lot either... I believe the uint8_t version should be called salsa20_core.
Agreed.
The tricky name is the uint32_t variant. Ideas:
On this list, I think I'd prefer one of
salsa_core_word salsa_core_uint32
Another alternative is to consider the function internal, at least for the time being, and name it _salsa20_core. One motivation to keep it internal is that it's suboptimal to have a function processing only a single block, we may need to generalize it if we want implement other things, e.g., a proper cryptographic hash function, using this primitive.
Regards, /Niels
nisse@lysator.liu.se (Niels Möller) writes:
Another alternative is to consider the function internal, at least for the time being, and name it _salsa20_core.
After some more thinking, I think this is the way to go. I'd like to propose the following plan:
* Do a _salsa20_core, working with uint32_t. Consider it an internal function, and keep the interface open (maybe it should be able to do several blocks, maybe it should byteswap output words, etc).
* Implement and document salsa20_core. It takes uint8_t blocks as input and output (together with key and round count), and calls _salsa20_core to do the work.
* Implement scrypt, in terms of _salsa20_core.
* Maybe have the C implementation salsa20_crypto also call _salsa20_core to do the main work. May require byteswapping in _salsa20_core output, to avoid a performance regression.
* Maybe do an x86_64 implementation of _salsa20_core (should be simpler than salsa20_crypt).
What do you think?
Regards, /Niels
nisse@lysator.liu.se (Niels Möller) writes:
After some more thinking, I think this is the way to go. I'd like to propose the following plan:
- Do a _salsa20_core, working with uint32_t. Consider it an internal function, and keep the interface open (maybe it should be able to do several blocks, maybe it should byteswap output words, etc).
[...]
- Maybe have the C implementation salsa20_crypto also call _salsa20_core to do the main work. May require byteswapping in _salsa20_core output, to avoid a performance regression.
I tried extracting a _salsa20_core from salsa20_crypt. See patch below. The function byteswaps the output words (if needed, i.e., on bigendian machines).
On my machine (lowend, AMD E-350), the performance penalty is 4%, 12.25 cycles/byte before the change, 12.75 cycles/byte after. I think that's ok if it can be shared with salsa20_core and scrypt.
(And on this machine there also appears to be no significant gain from the current assembly implementation).
Regards, /Niels
diff --git a/Makefile.in b/Makefile.in index 9904be5..c0ca3ad 100644 --- a/Makefile.in +++ b/Makefile.in @@ -82,6 +82,7 @@ nettle_SOURCES = aes-decrypt-internal.c aes-decrypt.c \ md2.c md2-meta.c md4.c md4-meta.c \ md5.c md5-compress.c md5-compat.c md5-meta.c \ ripemd160.c ripemd160-compress.c ripemd160-meta.c \ + salsa20-core-internal.c \ salsa20-crypt.c salsa20-set-key.c \ sha1.c sha1-compress.c sha1-meta.c \ sha256.c sha256-compress.c sha224-meta.c sha256-meta.c \ diff --git a/salsa20-core-internal.c b/salsa20-core-internal.c new file mode 100644 index 0000000..2c1ae3c --- /dev/null +++ b/salsa20-core-internal.c @@ -0,0 +1,85 @@ +/* salsa20-core-internal.c + * + * Internal interface to the Salsa20 core function. + */ + +/* nettle, low-level cryptographics library + * + * Copyright (C) 2012 Simon Josefsson, Niels Möller + * + * The nettle library is free software; you can redistribute it and/or modify + * it under the terms of the GNU Lesser General Public License as published by + * the Free Software Foundation; either version 2.1 of the License, or (at your + * option) any later version. + * + * The nettle library is distributed in the hope that it will be useful, but + * WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY + * or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public + * License for more details. + * + * You should have received a copy of the GNU Lesser General Public License + * along with the nettle library; see the file COPYING.LIB. If not, write to + * the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, + * MA 02111-1301, USA. + */ + +/* Based on: + salsa20-ref.c version 20051118 + D. J. Bernstein + Public domain. +*/ + +#if HAVE_CONFIG_H +# include "config.h" +#endif + +#include <assert.h> +#include <string.h> + +#include "salsa20.h" + +#include "macros.h" + +#ifdef WORDS_BIGENDIAN +#define LE_SWAP32(v) \ + ((ROTL32(8, v) & 0x00FF00FFUL) | \ + (ROTL32(24, v) & 0xFF00FF00UL)) +#else +#define LE_SWAP32(v) (v) +#endif + +#define QROUND(x0, x1, x2, x3) do { \ + x1 ^= ROTL32(7, x0 + x3); \ + x2 ^= ROTL32(9, x1 + x0); \ + x3 ^= ROTL32(13, x2 + x1); \ + x0 ^= ROTL32(18, x3 + x2); \ + } while(0) + +void +_salsa20_core(uint32_t *dst, const uint32_t *src, unsigned rounds) +{ + uint32_t x[_SALSA20_INPUT_LENGTH]; + unsigned i; + + assert ( (rounds & 1) == 0); + + memcpy (x, src, sizeof(x)); + for (i = 0; i < rounds;i += 2) + { + QROUND(x[0], x[4], x[8], x[12]); + QROUND(x[5], x[9], x[13], x[1]); + QROUND(x[10], x[14], x[2], x[6]); + QROUND(x[15], x[3], x[7], x[11]); + + QROUND(x[0], x[1], x[2], x[3]); + QROUND(x[5], x[6], x[7], x[4]); + QROUND(x[10], x[11], x[8], x[9]); + QROUND(x[15], x[12], x[13], x[14]); + } + + for (i = 0; i < _SALSA20_INPUT_LENGTH; i++) + { + uint32_t t = x[i] + src[i]; + dst[i] = LE_SWAP32 (t); + } +} diff --git a/salsa20-crypt.c b/salsa20-crypt.c index eae3cea..b061b4b 100644 --- a/salsa20-crypt.c +++ b/salsa20-crypt.c @@ -40,21 +40,6 @@ #include "macros.h" #include "memxor.h"
-#ifdef WORDS_BIGENDIAN -#define LE_SWAP32(v) \ - ((ROTL32(8, v) & 0x00FF00FFUL) | \ - (ROTL32(24, v) & 0xFF00FF00UL)) -#else -#define LE_SWAP32(v) (v) -#endif - -#define QROUND(x0, x1, x2, x3) do { \ - x1 ^= ROTL32(7, x0 + x3); \ - x2 ^= ROTL32(9, x1 + x0); \ - x3 ^= ROTL32(13, x2 + x1); \ - x0 ^= ROTL32(18, x3 + x2); \ - } while(0) - void salsa20_crypt(struct salsa20_ctx *ctx, unsigned length, @@ -67,26 +52,8 @@ salsa20_crypt(struct salsa20_ctx *ctx, for (;;) { uint32_t x[_SALSA20_INPUT_LENGTH]; - int i; - memcpy (x, ctx->input, sizeof(x)); - for (i = 0;i < 10;i ++) - { - QROUND(x[0], x[4], x[8], x[12]); - QROUND(x[5], x[9], x[13], x[1]); - QROUND(x[10], x[14], x[2], x[6]); - QROUND(x[15], x[3], x[7], x[11]);
- QROUND(x[0], x[1], x[2], x[3]); - QROUND(x[5], x[6], x[7], x[4]); - QROUND(x[10], x[11], x[8], x[9]); - QROUND(x[15], x[12], x[13], x[14]); - } - - for (i = 0;i < _SALSA20_INPUT_LENGTH;++i) - { - uint32_t t = x[i] + ctx->input[i]; - x[i] = LE_SWAP32 (t); - } + _salsa20_core (x, ctx->input, 20);
ctx->input[9] += (++ctx->input[8] == 0);
diff --git a/salsa20.h b/salsa20.h index 7d47f52..d95d002 100644 --- a/salsa20.h +++ b/salsa20.h @@ -37,6 +37,7 @@ extern "C" { #define salsa20_set_key nettle_salsa20_set_key #define salsa20_set_iv nettle_salsa20_set_iv #define salsa20_crypt nettle_salsa20_crypt +#define _salsa20_core _nettle_salsa20_core
/* Minimum and maximum keysizes, and a reasonable default. In * octets.*/ @@ -75,6 +76,9 @@ salsa20_crypt(struct salsa20_ctx *ctx, unsigned length, uint8_t *dst, const uint8_t *src);
+void +_salsa20_core(uint32_t *dst, const uint32_t *src, unsigned rounds); + #ifdef __cplusplus } #endif
nisse@lysator.liu.se (Niels Möller) writes:
nisse@lysator.liu.se (Niels Möller) writes:
Another alternative is to consider the function internal, at least for the time being, and name it _salsa20_core.
After some more thinking, I think this is the way to go. I'd like to propose the following plan:
- Do a _salsa20_core, working with uint32_t. Consider it an internal function, and keep the interface open (maybe it should be able to do several blocks, maybe it should byteswap output words, etc).
Should that function really be declared in salsa20.h then?
- Implement and document salsa20_core. It takes uint8_t blocks as input and output (together with key and round count), and calls _salsa20_core to do the work.
I assume you didn't mean key here, since it is unkeyed.
- Implement scrypt, in terms of _salsa20_core.
Works for me.
- Maybe have the C implementation salsa20_crypto also call _salsa20_core to do the main work. May require byteswapping in _salsa20_core output, to avoid a performance regression.
Yes, this approach was used in an earlier patch.
- Maybe do an x86_64 implementation of _salsa20_core (should be simpler than salsa20_crypt).
Benchmarking it first might be good, I'm not sure you actually gain a lot here since there is no chained block operation like stream ciphers on bigger buffers.
I won't be able to do any more work on this until Monday though. I think we are essentially done though, so feel free to push things according to the plan above. Or I can put together something next week.
Thanks, /Simon
Simon Josefsson simon@josefsson.org writes:
nisse@lysator.liu.se (Niels Möller) writes:
- Do a _salsa20_core, working with uint32_t. Consider it an internal function, and keep the interface open (maybe it should be able to do several blocks, maybe it should byteswap output words, etc).
Should that function really be declared in salsa20.h then?
Other internal but exported functions are declared in public headers, so at least it's consistent.
- Implement and document salsa20_core. It takes uint8_t blocks as input and output (together with key and round count), and calls _salsa20_core to do the work.
I assume you didn't mean key here, since it is unkeyed.
You're right, of course.
- Maybe do an x86_64 implementation of _salsa20_core (should be simpler than salsa20_crypt).
Benchmarking it first might be good, I'm not sure you actually gain a lot here since there is no chained block operation like stream ciphers on bigger buffers.
Now benchmarking on my laptop, the C implementation takes 611 cycles (9.5 cycles / byte), with 20 rounds. I just tried an assembly implementation based on salsa20-crypt.asm. That takes 475 cycles (7.4 cycles / byte). Compared to salsa20_crypt, which currently takes 6.5 cycles / byte.
I think we are essentially done though, so feel free to push things according to the plan above. Or I can put together something next week.
I've pushed _salsa20_core now. Note that it does byteswapping of the output words, we'll see if that turns out to be good or bad for other applications of it.
Regards, /Niels
nettle-bugs@lists.lysator.liu.se