Hi! Sorry for not following up on the Salsa20 thread, other things needed my attention... primarily my reason for saying this shouldn't be committed directly was that it used the first reference implementation from:
and that page contains other implementations as well, which may be faster. However speed can always be improved later on, I suppose the important aspect is to get the API right from the start.
I'm replying to all your replies in this e-mail, see below.
nisse@lysator.liu.se (Niels Möller) writes:
Simon Josefsson simon@josefsson.org writes:
Hi! Please find attached a port of DJB's public domain code for Salsa20 to nettle. The patch is not meant to be applied as-is but to start a discussion.
I'm not familiar with this cipher... Could you explain briefly why one may want to use it?
I don't know of any reason today, e.g., no important Internet applications uses it. It has some interesting properties though, so maybe there is uptake.
While writing this I noticed a sub-optimal design of Nettle, in that it overloads the meaning of struct nettle_cipher->block_size as a flag of whether the cipher is a stream cipher or not. Before arcfour was the only stream cipher in nettle, and using block_size=0 to signal that it is a stream cipher worked fine. However, Salsa20 has a block size of 8 bytes.
If Salsa20 doesn't fit in the nettle_cipher abstraction, then I think we should start by implementing Salsa20 *without* trying to get it to fit there. To me, a "stream cipher" with a block size and iv seems a bit like a contradiction.
Somewhat, although I think the distinction between stream ciphers and block ciphers is not a good one to make in an API. Usually applications want to use ciphers in some mode, and some modes used with block ciphers make them essentially stream ciphers. I think an good API should be modeled around "symmetric encryption" as the concept. Possibly just "encryption", but I think public-key and symmetric encryption are such different beasts that they shouldn't be exposed through the same API.
In general, I don't think it makes much sense to include algorithms which behave very differently under the "nettle_cipher" umbrella, since the point of that abstraction is to collect algorithms which are easily exchangable. Code accepting a nettle_cipher are not expected to check flags or behave fundamentally different for different algorithms.
Sure.
Including the arcfour stream cipher there may well be a mistake. One annoying problem with the current design is that for block ciphers, the ctx argument of the encrypt and decrypt functions naturally is const, but due arcfour being fittted in the same nettle_cipher abstraction, the function typedef nettle_crypt_func can't use a const ctx.
I always found rc4 the odd cipher in nettle.
Anyway, I'd prefer to treat redesign in that area as an independent problem from adding support for Salsa20.
That is a good idea.
Then there's also the tentative (nettle-internal.h) nettle_aead abstraction. Salsa20 could perhaps fit there, if we allow algorithms with no authentication (NULL digest function pointer).
Possibly... or just have one abstract "symmetric encryption" that embodies all these variants. Or does that lead to other disadvantages?
--- /dev/null +++ b/salsa20.c +#define ROTL32(x,n) ((((x))<<(n)) | (((x))>>(32-(n))))
There are several different variants of that macro. It would be nice with a unified one in macros.h.
I agree.
+#define SWAP32(v) \
- ((ROTL32(v, 8) & 0x00FF00FFUL) | \
- (ROTL32(v, 24) & 0xFF00FF00UL))
+#ifdef WORDS_BIGENDIAN +#define U32TO32_LITTLE(v) SWAP32(v) +#else +#define U32TO32_LITTLE(v) (v) +#endif
+#define U8TO32_LITTLE(p) U32TO32_LITTLE(((uint32_t*)(p))[0]) +#define U32TO8_LITTLE(p, v) (((uint32_t*)(p))[0] = U32TO32_LITTLE(v))
That's a clever byte swapping trick (at least if a true rot instruction is available). In Nettle conversion between bytes and integers are usually done with READ_UINT32, LE_READ_UINT32 and friends. It's usually not very performance critical, and it deals naturally with any alignment. I suspect U8TO32_LITTLE above breaks if the input is unaligned but the architecture doesn't allow unaligned word reads.
Using READ_UINT32 etc is probably better.
+/* +salsa20-ref.c version 20051118 +D. J. Bernstein +Public domain. +*/
Shouldn't that be moved up closer to the (C) header?
I put the majority of my changes above that header, to make it easy to sync and compare the code
+#define ROTATE(v,c) (ROTL32(v,c)) +#define XOR(v,w) ((v) ^ (w)) +#define PLUS(v,w) ((v) + (w)) +#define PLUSONE(v) (PLUS((v),1))
I'm not sure these macros improve readability...
They are from the reference code.
+static void salsa20_wordtobyte(uint8_t output[64],const uint32_t input[16])
Despite the name, this is the main encryption function. Or rather, a (hopefully) one-way function? "input" is really the context struct, which contains some mix of the key and the iv?
I must admit I haven't looked into the details of the code.
- x[ 4] = XOR(x[ 4],ROTATE(PLUS(x[ 0],x[12]), 7));
And what's wrong with writing that as
x[ 4] ^= ROTATE(x[ 0] + x[12], 7)
That's clearer to me.
+void +salsa20_set_iv(struct salsa20_ctx *ctx, unsigned length, const uint8_t *iv) +{
- assert (length == SALSA20_IV_SIZE);
And this initializes indices 6-9. I'd remove the length argument, if only one value is allowed anyway.
I added it only for compatibility with the internal nettle interface.
I've done some more homework, reading some of djb's papers and the wikipedia page. Some questions:
- Any protocols that specify use of salsa20? And in combination with which MAC? With claimed cycle numbers, hmac with sha1 or sha2 is going to take much more time than the encryption.
I don't know of any protocols right now.
- What about the later "ChaCha" cipher? As far as I see, it has not been studied as much as Salsa20.
Yeah, I looked at it too, but preferred something that people seems more familiar with.
A bit of cleanup of the code.
Try optimization of the C code for 64-bit machines. One ought to be able to do two column or row operations in parallel by putting two salsa20 words into a uint64_t variable. May need some tricks to avoid carry propagation between the words, but I suspect it may be a win due to lower register pressure. A bit similar to the HAVE_NATIVE_64_BIT in camellia-crypt-internal.c.
Try an sse2 assembly implementation (the djb:s papers outline how to do that). Or copy some existing implementation.
Take a look at the link above, most likely there exists something. I'm not sure how important it is though.
- Documentation.
Yes.
As for the salsa20_* interface:
- It would be possible to change the interface to not expose the block size, doing a little buffering instead. But I think it's better to not do that, and follow what the ctr code and gcm code does.
I agree.
- One advertised feature of the cipher is random access. I think we should have something like a salsa20_set_pos, taking a block count as argument.
Yes.
I also arranged in the Makefile so that nettle-internal.o no longer is included in the nettle library. Will that cause any trouble for anybody?
Didn't GnuTLS use some thought-to-be-internal-functions-but-apparently-not? Maybe it was just the macros.h header file though.
Do any of you know of any protocols which specify use of salsa20? Is it usually combined with some *fast* MAC algorithm?
I suspect people who like Salsa are inclined to also like CubeHash, which could be used in HMAC variants. CubeHash is fast with optimistic parameters, but the "default" is pretty conservative making it not so fast.
/Simon