nisse@lysator.liu.se (Niels Möller) writes:
Simon Josefsson <simon@josefsson.org> writes:
Actually, sleeping on this, I realized that we really want to export the Salsa20 core primitive (this was what I actually needed), and that is the primitive that should be implemented in assembler. I've fixed this in the attached patch.
The Salsa20 core is a hash function (not your typical hash function though) described here:
I guess it could be named salsa20_hash, then? (I think there was such a function in a previous version of the code).
The name of the hash is "Salsa20 core" but I think little effort has gone into tightening up the documentation around the Salsa20 hash (for example, there are no test vectors that I could find). salsa20_hash works for me, but could be confusing as it isn't a normal hash.
If we implement that quickly in assembler, with a variable round parameter, that will be sufficient to build fast C code around.
Then you'd first write the hash output to memory, then read it back to xor it with the message. Since salsa20 is pretty fast, I think you'll get a measurable performance penalty compared to the current code, which keeps the hash output in registers until it is xored to the message.
Right, good point.
You really need to get just the hash output, without xoring it to anything?
Yes, although if necessary I could xor it into a zero buffer if there were no other way. However, I'd lose performance, and my application (scrypt) would benefit from good performance.
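For illustration, the zero-buffer fallback could look roughly like the sketch below. Note that toy_xor_keystream is a hypothetical stand-in for any primitive that only offers "dst = src XOR keystream"; it is not a Nettle function, and its keystream is a dummy pattern, not Salsa20.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a primitive that can only xor its output
   into a message. The "keystream" here is a dummy byte pattern. */
void
toy_xor_keystream (uint8_t *dst, const uint8_t *src, size_t length)
{
  size_t i;
  for (i = 0; i < length; i++)
    dst[i] = src[i] ^ (uint8_t) (0xA5 + i);
}

/* Recover the raw output by xoring into an all-zero buffer.
   Since k ^ 0 == k, the buffer ends up holding the keystream itself. */
void
raw_keystream (uint8_t *out, size_t length)
{
  memset (out, 0, length);               /* extra pass over memory */
  toy_xor_keystream (out, out, length);  /* in-place xor with zeros */
}
```

The memset and the xor with zeros are pure overhead here, which is the performance cost mentioned above.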
It would definitely be cleaner to have the hash function separately.
I agree.
+salsa20_core (uint32_t src[_SALSA20_INPUT_LENGTH],
uint32_t dst[_SALSA20_INPUT_LENGTH],
unsigned rounds)
[...]
- for (i = 0;i < _SALSA20_INPUT_LENGTH;++i)
- {
uint32_t t = x[i] + src[i];
dst[i] = LE_SWAP32 (t);
- }
+}
This makes for a very peculiar interface for a non-internal function. It would make more sense from an interface perspective to either not do these byte swaps, or have the output parameter be of type uint8_t *. Or do something like the union gcm_block in gcm.h (although that's also not pretty), if we want to be able to store the byte swapped value with a word-sized store.
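A rough sketch of what the union variant might look like follows. The names salsa20_block_sketch, le_swap32_sketch and store_block_sketch are made up for this illustration; this is not the actual definition of union gcm_block, only the same general idea of a byte view and a word view over one buffer.

```c
#include <stdint.h>

/* Hypothetical block type: callers see bytes, the implementation
   can use word-sized stores. */
union salsa20_block_sketch
{
  uint8_t b[64];
  uint32_t w[16];
};

/* Return a word whose in-memory bytes are the little-endian encoding
   of v. On a little-endian host this is the identity. */
uint32_t
le_swap32_sketch (uint32_t v)
{
  union { uint32_t w; uint8_t b[4]; } u;
  u.b[0] = (uint8_t) v;
  u.b[1] = (uint8_t) (v >> 8);
  u.b[2] = (uint8_t) (v >> 16);
  u.b[3] = (uint8_t) (v >> 24);
  return u.w;
}

/* Store 16 words with word-sized writes; the byte view then holds
   the little-endian serialization regardless of host byte order. */
void
store_block_sketch (union salsa20_block_sketch *dst, const uint32_t x[16])
{
  unsigned i;
  for (i = 0; i < 16; i++)
    dst->w[i] = le_swap32_sketch (x[i]);
}
```

The union guarantees word alignment for the stores, at the price of exposing a somewhat awkward type in the public interface.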
Let's use uint8_t. The first sentence of the Salsa20 core webpage is:
The Salsa20 core is a function from 64-byte strings to 64-byte strings: the Salsa20 core reads a 64-byte string x and produces a 64-byte string Salsa20(x).
So that is consistent with uint8_t.
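To make the byte-string interface concrete, here is a minimal portable sketch of the core with a uint8_t interface and a variable round count. The name salsa20_core_sketch and the argument order are assumptions for illustration, not the interface in the patch; the quarter-round pattern follows the published Salsa20 specification, and rounds is assumed to be even (20 for the full core).

```c
#include <stdint.h>

#define ROTL32(v, n) (((v) << (n)) | ((v) >> (32 - (n))))

/* One Salsa20 quarter-round on four state words. */
#define QROUND(a, b, c, d) do {   \
    b ^= ROTL32 (a + d, 7);       \
    c ^= ROTL32 (b + a, 9);       \
    d ^= ROTL32 (c + b, 13);      \
    a ^= ROTL32 (d + c, 18);      \
  } while (0)

static uint32_t
load_le32 (const uint8_t *p)
{
  return (uint32_t) p[0] | ((uint32_t) p[1] << 8)
    | ((uint32_t) p[2] << 16) | ((uint32_t) p[3] << 24);
}

static void
store_le32 (uint8_t *p, uint32_t v)
{
  p[0] = (uint8_t) v;
  p[1] = (uint8_t) (v >> 8);
  p[2] = (uint8_t) (v >> 16);
  p[3] = (uint8_t) (v >> 24);
}

/* Hypothetical interface: 64 bytes in, 64 bytes out, rounds must be even.
   dst and src may point to the same buffer. */
void
salsa20_core_sketch (uint8_t dst[64], const uint8_t src[64], unsigned rounds)
{
  uint32_t in[16], x[16];
  unsigned i;

  /* Parse the input as 16 little-endian words. */
  for (i = 0; i < 16; i++)
    in[i] = x[i] = load_le32 (src + 4 * i);

  for (i = 0; i < rounds; i += 2)
    {
      /* Column round. */
      QROUND (x[0], x[4], x[8], x[12]);
      QROUND (x[5], x[9], x[13], x[1]);
      QROUND (x[10], x[14], x[2], x[6]);
      QROUND (x[15], x[3], x[7], x[11]);
      /* Row round. */
      QROUND (x[0], x[1], x[2], x[3]);
      QROUND (x[5], x[6], x[7], x[4]);
      QROUND (x[10], x[11], x[8], x[9]);
      QROUND (x[15], x[12], x[13], x[14]);
    }

  /* Feed-forward addition, then serialize as little-endian bytes. */
  for (i = 0; i < 16; i++)
    store_le32 (dst + 4 * i, x[i] + in[i]);
}
```

With this shape the byte swapping is hidden inside the function, at the cost of byte-wise loads and stores that a word-oriented implementation would avoid.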
I don't remember precisely the background of the current implementation, but I think the point was to do as much as possible of the processing as word operations, including the byte swapping.
Yes that will be faster I suppose.
/Simon