Simon Josefsson <simon@josefsson.org> writes:
Actually, sleeping on this, I realized that we really want to export the Salsa20 core primitive (this was what I actually needed), and that is the primitive that should be implemented in assembler. I've fixed this in the attached patch.
The Salsa20 core is a hash function (not your typical hash function though) described here:
I guess it could be named salsa20_hash, then? (I think there was such a function in a previous version of the code).
If we implement that quickly in assembler, with a variable round parameter, that will be sufficient to build fast C code around.
Then you'd first write the hash output to memory, then read it back to xor it with the message. Since salsa20 is pretty fast, I think you'll get a measurable performance penalty compared to the current code, which keeps the hash output in registers until it is xored to the message. You really need to get just the hash output, without xoring it to anything?
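To make the two shapes concrete, here is a rough C sketch, not the actual nettle code. All names ending in _sketch, and SALSA20_WORDS, are made up for illustration. The message is treated as host-order 32-bit words and byte swapping is left out, so this only shows the data flow: variant A stores the hash to a buffer and reloads it for the xor, variant B xors while the words are still in locals.

```c
#include <stddef.h>
#include <stdint.h>

#define SALSA20_WORDS 16  /* hypothetical name: 16 words = 64 bytes */

#define ROTL32(n, x) (((x) << (n)) | ((x) >> (32 - (n))))

/* Salsa20 quarter round, updating its four arguments in place. */
#define QROUND(a, b, c, d) do { \
    b ^= ROTL32(7,  a + d);     \
    c ^= ROTL32(9,  b + a);     \
    d ^= ROTL32(13, c + b);     \
    a ^= ROTL32(18, d + c);     \
  } while (0)

/* Run `rounds` rounds (assumed even) on x in place: one column round
   followed by one row round per iteration. */
void
salsa20_rounds_sketch(uint32_t x[SALSA20_WORDS], unsigned rounds)
{
  unsigned i;
  for (i = 0; i < rounds; i += 2)
    {
      QROUND(x[0], x[4], x[8], x[12]);
      QROUND(x[5], x[9], x[13], x[1]);
      QROUND(x[10], x[14], x[2], x[6]);
      QROUND(x[15], x[3], x[7], x[11]);

      QROUND(x[0], x[1], x[2], x[3]);
      QROUND(x[5], x[6], x[7], x[4]);
      QROUND(x[10], x[11], x[8], x[9]);
      QROUND(x[15], x[12], x[13], x[14]);
    }
}

/* Variant A: standalone hash. The result (rounds plus feed-forward
   addition of the input) is written to memory as host-order words. */
void
salsa20_hash_sketch(uint32_t dst[SALSA20_WORDS],
                    const uint32_t src[SALSA20_WORDS],
                    unsigned rounds)
{
  uint32_t x[SALSA20_WORDS];
  size_t i;
  for (i = 0; i < SALSA20_WORDS; i++)
    x[i] = src[i];
  salsa20_rounds_sketch(x, rounds);
  for (i = 0; i < SALSA20_WORDS; i++)
    dst[i] = x[i] + src[i];
}

/* Caller built on variant A: store the hash, read it back, xor. */
void
xor_block_two_step_sketch(uint32_t out[SALSA20_WORDS],
                          const uint32_t msg[SALSA20_WORDS],
                          const uint32_t state[SALSA20_WORDS],
                          unsigned rounds)
{
  uint32_t ks[SALSA20_WORDS];
  size_t i;
  salsa20_hash_sketch(ks, state, rounds);
  for (i = 0; i < SALSA20_WORDS; i++)
    out[i] = msg[i] ^ ks[i];
}

/* Variant B: fused. The feed-forward sum is xored into the message
   directly, with no intermediate keystream buffer. */
void
xor_block_fused_sketch(uint32_t out[SALSA20_WORDS],
                       const uint32_t msg[SALSA20_WORDS],
                       const uint32_t state[SALSA20_WORDS],
                       unsigned rounds)
{
  uint32_t x[SALSA20_WORDS];
  size_t i;
  for (i = 0; i < SALSA20_WORDS; i++)
    x[i] = state[i];
  salsa20_rounds_sketch(x, rounds);
  for (i = 0; i < SALSA20_WORDS; i++)
    out[i] = msg[i] ^ (x[i] + state[i]);
}
```

Both variants compute the same output. Whether the extra store and reload in variant A costs anything measurable depends on the compiler and on how the assembler version is structured, so it would have to be benchmarked.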
It would definitely be cleaner to have the hash function separately.
+salsa20_core (uint32_t src[_SALSA20_INPUT_LENGTH],
uint32_t dst[_SALSA20_INPUT_LENGTH],
unsigned rounds)
[...]
- for (i = 0;i < _SALSA20_INPUT_LENGTH;++i)
- {
uint32_t t = x[i] + src[i];
dst[i] = LE_SWAP32 (t);
- }
+}
This makes for a very peculiar interface for a non-internal function. It would make more sense from an interface perspective to either not do these byte swaps, or have the output parameter be of type uint8_t *. Or do something like the union gcm_block in gcm.h (although that's also not pretty), if we want to be able to store the byte swapped value with a word-sized store.
I don't remember precisely the background of the current implementation, but I think the point was to do as much as possible of the processing as word operations, including the byte swapping.
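As an illustration of the uint8_t * alternative, the final step of the core could look roughly like the sketch below. The function names are hypothetical and not taken from the patch. It writes each output word as four little-endian bytes, so the result has a well-defined byte order on any host, at the price of byte-sized stores instead of one word-sized store per word.

```c
#include <stddef.h>
#include <stdint.h>

/* Store one 32-bit word as four little-endian bytes, independent of
   the byte order of the host. */
void
le32_store_sketch(uint8_t *dst, uint32_t w)
{
  dst[0] = (uint8_t) (w & 0xff);
  dst[1] = (uint8_t) ((w >> 8) & 0xff);
  dst[2] = (uint8_t) ((w >> 16) & 0xff);
  dst[3] = (uint8_t) ((w >> 24) & 0xff);
}

/* Hypothetical final step of the core: add the input words to the
   round output x (the feed-forward) and serialize the 16 sums into
   a 64-byte buffer. */
void
salsa20_output_bytes_sketch(uint8_t dst[64],
                            const uint32_t x[16],
                            const uint32_t src[16])
{
  size_t i;
  for (i = 0; i < 16; i++)
    le32_store_sketch(dst + 4 * i, x[i] + src[i]);
}
```

With this shape the caller never sees byte-swapped uint32_t values. A compiler may or may not merge the four byte stores into a single word store on little-endian targets, which is the trade-off against doing the swap as a word operation.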
Regards, /Niels