Simon Josefsson <simon@josefsson.org> writes:
nisse@lysator.liu.se (Niels Möller) writes:
- Do a _salsa20_core, working with uint32_t. Consider it an internal function, and keep the interface open (maybe it should be able to do several blocks, maybe it should byteswap output words, etc).
Should that function really be declared in salsa20.h then?
Other internal but exported functions are declared in public headers, so at least it's consistent.
- Implement and document salsa20_core. It takes uint8_t blocks as input and output (together with key and round count), and calls _salsa20_core to do the work.
I assume you didn't mean key here, since it is unkeyed.
You're right, of course.
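To make the proposed split concrete, here is a minimal sketch of a word-oriented core with a byte-oriented wrapper on top. The names sketch_core_words and sketch_core_bytes are hypothetical, and this is not the code that ended up in the library; in particular, the sketch keeps the words in host order inside the core and does all little-endian conversion in the wrapper, which is only one of the options left open above.

```c
#include <assert.h>
#include <stdint.h>

#define ROTL32(n, x) (((x) << (n)) | ((x) >> (32 - (n))))

/* One Salsa20 quarter round on four words of the state. */
#define QROUND(x0, x1, x2, x3) do {  \
    x1 ^= ROTL32(7, x0 + x3);        \
    x2 ^= ROTL32(9, x1 + x0);        \
    x3 ^= ROTL32(13, x2 + x1);       \
    x0 ^= ROTL32(18, x3 + x2);       \
  } while (0)

/* Hypothetical internal function: 16 words in, 16 words out, host
   byte order. The round count must be even. dst may alias src. */
static void
sketch_core_words(uint32_t *dst, const uint32_t *src, unsigned rounds)
{
  uint32_t x[16];
  unsigned i;

  assert((rounds & 1) == 0);

  for (i = 0; i < 16; i++)
    x[i] = src[i];

  for (i = 0; i < rounds; i += 2)
    {
      /* Column round. */
      QROUND(x[0], x[4], x[8], x[12]);
      QROUND(x[5], x[9], x[13], x[1]);
      QROUND(x[10], x[14], x[2], x[6]);
      QROUND(x[15], x[3], x[7], x[11]);
      /* Row round. */
      QROUND(x[0], x[1], x[2], x[3]);
      QROUND(x[5], x[6], x[7], x[4]);
      QROUND(x[10], x[11], x[8], x[9]);
      QROUND(x[15], x[12], x[13], x[14]);
    }

  /* Feed-forward: add the input to the permuted state. */
  for (i = 0; i < 16; i++)
    dst[i] = x[i] + src[i];
}

/* Hypothetical public function: unkeyed, 64 bytes in, 64 bytes out.
   Words are read and written as little-endian. */
static void
sketch_core_bytes(uint8_t *dst, const uint8_t *src, unsigned rounds)
{
  uint32_t w[16];
  unsigned i;

  for (i = 0; i < 16; i++)
    w[i] = (uint32_t) src[4*i]
      | ((uint32_t) src[4*i + 1] << 8)
      | ((uint32_t) src[4*i + 2] << 16)
      | ((uint32_t) src[4*i + 3] << 24);

  sketch_core_words(w, w, rounds);

  for (i = 0; i < 16; i++)
    {
      dst[4*i]     = (uint8_t) (w[i] & 0xff);
      dst[4*i + 1] = (uint8_t) ((w[i] >> 8) & 0xff);
      dst[4*i + 2] = (uint8_t) ((w[i] >> 16) & 0xff);
      dst[4*i + 3] = (uint8_t) ((w[i] >> 24) & 0xff);
    }
}
```

Keeping the byte handling in the wrapper means an assembly version would only have to replace sketch_core_words, at the cost of an extra pass over the block.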
- Maybe do an x86_64 implementation of _salsa20_core (should be simpler than salsa20_crypt).
Benchmarking it first might be good. I'm not sure you actually gain a lot here, since there is no chained operation over many blocks, as there is when a stream cipher processes bigger buffers.
Now benchmarking on my laptop, the C implementation takes 611 cycles (9.5 cycles / byte) with 20 rounds. I just tried an assembly implementation based on salsa20-crypt.asm; that takes 475 cycles (7.4 cycles / byte). For comparison, salsa20_crypt currently takes 6.5 cycles / byte.
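The per-byte figures are the per-block cycle counts divided by the 64-byte block that the core processes. A small sketch of that arithmetic, with a hypothetical helper name:

```c
#include <assert.h>

/* The Salsa20 core maps one 64-byte block to one 64-byte block. */
#define SKETCH_CORE_BLOCK_BYTES 64

/* Hypothetical helper: convert a per-block cycle count to cycles/byte. */
static double
sketch_cycles_per_byte(unsigned long cycles_per_block, unsigned block_bytes)
{
  assert(block_bytes > 0);
  return (double) cycles_per_block / block_bytes;
}
```

With the numbers above, 611 / 64 is about 9.55 and 475 / 64 is about 7.42, so the assembly version needs roughly 22% fewer cycles per block than the C version.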
I think we are essentially done though, so feel free to push things according to the plan above. Or I can put together something next week.
I've pushed _salsa20_core now. Note that it byteswaps the output words; we'll see if that turns out to be good or bad for other applications of it.
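To illustrate what byteswapping the output words means here, a minimal sketch follows. It assumes the intent is that the words, once stored in memory, have the little-endian byte layout that Salsa20 uses; the helper names are hypothetical and this is not the pushed code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: return a word whose in-memory bytes are the
   little-endian encoding of x. This is the identity on a little-endian
   host and a byteswap on a big-endian host. */
static uint32_t
sketch_le32_from_host(uint32_t x)
{
  uint8_t b[4];
  uint32_t r;

  b[0] = (uint8_t) (x & 0xff);
  b[1] = (uint8_t) ((x >> 8) & 0xff);
  b[2] = (uint8_t) ((x >> 16) & 0xff);
  b[3] = (uint8_t) ((x >> 24) & 0xff);
  memcpy(&r, b, sizeof r);
  return r;
}

/* Convert n output words in place, so that the buffer can afterwards
   be treated as a byte string in Salsa20 byte order. */
static void
sketch_words_to_le(uint32_t *words, size_t n)
{
  size_t i;

  assert(words != NULL || n == 0);
  for (i = 0; i < n; i++)
    words[i] = sketch_le32_from_host(words[i]);
}
```

A caller that wants bytes gets them for free after this step, while a caller that wants to keep computing on the words on a big-endian host would have to swap them back, which is the trade-off mentioned above.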
Regards, /Niels