nisse@lysator.liu.se (Niels Möller) writes:
> For x86_64 (and maybe x86_32), an assembly implementation also seems attractive,
I just pushed an x86_64 implementation of salsa20_crypt. It runs at 6.6 cycles/byte on my laptop (or 189 Mbyte/s), which is more than twice the speed of aes128 and slightly faster than arcfour.
It's likely possible to squeeze out a cycle or two more, by doing two blocks in parallel (I think djb's x86_64 code does that, but I found it very hard to read), or by other micro-optimizations.
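For reference, here is a minimal portable C sketch of the Salsa20 core that salsa20_crypt has to compute for each 64-byte block. This is not nettle's code: the names quarterround and salsa20_core are hypothetical, and the key/nonce/counter setup of the input state and the XOR with the data are left out. Within each column round and row round the four quarterrounds are independent. Doing two blocks in parallel would presumably mean interleaving two independent copies of this 16-word state, giving the CPU more independent work per cycle.

```c
#include <stdint.h>
#include <string.h>

/* Rotate a 32-bit word left by c bits (0 < c < 32). */
#define ROTL32(x, c) (((x) << (c)) | ((x) >> (32 - (c))))

/* One Salsa20 quarterround, updating the words a, b, c, d in place. */
static void quarterround(uint32_t *a, uint32_t *b, uint32_t *c, uint32_t *d)
{
  *b ^= ROTL32(*a + *d, 7);
  *c ^= ROTL32(*b + *a, 9);
  *d ^= ROTL32(*c + *b, 13);
  *a ^= ROTL32(*d + *c, 18);
}

/* Salsa20 core: 10 double rounds (20 rounds) on a 16-word state,
   followed by the feed-forward addition of the input words. */
static void salsa20_core(uint32_t out[16], const uint32_t in[16])
{
  uint32_t x[16];
  memcpy(x, in, sizeof x);

  for (int i = 0; i < 10; i++) {
    /* Column round: four independent quarterrounds. */
    quarterround(&x[0],  &x[4],  &x[8],  &x[12]);
    quarterround(&x[5],  &x[9],  &x[13], &x[1]);
    quarterround(&x[10], &x[14], &x[2],  &x[6]);
    quarterround(&x[15], &x[3],  &x[7],  &x[11]);
    /* Row round: again four independent quarterrounds. */
    quarterround(&x[0],  &x[1],  &x[2],  &x[3]);
    quarterround(&x[5],  &x[6],  &x[7],  &x[4]);
    quarterround(&x[10], &x[11], &x[8],  &x[9]);
    quarterround(&x[15], &x[12], &x[13], &x[14]);
  }

  /* Feed-forward: add the original input to the permuted state. */
  for (int i = 0; i < 16; i++)
    out[i] = x[i] + in[i];
}
```

As a sanity check, quarterround applied to (0x00000001, 0, 0, 0) gives (0x08008145, 0x00000080, 0x00010200, 0x20500000), which matches the quarterround example in djb's Salsa20 specification.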
Do any of you know of any protocols which specify use of salsa20? Is it usually combined with some *fast* MAC algorithm?
Regards, /Niels