Joachim Strömbergson joachim@secworks.se writes:
Niels Möller wrote:
Benchmarking nettle's implementation on my office machine (core i5),
  algorithm   cycles/byte
  arcfour     7.5
  arcfour     3.75 (openssl)
Side issue: Pretty big difference in performance also for arcfour.
Right, and this time in openssl's favour. I think that speed is quite impressive. I haven't written any arcfour assembly for x86_64, but I have tried earlier for x86 and sparc. It's a very serial loop doing one byte at a time. It's tempting to try to do two bytes at a time, but the easy way gives incorrect results when the i and j indices happen to collide.
One approach I played a bit with was to nevertheless do two bytes at a time, and then add some unlikely condition to detect collisions and fix them. But I couldn't manage to make that fast.
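For reference, the serial loop discussed above can be sketched in plain C roughly as follows. This is only an illustration: the struct rc4_ctx and the function names are made up for this example and are not nettle's actual context layout or API.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical context for illustration; not nettle's real struct. */
struct rc4_ctx {
  uint8_t S[256];
  uint8_t i, j;
};

/* Standard arcfour key setup. */
static void
rc4_set_key(struct rc4_ctx *ctx, const uint8_t *key, size_t key_len)
{
  unsigned i, j = 0;
  for (i = 0; i < 256; i++)
    ctx->S[i] = (uint8_t) i;
  for (i = 0; i < 256; i++)
    {
      uint8_t t = ctx->S[i];
      j = (j + t + key[i % key_len]) & 0xff;
      ctx->S[i] = ctx->S[j];
      ctx->S[j] = t;
    }
  ctx->i = ctx->j = 0;
}

/* One byte per iteration. Each step swaps S[i] and S[j] before the
   next step reads S[i+1]. If j happens to equal i+1, the next step
   must see the freshly swapped value, so two steps cannot simply be
   interleaved with all loads done up front. */
static void
rc4_crypt(struct rc4_ctx *ctx, size_t length,
          uint8_t *dst, const uint8_t *src)
{
  uint8_t i = ctx->i, j = ctx->j;
  size_t n;
  for (n = 0; n < length; n++)
    {
      uint8_t si, sj;
      i++;                      /* wraps mod 256 */
      si = ctx->S[i];
      j += si;                  /* wraps mod 256 */
      sj = ctx->S[j];
      ctx->S[i] = sj;
      ctx->S[j] = si;
      dst[n] = src[n] ^ ctx->S[(uint8_t) (si + sj)];
    }
  ctx->i = i;
  ctx->j = j;
}
```

Every load in one iteration depends on the stores of the previous one, which is why a naive two-bytes-per-iteration version goes wrong whenever the indices collide.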
An easier trick is to generate four or eight bytes of the keystream at a time, collecting the result in a register, so the xoring of the data can be done a word at a time. The sparc implementation does something along those lines, and at least does the data writes as aligned words.
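A rough C sketch of the word-at-a-time idea, under my own assumptions: the names (rc4w_ctx, rc4w_next, rc4w_crypt) are invented for this example, it collects eight bytes into a uint64_t, and it uses memcpy for the data accesses instead of requiring alignment. It is not what the sparc assembly actually does, just the same general shape.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical context for illustration; not nettle's real struct. */
struct rc4w_ctx {
  uint8_t S[256];
  uint8_t i, j;
};

/* Standard arcfour key setup. */
static void
rc4w_set_key(struct rc4w_ctx *ctx, const uint8_t *key, size_t key_len)
{
  unsigned i, j = 0;
  for (i = 0; i < 256; i++)
    ctx->S[i] = (uint8_t) i;
  for (i = 0; i < 256; i++)
    {
      uint8_t t = ctx->S[i];
      j = (j + t + key[i % key_len]) & 0xff;
      ctx->S[i] = ctx->S[j];
      ctx->S[j] = t;
    }
  ctx->i = ctx->j = 0;
}

/* Produce one keystream byte and advance the state. */
static uint8_t
rc4w_next(struct rc4w_ctx *ctx)
{
  uint8_t si, sj;
  ctx->i++;
  si = ctx->S[ctx->i];
  ctx->j += si;
  sj = ctx->S[ctx->j];
  ctx->S[ctx->i] = sj;
  ctx->S[ctx->j] = si;
  return ctx->S[(uint8_t) (si + sj)];
}

static void
rc4w_crypt(struct rc4w_ctx *ctx, size_t length,
           uint8_t *dst, const uint8_t *src)
{
  /* Runtime byte-order check, so keystream byte k lines up with
     data byte k in memory on either kind of host. */
  const uint16_t probe = 1;
  const int little = *(const uint8_t *) &probe;

  while (length >= 8)
    {
      uint64_t ks = 0, w;
      unsigned k;
      /* Still one byte per step, but collected in a single word. */
      for (k = 0; k < 8; k++)
        ks |= (uint64_t) rc4w_next(ctx) << (little ? 8 * k : 56 - 8 * k);
      memcpy(&w, src, 8);       /* one word load */
      w ^= ks;                  /* one word xor */
      memcpy(dst, &w, 8);       /* one word store */
      src += 8; dst += 8; length -= 8;
    }
  /* Leftover bytes, one at a time. */
  while (length-- > 0)
    *dst++ = *src++ ^ rc4w_next(ctx);
}
```

The keystream generation is as serial as before; the only saving is that the data is read, xored and written once per word rather than once per byte.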
But, I'd rather spend time on making salsa20 (and/or chacha) fast, than optimizing arcfour.
Regards, /Niels