nisse@lysator.liu.se (Niels Möller) writes:
And now I've added support for salsa20 benchmarking as well. Performance is pretty good, even for this reference implementation. Seems to run at 12 cycles per byte on my laptop, a little faster than 128-bit aes.
Now I've done some cleanup and reorganization work. Using memxor3 for the actual xoring saves almost 3 cycles per byte, so it's a considerable speedup in this case (and then memxor3 can maybe be sped up a bit more on this platform, from Nikos' work a while ago).
I wonder if there's any potential for simd-style parallel processing using plain 64-bit operations (no sse2 assembly). If one puts two 32-bit values in a 64-bit variable, parallel xor is trivial, but we also need 32-bit rotations and adds.
Parallel rotations are possible as two shifts and some extra masking, so it might be slightly faster than two 32-bit rotations of two shifts each.
For parallel mod 2^32 additions, one need to cancel one of the carries. Something like
uint64_t x, y, z;
z = x + y; z += (- (uint64_t) ((uint32_t) z < (uint32_t) y)) << 32;
This can hardly be faster than two 32-bit additions (two lea, cmp, cmov on x86_64, or alternatively, one lea, cmp, sbb, shl, add, depending on code genereration).
Any speedup depends on the relative speedup or slowdown of each primitive, the cost of permuting values as needed, and then the effects of reduced register pressure on top of that.
For x86_64 (and maybe x86_32), assembly implementation also seems attractive, but I'd like to do any general or 64-bit specific optimizations in the C code first.
Regards, /Niels