I've now added some basic chacha x86_64 assembly. This gives a modest speedup over the code generated by gcc-4.7.2, about 8% on this machine. Apparently, gcc is pretty good at vectorizing this (and there seems to be virtually no difference for salsa20).
I have one question regarding the different rotation counts in chacha, including 16 and 8. I think I've read that this is supposed to be advantageous on x86_64, but after reviewing the various pshuf* instructions, it's not clear how. I now do these as left shift + right shift + or. Maybe the rotate by 16 bits can be done with pshufhw + pshuflw. Or am I missing some other way to do a rotate on an %xmm register?
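For concreteness, here's how the two approaches might look as SSE2 intrinsics (a sketch with names of my choosing, not the actual assembly): the generic shift + shift + or rotate, and the rotate by 16 expressed as a 16-bit word swap via pshufhw + pshuflw.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Generic rotate-left of each 32-bit lane: pslld + psrld + por (3 ops). */
static inline __m128i
rotl_epi32(__m128i x, int n)
{
  return _mm_or_si128(_mm_slli_epi32(x, n), _mm_srli_epi32(x, 32 - n));
}

/* Rotate by exactly 16: just swap the 16-bit halves of each 32-bit lane.
   pshuflw handles the low two lanes, pshufhw the high two (2 ops, no por). */
static inline __m128i
rotl16_epi32(__m128i x)
{
  x = _mm_shufflelo_epi16(x, _MM_SHUFFLE(2, 3, 0, 1));
  return _mm_shufflehi_epi16(x, _MM_SHUFFLE(2, 3, 0, 1));
}
```

Both produce the same result for a rotate count of 16; the shuffle version trades the three logic ops for two shuffles.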
Anyway, there are other ways to optimize chacha (and salsa20), such as processing two blocks at a time (and I think we have enough registers for three chacha blocks at a time, if needed).
For simplicity, I'm considering writing an assembly _chacha_crypt function which supports only an integral number of blocks, and then letting chacha_crypt handle a final partial block using _chacha_core + memxor instead. (Half of the current x86_64/salsa20-crypt.asm is the logic to store the final partial block, for questionable benefit.) So if this works out well for chacha, the same could be done for salsa20.
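The split could look roughly like this in C (a sketch with hypothetical names: block_keystream stands in for the real _chacha_core, the inline xor loops stand in for memxor, and crypt_blocks plays the role of the blocks-only assembly _chacha_crypt):

```c
#include <stdint.h>
#include <stddef.h>

#define CHACHA_BLOCK_SIZE 64

/* Placeholder keystream generator; the real code would run the 20-round
   chacha permutation on the state for this block counter. */
static void
block_keystream(uint8_t *ks, uint64_t counter)
{
  for (size_t i = 0; i < CHACHA_BLOCK_SIZE; i++)
    ks[i] = (uint8_t) (counter + i);
}

/* Fast path: an integral number of blocks only (what the assembly
   _chacha_crypt would handle). */
static void
crypt_blocks(uint64_t *counter, size_t blocks,
             uint8_t *dst, const uint8_t *src)
{
  uint8_t ks[CHACHA_BLOCK_SIZE];
  for (size_t b = 0; b < blocks;
       b++, src += CHACHA_BLOCK_SIZE, dst += CHACHA_BLOCK_SIZE)
    {
      block_keystream(ks, (*counter)++);
      for (size_t i = 0; i < CHACHA_BLOCK_SIZE; i++)
        dst[i] = src[i] ^ ks[i];
    }
}

/* C wrapper: full blocks go to the fast path; a final partial block
   is one core call + a memxor of just the tail bytes. */
static void
chacha_crypt_sketch(uint64_t *counter, size_t length,
                    uint8_t *dst, const uint8_t *src)
{
  size_t blocks = length / CHACHA_BLOCK_SIZE;
  size_t tail = length % CHACHA_BLOCK_SIZE;

  crypt_blocks(counter, blocks, dst, src);
  if (tail)
    {
      uint8_t ks[CHACHA_BLOCK_SIZE];
      block_keystream(ks, (*counter)++);
      for (size_t i = 0; i < tail; i++)
        dst[blocks * CHACHA_BLOCK_SIZE + i] =
          src[blocks * CHACHA_BLOCK_SIZE + i] ^ ks[i];
    }
}
```

The assembly then never needs any partial-block store logic at all.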
Ah, and chacha seems to be about 15% faster than salsa20.

Regards,
/Niels
Aloha!
(Sorry for slow response to ChaCha stuff.)
Niels Möller wrote:
> I've now added some basic chacha x86_64 assembly. This gives a modest speedup over the code generated by gcc-4.7.2, about 8% on this machine. Apparently, gcc is pretty good at vectorizing this (and there seems to be virtually no difference for salsa20).
By vectorizing you mean running quarterrounds in parallel? You should be able to do at least four in parallel (if there are regs available); eight requires pipelining. I've implemented ChaCha with four parallel QRs in HW:
https://github.com/secworks/swchacha
(Which is just anecdotal to this discussion.)
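For reference, the quarterround under discussion, transcribed into plain C from the ChaCha spec (note the rotation counts 16, 12, 8, 7):

```c
#include <stdint.h>

static inline uint32_t
rotl32(uint32_t x, int n)
{
  return (x << n) | (x >> (32 - n));
}

/* One ChaCha quarterround on four 32-bit state words. */
static void
quarterround(uint32_t *a, uint32_t *b, uint32_t *c, uint32_t *d)
{
  *a += *b; *d ^= *a; *d = rotl32(*d, 16);
  *c += *d; *b ^= *c; *b = rotl32(*b, 12);
  *a += *b; *d ^= *a; *d = rotl32(*d, 8);
  *c += *d; *b ^= *c; *b = rotl32(*b, 7);
}
```

A column round applies this to the four columns of the 4x4 state, which is why four QRs are naturally independent.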
> I have one question regarding the different rotation counts in chacha, including 16 and 8. I think I've read that this is supposed to be advantageous on x86_64, but after reviewing the various pshuf* instructions, it's not clear how. I now do these as left shift + right shift + or. Maybe the rotate by 16 bits can be done with pshufhw + pshuflw. Or am I missing some other way to do a rotate on an %xmm register?
Have you looked at the asm code by DJB? He does up to four blocks in parallel and does some tricks with the shifts. xmm-5 should be relevant.
> Ah, and chacha seems to be about 15% faster than salsa20.
Which seems to match what DJB claims in the paper. Good.
--
With kind regards, Yours
Joachim Strömbergson - Always in harmonic oscillation.
Joachim Strömbergson, Secworks AB, joachim@secworks.se
Joachim Strömbergson joachim@secworks.se writes:
> By vectorizing you mean running quarterrounds in parallel?
I mean putting several uint32_t values in a SIMD register, and using SIMD instructions.
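A minimal illustration with SSE2 intrinsics (the function name is mine): four uint32_t values packed in one 128-bit register, so that a single paddd performs four 32-bit additions at once.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Four uint32_t values live in one %xmm register; one paddd
   (_mm_add_epi32) then does four 32-bit additions in parallel. */
static void
add_rows(const uint32_t a[4], const uint32_t b[4], uint32_t out[4])
{
  __m128i va = _mm_loadu_si128((const __m128i *) a);
  __m128i vb = _mm_loadu_si128((const __m128i *) b);
  _mm_storeu_si128((__m128i *) out, _mm_add_epi32(va, vb));
}
```

With the chacha state laid out as four such rows, the adds, xors and rotates of all four column quarterrounds proceed in lockstep.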
> Have you looked at the asm code by DJB?
Not really; I find the generated assembly pretty hard to read, and I haven't tried to understand his qhasm tool.
> He does up to four blocks in parallel and does some tricks with the shifts. xmm-5 should be relevant.
To me, it looks like all rotates are done with psrld + pslld. But I might be missing something. On the few machines where I have benchmarked the code (I haven't been very systematic), pshufhw + pshuflw seems to be slightly faster; it saves one por instruction.
I'm pretty sure doing a couple of blocks at a time in parallel, interleaving the instructions, will give some speedup.
Regards, /Niels
nettle-bugs@lists.lysator.liu.se