I've now added some basic chacha x86_64 assembly. This gives a modest
speedup over the code generated by gcc-4.7.2, about 8% on this machine.
Apparently, gcc is pretty good at vectorizing this (and there seems to
be virtually no difference for salsa20).
I have one question, regarding the different rotation counts in chacha,
including 16 and 8. I think I've read that this is supposed to be
advantageous on x86_64, but after reviewing the various pshuf*
instructions, it's not clear how. I now do these as left shift + right
shift + or. Maybe the rotate by 16 bits can be done with pshufhw +
pshuflw. Or am I missing some other way to do a rotate on an %xmm
register?
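To make the rotation question concrete, here is a small portable C sketch
(not the assembly itself; the names rotl32 and rotl32_bytes are made up
for illustration). It shows that a rotate by a multiple of 8 bits is just
a permutation of the four bytes of a 32-bit word, while other counts need
the shift + shift + or sequence. That would be why 16 and 8 are special:
pshuflw + pshufhw permute 16-bit words, which covers the rotate by 16, and
if SSSE3 can be assumed, pshufb permutes arbitrary bytes, which should
cover the rotate by 8 as well.

```c
#include <stdint.h>

/* Generic rotate: left shift + right shift + or. Valid for 0 < n < 32. */
static uint32_t
rotl32 (uint32_t x, unsigned n)
{
  return (x << n) | (x >> (32 - n));
}

/* Rotate left by 8*k bits (k = 1, 2 or 3) using only a byte
   permutation, the scalar analogue of a shuffle instruction. */
static uint32_t
rotl32_bytes (uint32_t x, unsigned k)
{
  uint8_t in[4], out[4];
  unsigned i;

  /* Split into bytes, least significant first. */
  for (i = 0; i < 4; i++)
    in[i] = (uint8_t) (x >> (8 * i));

  /* Output byte i is taken from input byte i - k (mod 4). */
  for (i = 0; i < 4; i++)
    out[i] = in[(i + 4 - k) % 4];

  return (uint32_t) out[0]
    | ((uint32_t) out[1] << 8)
    | ((uint32_t) out[2] << 16)
    | ((uint32_t) out[3] << 24);
}
```

With this, rotl32_bytes (x, 2) agrees with rotl32 (x, 16) and
rotl32_bytes (x, 1) agrees with rotl32 (x, 8), whereas the counts 12 and
7 have no such byte-level equivalent.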
Anyway, there are other ways to optimize chacha (and salsa20), by doing
two blocks at a time (and I think we have enough registers for doing
three chacha blocks at a time, if needed).
For simplicity, I'm considering writing an assembly _chacha_crypt
function which supports only an integral number of blocks, and then let
chacha_crypt handle a final partial block using _chacha_core + memxor
instead. (Half of the current x86_64/salsa20-crypt.asm is the logic to
store the final partial block, for questionable benefit). So if this
works out well for chacha, the same could be done for salsa20.
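A rough C sketch of that split, under loud assumptions: toy_core is a
placeholder keystream generator (it is NOT chacha), toy_crypt_blocks
stands in for the assembly function that handles whole blocks only, and
xor3 is a local helper playing the role of memxor. All names and the
context struct are hypothetical; only the control flow is the point.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_BLOCK_SIZE 64

/* Hypothetical context: a block counter and a one-byte "key". */
struct toy_ctx
{
  uint32_t counter;
  uint8_t key;
};

/* Placeholder for the core function: fills one keystream block.
   This is a toy generator, not a cipher. */
static void
toy_core (uint8_t *block, const struct toy_ctx *ctx)
{
  size_t i;
  for (i = 0; i < TOY_BLOCK_SIZE; i++)
    block[i] = ctx->key ^ (uint8_t) (ctx->counter * 31 + i);
}

/* dst = a XOR b, for n bytes. */
static void
xor3 (uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
  size_t i;
  for (i = 0; i < n; i++)
    dst[i] = a[i] ^ b[i];
}

/* Stand-in for the assembly routine: whole blocks only. */
static void
toy_crypt_blocks (struct toy_ctx *ctx, size_t blocks,
                  uint8_t *dst, const uint8_t *src)
{
  uint8_t ks[TOY_BLOCK_SIZE];
  for (; blocks > 0;
       blocks--, dst += TOY_BLOCK_SIZE, src += TOY_BLOCK_SIZE)
    {
      toy_core (ks, ctx);
      ctx->counter++;
      xor3 (dst, src, ks, TOY_BLOCK_SIZE);
    }
}

/* Wrapper: full blocks go to the fast path, and a final partial
   block is handled here with the core function plus an xor. */
static void
toy_crypt (struct toy_ctx *ctx, size_t length,
           uint8_t *dst, const uint8_t *src)
{
  size_t blocks = length / TOY_BLOCK_SIZE;
  size_t tail = length % TOY_BLOCK_SIZE;

  if (blocks > 0)
    {
      toy_crypt_blocks (ctx, blocks, dst, src);
      dst += blocks * TOY_BLOCK_SIZE;
      src += blocks * TOY_BLOCK_SIZE;
    }
  if (tail > 0)
    {
      uint8_t ks[TOY_BLOCK_SIZE];
      toy_core (ks, ctx);
      ctx->counter++;
      xor3 (dst, src, ks, tail);
    }
}
```

The fast path never has to know about partial blocks, so all the
store-the-tail logic lives in a few lines of C.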
Ah, and chacha seems to be about 15% faster than salsa20.
Regards,
/Niels
--
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.