I've now added some basic chacha x86_64 assembly. This gives a modest speedup over the code generated by gcc-4.7.2, about 8% on this machine. Apparently, gcc is pretty good at vectorizing this (and there seems to be virtually no difference for salsa20).
I have one question regarding the different rotation counts in chacha, including 16 and 8. I think I've read that this is supposed to be advantageous on x86_64, but after reviewing the various pshuf* instructions, it's not clear how. I currently do these as left shift + right shift + or. Maybe the rotate by 16 bits can be done with pshufhw + pshuflw. Or am I missing some other way to do a rotate on an %xmm register?
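Roughly, in C intrinsics (just for illustration; the .asm would use the corresponding pslld/psrld/por and pshuflw/pshufhw instructions), the two variants would look something like:

#include <emmintrin.h>  /* SSE2 */

/* Rotate each 32-bit lane left by n bits, as left shift + right shift
   + or (pslld + psrld + por, plus a register copy). */
static inline __m128i
rotl_epi32(__m128i x, int n)
{
  return _mm_or_si128(_mm_slli_epi32(x, n), _mm_srli_epi32(x, 32 - n));
}

/* Rotate each 32-bit lane left by 16, by swapping the 16-bit halves of
   every lane with pshuflw + pshufhw; no extra register needed. */
static inline __m128i
rotl16_epi32(__m128i x)
{
  x = _mm_shufflelo_epi16(x, _MM_SHUFFLE(2, 3, 0, 1));
  return _mm_shufflehi_epi16(x, _MM_SHUFFLE(2, 3, 0, 1));
}

(The rotate by 8 could presumably be done as a single byte shuffle with pshufb, but that needs SSSE3.)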
Anyway, there are other ways to optimize chacha (and salsa20), by doing two blocks at a time (and I think we have enough registers for doing three chacha blocks at a time, if needed).
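Just to illustrate the interleaving idea in scalar C (in the .asm, each block's four state rows would of course live in their own %xmm registers; names here are made up):

#include <stdint.h>

static inline uint32_t
rotl32(uint32_t x, unsigned n)
{
  return (x << n) | (x >> (32 - n));
}

/* One chacha quarter round applied to two independent blocks, with the
   steps interleaved so that block 1's operations can fill the latency
   of block 0's dependency chain. x0 and x1 each point at the four
   words a, b, c, d of their block. */
static void
chacha_qround_x2(uint32_t *x0, uint32_t *x1)
{
  x0[0] += x0[1];            x1[0] += x1[1];
  x0[3] ^= x0[0];            x1[3] ^= x1[0];
  x0[3] = rotl32(x0[3], 16); x1[3] = rotl32(x1[3], 16);

  x0[2] += x0[3];            x1[2] += x1[3];
  x0[1] ^= x0[2];            x1[1] ^= x1[2];
  x0[1] = rotl32(x0[1], 12); x1[1] = rotl32(x1[1], 12);

  x0[0] += x0[1];            x1[0] += x1[1];
  x0[3] ^= x0[0];            x1[3] ^= x1[0];
  x0[3] = rotl32(x0[3], 8);  x1[3] = rotl32(x1[3], 8);

  x0[2] += x0[3];            x1[2] += x1[3];
  x0[1] ^= x0[2];            x1[1] ^= x1[2];
  x0[1] = rotl32(x0[1], 7);  x1[1] = rotl32(x1[1], 7);
}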
For simplicity, I'm considering writing an assembly _chacha_crypt function which supports only an integral number of blocks, and then letting chacha_crypt handle a final partial block using _chacha_core + memxor instead. (Half of the current x86_64/salsa20-crypt.asm is the logic to store the final partial block, for questionable benefit.) So if this works out well for chacha, the same could be done for salsa20.
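The wrapper could then look roughly like this (plain C sketch; the names, prototypes and counter handling are assumptions for illustration, not the actual interface, and byte order details are glossed over):

#include <stddef.h>
#include <stdint.h>

#define CHACHA_BLOCK_SIZE 64

struct chacha_ctx { uint32_t state[16]; };

/* Assembly routine: handles a whole number of blocks only. */
void _chacha_crypt(struct chacha_ctx *ctx, size_t length,
                   uint8_t *dst, const uint8_t *src);
/* Produces one 64-byte block of keystream from the state. */
void _chacha_core(uint32_t *dst, const uint32_t *src, unsigned rounds);

void
chacha_crypt(struct chacha_ctx *ctx, size_t length,
             uint8_t *dst, const uint8_t *src)
{
  /* Round down to an integral number of blocks for the assembly part. */
  size_t full = length - (length % CHACHA_BLOCK_SIZE);

  if (full > 0)
    _chacha_crypt(ctx, full, dst, src);

  if (length > full)
    {
      /* Final partial block: one block of keystream via _chacha_core,
         then xor just the remaining bytes (nettle's memxor would do). */
      union { uint32_t w[16]; uint8_t b[CHACHA_BLOCK_SIZE]; } block;
      size_t left = length - full;
      size_t i;

      _chacha_core(block.w, ctx->state, 20);
      ctx->state[12]++;  /* advance the block counter (assumed word 12) */

      for (i = 0; i < left; i++)
        dst[full + i] = src[full + i] ^ block.b[i];
    }
}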
Ah, and chacha seems to be about 15% faster than salsa20.

Regards,
/Niels