I'm trying to learn a bit of ppc assembly. Below is an implementation of _chacha_core. It seems to work when tested on gcc112.fsffrance.org (just put the file in the powerpc64 directory and reconfigure). This machine is little-endian; I haven't yet tested on big-endian.
Great work. The implementation looks fine. I like the idea of using -16 instead of 16 as the rotate count: the vspltisw immediate is limited to the range -16 to 15, and vrlw uses only the low-order 5 bits of each word element, which are the same for -16 and 16. BTW, this implementation should work as is in big-endian mode without any hassle, because lxvw4x/stxvw4x load and store word elements in an endianness-aware way.
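To make the -16 trick concrete, here is a small C sketch of the arithmetic. It is only a scalar model of one word lane, not the assembly itself, and the helper names splat_imm and rotl32_low5 are made up for illustration.

```c
#include <stdint.h>

/* Model of the value vspltisw places in each word: the 5-bit signed
   immediate (range -16..15), sign-extended to 32 bits.
   Hypothetical helper, for illustration only. */
static uint32_t splat_imm(int simm)
{
  return (uint32_t) (int32_t) simm;
}

/* Model of vrlw on a single word lane: rotate left by the low-order
   5 bits of the count; all higher bits of the count are ignored. */
static uint32_t rotl32_low5(uint32_t x, uint32_t count)
{
  unsigned n = count & 31;
  /* The "& 31" on the right shift avoids an undefined shift by 32
     when n is 0. */
  return (x << n) | (x >> ((32 - n) & 31));
}
```

With these, splat_imm(-16) is 0xfffffff0, whose low 5 bits are 10000 binary, i.e. 16, so rotl32_low5(x, splat_imm(-16)) rotates by 16 even though 16 itself is not encodable as a vspltisw immediate.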
Unfortunately I don't get any accurate benchmark numbers on that machine, but I think the speedup may be on the order of 50%. It could likely be sped up further by processing 2, 3 or 4 blocks in parallel, similar to recent improvements for arm and x86_64. I'd like to do that after the simpler single-block function is properly merged.
I can benchmark the optimized core, but it could take me a few days to get it done. You may want to try Unicamp Minicloud (https://openpower.ic.unicamp.br/minicloud) or the POWER Cloud at OSU (http://osuosl.org/services/powerdev). Unicamp Minicloud offers good POWER instances and would approve your request in two days.
I'm not sure where it fits under powerpc64. The code doesn't need any cryptographic extensions, but it depends on vector instructions as well as VSX registers (for the unaligned load and store instructions). So I'd need advice both on the directory hierarchy and compile time configuration, and appropriate runtime tests for fat builds.
The VSX instructions were introduced in Power ISA v2.06, so since you have used the VSX instructions lxvw4x/stxvw4x, the minimum processor you are targeting is POWER7. We can add a new config option like "--enable-power-vsx" that enables this optimization.
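For the runtime test in fat builds, one possibility on Linux is to check the VSX bit in AT_HWCAP. The following is only a rough sketch under that assumption, not existing code: the function name cpu_has_vsx and the macro HWCAP_VSX_BIT are made up here, and the bit value is copied from PPC_FEATURE_HAS_VSX in the kernel's asm/cputable.h.

```c
#if defined(__linux__) && defined(__powerpc__)
# include <sys/auxv.h>   /* getauxval, AT_HWCAP (glibc 2.16 or later) */
#endif

/* Same value as PPC_FEATURE_HAS_VSX in the Linux kernel's
   <asm/cputable.h>; defined locally (hypothetical name) so the
   sketch also compiles on other platforms. */
#define HWCAP_VSX_BIT 0x00000080UL

/* Returns 1 if the kernel reports VSX support, 0 otherwise.
   On anything other than Linux/PowerPC it always returns 0. */
static int cpu_has_vsx(void)
{
#if defined(__linux__) && defined(__powerpc__)
  return (getauxval(AT_HWCAP) & HWCAP_VSX_BIT) != 0;
#else
  return 0;
#endif
}
```

A fat build could call something like this once at startup and select the VSX version of _chacha_core only when it returns 1, falling back to the C implementation otherwise.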