Hi,

I tried out a micro-optimization of the ARM NEON implementations of chacha and salsa20. It gave a 10% speedup on the older Cortex-A5 core, but it's unclear whether it's an improvement overall, so I don't want to push it to master. I've removed that commit from master-updates; it's now on its own branch, arm-salsa20-chacha-vsra, in case anyone is curious.

I'm considering changing the internal _salsa20_core and _chacha_core functions to do more than one block at a time, since processing a few blocks in parallel has great potential for performance improvements.
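To illustrate the multi-block idea, here is a rough portable C sketch, not the actual Nettle code and not NEON assembly. The names sketch_chacha_core and sketch_chacha_2core, the argument order, and the assumption that the block counter is the 64-bit value in words 12 and 13 are all made up for the example. The two blocks are independent, so their quarter rounds can be interleaved, which is what gives a SIMD or superscalar core more work to overlap.

```c
#include <stdint.h>
#include <string.h>

#define ROTL32(x, n) (((x) << (n)) | ((x) >> (32 - (n))))

/* Standard ChaCha quarter round on four words of the state x. */
#define QROUND(x, a, b, c, d) do {                    \
    x[a] += x[b]; x[d] = ROTL32(x[d] ^ x[a], 16);     \
    x[c] += x[d]; x[b] = ROTL32(x[b] ^ x[c], 12);     \
    x[a] += x[b]; x[d] = ROTL32(x[d] ^ x[a], 8);      \
    x[c] += x[d]; x[b] = ROTL32(x[b] ^ x[c], 7);      \
  } while (0)

/* Hypothetical single-block core: dst = permute(src) + src. */
static void
sketch_chacha_core(uint32_t *dst, const uint32_t *src, unsigned rounds)
{
  uint32_t x[16];
  unsigned i;

  memcpy(x, src, sizeof(x));
  for (i = 0; i < rounds; i += 2)
    {
      /* Column round, then diagonal round. */
      QROUND(x, 0, 4, 8, 12);  QROUND(x, 1, 5, 9, 13);
      QROUND(x, 2, 6, 10, 14); QROUND(x, 3, 7, 11, 15);
      QROUND(x, 0, 5, 10, 15); QROUND(x, 1, 6, 11, 12);
      QROUND(x, 2, 7, 8, 13);  QROUND(x, 3, 4, 9, 14);
    }
  for (i = 0; i < 16; i++)
    dst[i] = x[i] + src[i];
}

/* Hypothetical two-block core: writes 32 words to dst, for the counter
   value in src and for that value plus one.  Assumes a 64-bit counter
   in words 12 (low) and 13 (high). */
static void
sketch_chacha_2core(uint32_t *dst, const uint32_t *src, unsigned rounds)
{
  uint32_t src1[16], x[16], y[16];
  unsigned i;

  memcpy(src1, src, sizeof(src1));
  if (++src1[12] == 0)
    src1[13]++;

  memcpy(x, src, sizeof(x));
  memcpy(y, src1, sizeof(y));
  for (i = 0; i < rounds; i += 2)
    {
      /* Interleave the two independent blocks. */
      QROUND(x, 0, 4, 8, 12);  QROUND(y, 0, 4, 8, 12);
      QROUND(x, 1, 5, 9, 13);  QROUND(y, 1, 5, 9, 13);
      QROUND(x, 2, 6, 10, 14); QROUND(y, 2, 6, 10, 14);
      QROUND(x, 3, 7, 11, 15); QROUND(y, 3, 7, 11, 15);
      QROUND(x, 0, 5, 10, 15); QROUND(y, 0, 5, 10, 15);
      QROUND(x, 1, 6, 11, 12); QROUND(y, 1, 6, 11, 12);
      QROUND(x, 2, 7, 8, 13);  QROUND(y, 2, 7, 8, 13);
      QROUND(x, 3, 4, 9, 14);  QROUND(y, 3, 4, 9, 14);
    }
  for (i = 0; i < 16; i++)
    {
      dst[i] = x[i] + src[i];
      dst[16 + i] = y[i] + src1[i];
    }
}
```

A caller would invoke sketch_chacha_2core(dst, state, 20) and then advance its own counter by two. Whether this pays off in practice depends on the core, so it would need benchmarking in the same way as the change above.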
Regards,
/Niels
nettle-bugs@lists.lysator.liu.se