I've written some new ARM Neon assembly for salsa20, which is used when configured with --enable-arm-neon. See https://gitlab.com/gnutls/nettle/-/commit/2ac58a1ce729a6cfe1d3703f4deb6da886...
It interleaves the processing of two blocks, which gives a speedup of 50% -- 100% on the ARM cores where I've tested it. Before merging, I need to fix fat builds to use the new code on processors that support it.
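To illustrate the interleaving idea, here is a rough scalar C sketch. It is not the Neon code from the commit; the names example_salsa20_core and example_salsa20_core_2 are made up for this example. It assumes the usual salsa20 state layout, with the 64-bit block counter in words 8 and 9. The two-block version runs each quarter-round step on both blocks back to back, so the two dependency chains are independent and a processor can overlap them.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ROTL32(x, n) (((x) << (n)) | ((x) >> (32 - (n))))

/* Salsa20 quarter round on words a, b, c, d of state s. */
#define QROUND(s, a, b, c, d) do {      \
    s[b] ^= ROTL32(s[a] + s[d], 7);     \
    s[c] ^= ROTL32(s[b] + s[a], 9);     \
    s[d] ^= ROTL32(s[c] + s[b], 13);    \
    s[a] ^= ROTL32(s[d] + s[c], 18);    \
  } while (0)

/* The same quarter round on two independent states s and t,
   interleaved step by step. */
#define QROUND2(s, t, a, b, c, d) do {                                  \
    s[b] ^= ROTL32(s[a] + s[d], 7);   t[b] ^= ROTL32(t[a] + t[d], 7);   \
    s[c] ^= ROTL32(s[b] + s[a], 9);   t[c] ^= ROTL32(t[b] + t[a], 9);   \
    s[d] ^= ROTL32(s[c] + s[b], 13);  t[d] ^= ROTL32(t[c] + t[b], 13);  \
    s[a] ^= ROTL32(s[d] + s[c], 18);  t[a] ^= ROTL32(t[d] + t[c], 18);  \
  } while (0)

/* Reference: one block at a time. dst and src hold 16 words each. */
void
example_salsa20_core(uint32_t *dst, const uint32_t *src, unsigned rounds)
{
  uint32_t x[16];
  unsigned i;

  assert(rounds % 2 == 0);
  memcpy(x, src, sizeof(x));
  for (i = 0; i < rounds; i += 2)
    {
      /* Column round */
      QROUND(x, 0, 4, 8, 12);   QROUND(x, 5, 9, 13, 1);
      QROUND(x, 10, 14, 2, 6);  QROUND(x, 15, 3, 7, 11);
      /* Row round */
      QROUND(x, 0, 1, 2, 3);    QROUND(x, 5, 6, 7, 4);
      QROUND(x, 10, 11, 8, 9);  QROUND(x, 15, 12, 13, 14);
    }
  for (i = 0; i < 16; i++)
    dst[i] = x[i] + src[i];
}

/* Two consecutive blocks, interleaved. dst holds 32 words; the second
   block uses the counter from src incremented by one. */
void
example_salsa20_core_2(uint32_t *dst, const uint32_t *src, unsigned rounds)
{
  uint32_t in1[16], x[16], y[16];
  unsigned i;

  assert(rounds % 2 == 0);
  memcpy(in1, src, sizeof(in1));
  /* Assumed layout: 64-bit counter, low word 8, high word 9. */
  if (++in1[8] == 0)
    in1[9]++;
  memcpy(x, src, sizeof(x));
  memcpy(y, in1, sizeof(y));
  for (i = 0; i < rounds; i += 2)
    {
      /* Column round on both blocks */
      QROUND2(x, y, 0, 4, 8, 12);   QROUND2(x, y, 5, 9, 13, 1);
      QROUND2(x, y, 10, 14, 2, 6);  QROUND2(x, y, 15, 3, 7, 11);
      /* Row round on both blocks */
      QROUND2(x, y, 0, 1, 2, 3);    QROUND2(x, y, 5, 6, 7, 4);
      QROUND2(x, y, 10, 11, 8, 9);  QROUND2(x, y, 15, 12, 13, 14);
    }
  for (i = 0; i < 16; i++)
    {
      dst[i] = x[i] + src[i];
      dst[16 + i] = y[i] + in1[i];
    }
}
```

In C the compiler is free to reschedule these statements anyway, so the sketch only shows the data flow. In hand-written assembly the instruction order is explicit, which is where the interleaving should matter.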
To make it work on big-endian ARM as well, I'd need some help. (I think the qemu-user package supports big-endian ARM; at least, it includes a program named qemu-armeb. But I'm missing a cross compiler and cross debugger.)
I'd like to do the same for x86_64. And for chacha, it might give an even greater speedup to interleave the processing of three blocks, which may be possible since I think chacha needs fewer registers for temporaries.
For both x86_64 and ARM Neon, the current code uses 128-bit wide registers. Processors with 256-bit wide SIMD registers (at least 16 of them) could do twice as many blocks at a time.
Regards, /Niels