nisse@lysator.liu.se (Niels Möller) writes:
Thanks for investigating. So from these charts, it looks like the single-block Neon code is of no benefit on any of the test systems, and is even significantly slower on the tinkerboard and rpi4.
If that's right, the code should probably just be deleted. But I'll have to do a little benchmarking on my own before doing that.
I've done a benchmark run of nettle-3.6 on the GMP "nanot2" system, with a Cortex-A9 processor. The installed compiler is gcc-5.4 (a few years old). This is what I get:
nisse@nanot2:~/build$ nettle-nanot2-noasm/config.status --version
nettle config.status 3.6
configured by /home/nisse/hack/nettle-3.6/configure, generated by GNU Autoconf 2.69,
  with options "'--disable-shared' '--disable-assembler'"
nisse@nanot2:~/build$ nettle-nanot2-noasm/examples/nettle-benchmark -f 1.4e9 salsa20
benchmark call overhead: 0.006500 us 9.10 cycles
Algorithm   mode     Mbyte/s  cycles/byte  cycles/block
salsa20     encrypt    78.52        17.00       1088.22
salsa20     decrypt    78.52        17.00       1088.22
salsa20r12  encrypt   111.62        11.96        765.57
salsa20r12  decrypt   111.62        11.96        765.57
nisse@nanot2:~/build$ nettle-nanot2-noasm/examples/nettle-benchmark -f 1.4e9 chacha
benchmark call overhead: 0.006500 us 9.10 cycles
Algorithm   mode     Mbyte/s  cycles/byte  cycles/block
chacha      encrypt    66.21        20.17       1290.57
chacha      decrypt    66.21        20.17       1290.57
-------------
nisse@nanot2:~/build$ nettle-nanot2-neon/config.status --version
nettle config.status 3.6
configured by /home/nisse/hack/nettle-3.6/configure, generated by GNU Autoconf 2.69,
  with options "'--disable-shared' '--enable-arm-neon'"
nisse@nanot2:~/build$ nettle-nanot2-neon/examples/nettle-benchmark -f 1.4e9 salsa20
benchmark call overhead: 0.006450 us 9.03 cycles
Algorithm   mode     Mbyte/s  cycles/byte  cycles/block
salsa20     encrypt    74.41        17.94       1148.38
salsa20     decrypt    74.41        17.94       1148.38
salsa20r12  encrypt   113.56        11.76        752.44
salsa20r12  decrypt   113.56        11.76        752.44
nisse@nanot2:~/build$ nettle-nanot2-neon/examples/nettle-benchmark -f 1.4e9 chacha
benchmark call overhead: 0.006438 us 9.01 cycles
Algorithm   mode     Mbyte/s  cycles/byte  cycles/block
chacha      encrypt    75.12        17.77       1137.44
chacha      decrypt    75.12        17.77       1137.44
So no big differences, but the neon code improves performance slightly for chacha (about 13% higher throughput) and salsa20r12 (about 2%), and degrades performance slightly for salsa20 (about 5% lower throughput).
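For anyone who wants to redo the arithmetic on the tables above, here is a minimal sketch in C. It assumes that the Mbyte/s column counts 2^20 bytes per Mbyte, which is only inferred from the reported numbers (1.4e9 / (78.52 * 2^20) comes out at about 17.00), not checked against the nettle-benchmark source. The function names are made up for illustration.

```c
#include <stdio.h>

/* Assumption: nettle-benchmark's Mbyte/s uses 2^20 bytes per Mbyte.
   This is inferred from the figures in this mail, not from the source. */
#define BYTES_PER_MBYTE 1048576.0

/* Cycles per byte, given the clock frequency passed with -f (in Hz)
   and the measured throughput in Mbyte/s. */
double cycles_per_byte(double freq_hz, double mbyte_per_s)
{
  return freq_hz / (mbyte_per_s * BYTES_PER_MBYTE);
}

/* Relative throughput change in percent when going from the C build
   to the Neon build; a positive value means the Neon build is faster. */
double relative_change_percent(double c_mbyte_per_s, double neon_mbyte_per_s)
{
  return 100.0 * (neon_mbyte_per_s - c_mbyte_per_s) / c_mbyte_per_s;
}

/* Hypothetical helper: print one comparison line for an algorithm. */
void print_comparison(const char *name, double c_rate, double neon_rate)
{
  printf("%-10s %7.2f -> %7.2f Mbyte/s (%+.1f%%)\n",
         name, c_rate, neon_rate,
         relative_change_percent(c_rate, neon_rate));
}
```

Feeding in the measured rates, for example print_comparison("chacha", 66.21, 75.12), gives roughly +13% for chacha, a bit under +2% for salsa20r12 (111.62 vs 113.56) and roughly -5% for salsa20 (78.52 vs 74.41).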
I had a quick look at the disassembly of the C implementations, and they use a fair amount of loads and stores to the stack in the loop (since there are too few general purpose registers for the state to fit). But maybe the code is scheduled well enough that many instructions can be executed in parallel. Compare that to the neon code, which does more work per instruction, but with dependencies forcing sequential execution of the instructions.
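To make the parallelism argument concrete, here is a sketch of the standard ChaCha double round in plain C, following the quarter round definition in RFC 8439. This is not nettle's code; the names rotl32, quarter_round and double_round are only for illustration. The point is that the four quarter rounds within a column round (or a diagonal round) touch disjoint state words, so a scalar core has four independent dependency chains to interleave. A single-block vector version would presumably do each step for all four columns in one instruction, leaving one chain where every instruction waits for the previous one.

```c
#include <stdint.h>

/* Rotate a 32-bit word left by n bits (0 < n < 32). */
uint32_t rotl32(uint32_t v, unsigned n)
{
  return (v << n) | (v >> (32 - n));
}

/* Standard ChaCha quarter round on words a, b, c, d of the 16-word state.
   Each statement depends on the one before it: a serial chain. */
void quarter_round(uint32_t x[16], int a, int b, int c, int d)
{
  x[a] += x[b]; x[d] = rotl32(x[d] ^ x[a], 16);
  x[c] += x[d]; x[b] = rotl32(x[b] ^ x[c], 12);
  x[a] += x[b]; x[d] = rotl32(x[d] ^ x[a], 8);
  x[c] += x[d]; x[b] = rotl32(x[b] ^ x[c], 7);
}

/* One double round: a column round followed by a diagonal round. */
void double_round(uint32_t x[16])
{
  /* Column round: the four calls use disjoint words, so they are
     independent of each other and can overlap in the pipeline. */
  quarter_round(x, 0, 4,  8, 12);
  quarter_round(x, 1, 5,  9, 13);
  quarter_round(x, 2, 6, 10, 14);
  quarter_round(x, 3, 7, 11, 15);

  /* Diagonal round: again four mutually independent quarter rounds. */
  quarter_round(x, 0, 5, 10, 15);
  quarter_round(x, 1, 6, 11, 12);
  quarter_round(x, 2, 7,  8, 13);
  quarter_round(x, 3, 4,  9, 14);
}
```

With 16 state words plus temporaries and only a handful of free general purpose registers on 32-bit ARM, some of these words have to live on the stack, which would explain the loads and stores seen in the disassembly.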
Regards,
/Niels