Happy new year, Niels and all around,
On Wed, Dec 30, 2020 at 09:12:24PM +0100, Niels Möller wrote:
> > It comes out at around seven cycles per block slowdown for chacha-3core and five for salsa20-2core. I trace this to vst1.8. It's just slower
>
> Thanks for investigating. Maybe keep some IF_BE / IF_LE just for the store instructions, to stay with vstm on little-endian?
Sounds good. I'll try to finalise a patch and reconfirm that there's no speed regression from it.
> > Does this seem reasonable or does it point to some flaw in my benchmarking or system software/hardware?
>
> That's unexpected. In principle I guess it's possible for the C compiler to generate great vectorized code, but that seems a bit unlikely. Do you get the same results if you build Nettle-3.6?
With the help of Jeff I've gone on a bit of a benchmark binge using a:
- Raspberry Pi 1B (Broadcom BCM2835, arm11),
- Cubieboard2 (Allwinner A20, Cortex-A7),
- Wandboard (Freescale i.MX6 DualLite, Cortex-A9),
- Tinkerboard (Rockchip RK3288, Cortex-A17) and
- Raspberry Pi 4 (Broadcom BCM2711, Cortex-A72).
The rpi1b doesn't do NEON, so there are no NEON numbers for it. I booted the rpi4 with Ubuntu 20.04 armhf with arm32 kernel and userland to avoid any influence of switches from/to 64-bit mode. Some other metrics of the systems (such as compiler) and the build commands used are in the attached result notes. The Debian and Ubuntu systems had cpufreq activated. Since I didn't want to mess with that, I ran the benchmark multiple times in a loop to get cpufreq to scale up.
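For anyone wanting to repeat the warm-up approach, a minimal sketch of such a loop follows. The helper name warm_and_run and the run count are made up for illustration, and the benchmark invocation in the usage comment is an assumption about the build tree, not the exact command from the attached notes.

```shell
# Hypothetical helper: run a command N times with output discarded so the
# cpufreq governor has a chance to ramp up, then run it once more and keep
# that final output as the measurement.
warm_and_run() {
    runs=$1
    shift
    i=0
    while [ "$i" -lt "$runs" ]; do
        "$@" >/dev/null 2>&1
        i=$((i + 1))
    done
    "$@"
}

# Usage (path and algorithm names are assumptions):
# warm_and_run 5 ./examples/nettle-benchmark salsa20 chacha
```

Whether a handful of warm-up runs is enough depends on the governor and its sampling rate, so the count would need checking per board.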
I've put together a small script that parses the manual notes for plotting using gnuplot. That produced the attached charts, which are quite interesting.
t=$(mktemp)
python3 nettle-arm-bench.py <nettle-arm-bench.txt >"$t"
gnuplot -e "set term pngcairo font 'sans,9' size 960, 540; set style data histograms; set ylabel 'cycles/block'; set xtics rotate out; set style fill solid border; set style histogram clustered; plot for [COL=2:5] '$t' using COL:xticlabels(1) title columnheader;" >nettle-arm-bench-chart.png
rm -f "$t"
> From ChangeLog comments, it seems I got 45% speedup for Salsa20, compared to the C implementation, when I wrote the original neon assembly code. At the time, benchmarked on a pandaboard (cortex a9), if I remember correctly.
I've disassembled an example of what the C compiler produces (I think chacha-core-internal.o) and there were no NEON instructions in there. At first glance it looked very similar to the armv6 assembler code.
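A quick way to make that check repeatable is to count vector mnemonics in the disassembly. The following is only a rough sketch: the function name count_vector_insns is made up, it assumes objdump's usual tab-separated layout (address, encoding, mnemonic, operands), and it treats every mnemonic starting with "v" as a vector instruction, which also matches VFP. For integer-only cipher code any hit would still be worth a look.

```shell
# Hypothetical helper: read `objdump -d` output on stdin and print the number
# of instruction lines whose mnemonic (third tab-separated field) starts
# with "v", i.e. NEON or VFP instructions in unified syntax.
count_vector_insns() {
    awk -F'\t' 'NF >= 3 && $3 ~ /^v[a-z]/ { n++ } END { print n + 0 }'
}

# Usage (object file name as guessed above; adjust to the build tree):
# objdump -d chacha-core-internal.o | count_vector_insns
```

A count of zero would match what I saw by eye.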
BTW: The compilers default to their respective architecture, so would produce armv5 code on the rpi1b and armv7 on tinkerboard/wandboard/cubieboard2/rpi4.
If these numbers are correct, it would seem that gcc got a *lot* better in optimising for ARM in recent versions. And ARM seems to have continuously improved native ARM instruction performance but NEON has been stagnant.
What confuses me is that the arm, armv6 and neon routines all give approximately the same speed. I'd have expected some visible difference there. Maybe I'm still just doing something wrong here?
At least the numbers rule out some peculiarity of the Cubieboards or my Gentoo installation, IMO.
> Is it for a fat build? If so, it's possible that the fat setup logic selects the C implementation in this hacked setup (but on the other hand, I'd guess a fat build may just fail at link time if these files are removed).
I did not enable fat for nettle 3.6 and explicitly disabled it for master. I forced selection of specific routines using configure options.