Hello Niels,
On Fri, Jan 01, 2021 at 06:07:14PM +0100, Niels Möller wrote:
>> With the help of Jeff I've gone on a bit of a benchmark binge using a:
>> - Raspberry Pi 1B (Broadcom BCM2835, arm11),
>> - Cubieboard2 (Allwinner A20, Cortex-A7),
>> - Wandboard (Freescale i.MX6 DualLite, Cortex-A9),
>> - Tinkerboard (Rockchip RK3288, Cortex-A17) and
>> - Raspberry Pi 4 (Broadcom BCM2711, Cortex-A72).
> Thanks for investigating. So from these charts, it looks like the single-block Neon code is of no benefit on any of the test systems. And even significantly slower on the tinkerboard and rpi4.
Attached is the new patch that unconditionally switches the loads from vldm to vld1.32 but, for the stores on little-endian, keeps vstm instead of switching to vst1.8.
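To make the load-side change concrete for anyone reading along without the patch: at the C/NEON-intrinsics level it amounts to roughly the following. This is only a sketch, not the actual Nettle assembly, and the function and variable names are made up.

  #include <arm_neon.h>
  #include <stdint.h>

  /* Illustrative sketch only, not the Nettle sources.  vld1q_u32 is the
     intrinsic spelling of vld1.32 {..}, [r]: it loads 32-bit elements
     one at a time, so each lane holds the correct word value on both
     little- and big-endian.  vldm is not element-aware in that sense,
     which is why the unmodified fast paths are broken on big-endian. */
  static inline void
  load_state_words(uint32x4_t x[4], const uint32_t *state)
  {
    x[0] = vld1q_u32(state + 0);
    x[1] = vld1q_u32(state + 4);
    x[2] = vld1q_u32(state + 8);
    x[3] = vld1q_u32(state + 12);
  }

The store side is the analogous trade-off, which is why the attached patch keeps vstm only on LE, where it already produces the expected byte order.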
I've done some additional benchmarks to verify the impact on performance. As before I've used the wandboard, tinkerboard and rpi, plus cubieboard2s in little- and big-endian mode. This time I switched the cpufreq governor of the first three to "performance" to get more stable numbers, which helped noticeably (@Jeff: I switched your boxes back to ondemand afterwards). I also did ten consecutive runs of the benchmark and naively averaged the numbers (see the attached raw data document). With another python script (attached) I then created another chart using gnuplot [1].
This time I've normalised the numbers to percentages, with unmodified master as the reference, to give a clearer indication of very small changes. The first and fourth bar of each group (master and master-no23core) therefore represent 100 percent for the two bars following each of them. The second bar (-unified) shows the values for the attached patch, and the third bar (-unified-full) those for the previous patch which unconditionally used vst1. -no23core again shows performance with chacha-2core and salsa-3core disabled. Since the underlying numbers are cycles per block, values above 100 percent mean slower.
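In case the post-processing is unclear from the description, it really is just averaging and division. Here's a toy sketch in C; the real thing is the attached python script, and the figures below are made up, not taken from any of the boards.

  #include <stdio.h>

  #define RUNS 10

  /* Toy sketch of the post-processing described above: average ten
     benchmark runs and express the result as a percentage of the
     matching baseline (master, or master-no23core for the -no23core
     bars).  All numbers are made up. */
  static double
  average(const double v[RUNS])
  {
    double sum = 0.0;
    for (int i = 0; i < RUNS; i++)
      sum += v[i];
    return sum / RUNS;
  }

  int
  main(void)
  {
    double master[RUNS]  = {1001, 999, 1000, 1002, 998,
                            1000, 1001, 999, 1000, 1000};
    double unified[RUNS] = {1024, 1022, 1023, 1025, 1021,
                            1023, 1024, 1022, 1023, 1023};

    /* Cycles/block relative to master; >100 means slower. */
    printf("master   = 100.0%%\n");
    printf("-unified = %.1f%%\n",
           100.0 * average(unified) / average(master));
    return 0;
  }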
The graph shows the expected slowdown when using vst1 on the cubieboard and wandboard. The slowdown for the big-endian cubieboard (second cluster) can be ignored, because the faster routines on unmodified master are broken there. So the second and third bar just show the performance that has to be sacrificed to get them working, compared to LE.
On the cubieboard, wandboard and tinkerboard there's still a small overhead from the switch to vld1.32, which was not reliably visible in my earlier benchmarks.
What's interesting is that on both the tinkerboard and rpi4 there are also speedups from the switch to vld1.32 and even vst1.8 (the latter also on the wandboard, but only for the likely irrelevant single-core routines). So it seems the performance penalty isn't set in stone and might differ between generations and implementations.
From that point of view, accepting the slight performance hit of vld1.32 while keeping vstm on LE seems the best compromise, at least for the benchmarked set of machines.
Do you have any idea why the wandboard, tinkerboard and rpi4 show speedups with vst1.8 for one algorithm but slowdowns for the other, and even contradict each other on which is which? Does it make sense to dig into that some more, or should we leave it be for now?
[1] t=$(mktemp)
    cat nettle-arm-bench-2.txt | python3 nettle-arm-bench-2.py >$t
    gnuplot -e "set term pngcairo font 'sans,9' size 960, 540; \
        set style data histograms; set ylabel 'cycles/block'; set yrange [98:]; \
        set xtics rotate out; set style fill solid border; \
        set style histogram clustered; \
        plot for [COL=2:7] '$t' using COL:xticlabels(1) title columnheader;" \
      >nettle-arm-bench-chart-2.png
    rm -f "$t"
>> What confuses me is that the arm, armv6 and neon routines all give approximately the same speed. I'd have expected some visible difference.
> If you look specifically at salsa20 and chacha performance, there's no arm or armv6 assembly, so arm, armv6 and noasm should all use the C implementation. While neon will run different code (unless something is
Duh. So the slight differences were most likely due to the native arm assembly memxor routines.