Michael Weiser <michael.weiser@gmx.de> writes:
Happy new year, Niels and all around,
On Wed, Dec 30, 2020 at 09:12:24PM +0100, Niels Möller wrote:
It comes out at a slowdown of around seven cycles per block for chacha-3core and five for salsa20-2core. I trace this to vst1.8. It's just slower.
Thanks for investigating. Maybe keep some IF_BE / IF_LE just for the store instructions, to stay with vstm on little-endian?
Sounds good. I'll try to finalise a patch and reconfirm that there's no speed regression from it.
Sounds good!
With the help of Jeff I've gone on a bit of a benchmark binge using a:
- Raspberry Pi 1B (Broadcom BCM2835, arm11),
- Cubieboard2 (Allwinner A20, Cortex-A7),
- Wandboard (Freescale i.MX6 DualLite, Cortex-A9),
- Tinkerboard (Rockchip RK3288, Cortex-A17) and
- Raspberry Pi 4 (Broadcom BCM2711, Cortex-A72).
The rpi1b doesn't do NEON, so there are no numbers for that. I booted the rpi4 with Ubuntu 20.04 armhf with arm32 kernel and userland to avoid any influence from switches to and from 64-bit mode. Some other metrics of the systems (such as compiler) and the build commands used are in the attached result notes. The Debian and Ubuntu systems had cpufreq activated. Since I didn't want to mess with that, I ran the benchmark multiple times in a loop to get cpufreq to scale up.
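The warm-up loop described above could be sketched roughly as follows. This is only an illustration: the function name, the number of warm-up and measured runs, and the injectable runner are assumptions, not the commands from the attached notes.

```python
import subprocess


def run_with_warmup(cmd, warmup=3, measured=5, runner=None):
    """Run cmd (warmup + measured) times and return only the measured outputs.

    The warm-up runs keep the CPU busy so that cpufreq has a chance to
    raise the clock before any result is recorded.
    """
    if runner is None:
        # Default runner: execute the command and capture its stdout.
        def runner(c):
            return subprocess.run(
                c, capture_output=True, text=True, check=True
            ).stdout

    results = []
    for i in range(warmup + measured):
        out = runner(cmd)
        if i >= warmup:  # discard the warm-up iterations
            results.append(out)
    return results
```

It would be called with the benchmark command line as a list of arguments. The run counts are arbitrary and would need tuning per board.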
I've put together a small script that parses the manual notes for plotting using gnuplot. That produced the attached charts, which are quite interesting.
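A parsing step of this kind might look like the sketch below. The note format is purely an assumption for illustration (one "system config algorithm Mbyte/s" record per line, with # comments). The actual notes and script are the attached ones and may look quite different.

```python
from collections import defaultdict


def parse_notes(text):
    """Parse hypothetical 'system config algorithm mbyte_per_s' lines."""
    table = defaultdict(dict)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        system, config, algorithm, speed = line.split()
        table[(algorithm, system)][config] = float(speed)
    return table


def gnuplot_rows(table, algorithm, configs):
    """Return whitespace-separated rows for one algorithm.

    The first row is a header; each following row holds a system name
    and one speed per config, in the order given by configs.
    """
    rows = ["system " + " ".join(configs)]
    for (algo, system), speeds in sorted(table.items()):
        if algo != algorithm:
            continue
        # "?" is a placeholder for a missing measurement (hypothetical
        # choice; gnuplot would need to be told about it, for example
        # with: set datafile missing "?")
        cells = [str(speeds.get(c, "?")) for c in configs]
        rows.append(system + " " + " ".join(cells))
    return rows
```

Writing the returned rows to a file, one per line, gives a table with one column per build configuration, which suits a clustered histogram per system.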
Thanks for investigating. So from these charts, it looks like the single-block Neon code is of no benefit on any of the test systems, and is even significantly slower on the tinkerboard and rpi4.
If that's right, the code should probably just be deleted. But I'll have to do a little benchmarking on my own before doing that.
If these numbers are correct, it would seem that gcc got a *lot* better in optimising for ARM in recent versions. And ARM seems to have continuously improved native ARM instruction performance but NEON has been stagnant.
Interesting.
What confuses me is that the arm, armv6 and neon routines all give approximately the same speed. I'd have expected some visible difference there. Maybe I'm still just doing something wrong here?
If you look specifically at salsa20 and chacha performance, there's no arm or armv6 assembly, so arm, armv6 and noasm should all use the C implementation, while neon will run different code (unless something is highly messed up in the config).
Regards, /Niels