Re: Release of Nettle-3.7?

1 Jan 2021

Michael Weiser michael.weiser@gmx.de writes:
...
Happy new year, Niels and all around,
On Wed, Dec 30, 2020 at 09:12:24PM +0100, Niels Möller wrote:
...
...
It comes out at around seven cycles per block slowdown for chacha-3core
and five for salsa20-2core. I trace this to vst1.8. It's just slower
Thanks for investigating. Maybe keep some IF_BE / IF_LE just for the
store instructions, to stay with vstm on little-endian?
Sounds good. I'll try to finalise a patch and reconfirm that there's no
speed regression from it.
Sounds good!
...
With the help of Jeff I've gone on a bit of a benchmark binge using a:

Raspberry Pi 1B (Broadcom BCM2835, arm11),
Cubieboard2 (Allwinner A20, Cortex-A7),
Wandboard (Freescale i.MX6 DualLite, Cortex-A9),
Tinkerboard (Rockchip RK3288, Cortex-A17) and
Raspberry Pi 4 (Broadcom BCM2711, Cortex-A72).

The rpi1b doesn't do NEON, so there are no numbers for that. I booted the
rpi4 with Ubuntu 20.04 armhf with arm32 kernel and userland to avoid any
influence of switches from/to 64bit mode. Some other metrics of the
systems (such as compiler) and the build commands used are in the
attached result notes. The Debian and Ubuntu systems had cpufreq
activated. Since I didn't want to mess with that, I ran the benchmark
multiple times in a loop to get cpufreq to scale up.
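
The warm-up approach described above (repeat the benchmark and only trust the later runs, once cpufreq has scaled up) could be sketched roughly as follows. This is a minimal illustration, not the loop that was actually used; the function names, the run counts and the command in the usage note are assumptions.

```python
import subprocess


def run_once(cmd):
    """Run cmd once and return its standard output as text."""
    done = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return done.stdout


def run_after_warmup(cmd, warmup_runs=3, measured_runs=3, runner=run_once):
    """Run cmd warmup_runs times, discard that output, then collect
    the output of measured_runs further runs.

    The warm-up runs exist only to keep the CPU busy long enough for
    the frequency governor to ramp up; how many are needed is an
    assumption and depends on the governor and the board.
    """
    for _ in range(warmup_runs):
        runner(cmd)  # output deliberately discarded
    return [runner(cmd) for _ in range(measured_runs)]
```

A call might look like `run_after_warmup(["./examples/nettle-benchmark", "salsa20"])`, where the path and argument are assumed for illustration and should be adjusted to the actual build tree.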
I've put together a small script that parses the manual notes for
plotting using gnuplot. That produced the attached charts, which are
quite interesting.
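
For readers who want to do something similar, a notes-to-gnuplot step could look roughly like the sketch below. It is not the script mentioned above: the notes format (one `system variant algorithm speed` record per line, `#` for comments) and all function names are assumptions made for illustration.

```python
def parse_notes(text):
    """Parse hypothetical note lines of the form
    'system variant algorithm speed' into a dictionary keyed by
    (system, variant, algorithm). Blank lines and '#' comments are skipped.
    """
    results = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        system, variant, algorithm, speed = line.split()
        results[(system, variant, algorithm)] = float(speed)
    return results


def gnuplot_table(results, algorithm, variants):
    """Build a whitespace-separated table for one algorithm:
    one row per system, one column per build variant.

    '?' marks a missing value; gnuplot has to be told about it
    with: set datafile missing "?"
    """
    systems = sorted({s for (s, _, a) in results if a == algorithm})
    lines = ["# system " + " ".join(variants)]
    for system in systems:
        cells = []
        for variant in variants:
            value = results.get((system, variant, algorithm))
            cells.append("?" if value is None else f"{value:g}")
        lines.append(system + " " + " ".join(cells))
    return "\n".join(lines) + "\n"
```

Written to a file, such a table can be drawn as a clustered histogram in gnuplot with `using 2:xtic(1)`, `using 3` and so on, one column per build variant.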
Thanks for investigating. So from these charts, it looks like the
single-block Neon code is of no benefit on any of the test systems, and
is even significantly slower on the tinkerboard and rpi4.
If that's right, the code should probably just be deleted. But I'll have
to do a little benchmarking on my own before doing that.
...
If these numbers are correct, it would seem that gcc got a *lot* better
at optimising for ARM in recent versions. ARM also seems to have
continuously improved native ARM instruction performance, while NEON
has been stagnant.
Interesting.
...
What confuses me is that the arm, armv6 and neon routines all give
approximately the same speed. I'd have expected some visible difference
there. Maybe I'm still just doing something wrong here?
If you look specifically at salsa20 and chacha performance, there's no
arm or armv6 assembly, so arm, armv6 and noasm should all use the C
implementation, while neon will run different code (unless something is
badly messed up in the config).
Regards,
/Niels
-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
