Re: Release of Nettle-3.7?

1 Jan 2021


      Happy new year, Niels and all around,
On Wed, Dec 30, 2020 at 09:12:24PM +0100, Niels Möller wrote:
...
...
It comes out at around seven cycles per block slowdown for chacha-3core
and five for salsa20-2core. I trace this to vst1.8. It's just slower
Thanks for investigating. Maybe keep some IF_BE / IF_LE just for the
store instructions, to stay with vstm on little-endian?
Sounds good. I'll try to finalise a patch and reconfirm that there's no
speed regression from it.
...
...
Does this seem reasonable or does it point to some flaw in my
benchmarking or system software/hardware?
That's unexpected. In principle I guess it's possible for the C compiler
to generate great vectorized code, but that seems a bit unlikely. Do you
get the same results if you build Nettle-3.6?
With the help of Jeff I've gone on a bit of a benchmark binge using a:
- Raspberry Pi 1B (Broadcom BCM2835, arm11),
- Cubieboard2 (Allwinner A20, Cortex-A7),
- Wandboard (Freescale i.MX6 DualLite, Cortex-A9),
- Tinkerboard (Rockchip RK3288, Cortex-A17) and
- Raspberry Pi 4 (Broadcom BCM2711, Cortex-A72).
The rpi1b doesn't do NEON, so there are no numbers for that. I booted the
rpi4 with Ubuntu 20.04 armhf with arm32 kernel and userland to avoid any
influence of switches from/to 64bit mode. Some other metrics of the
systems (such as compiler) and the build commands used are in the
attached result notes. The Debian and Ubuntu systems had cpufreq
activated. Since I didn't want to mess with that, I ran the benchmark
multiple times in a loop to get cpufreq to scale up.
I've put together a small script that parses the manual notes for
plotting using gnuplot. That produced the attached charts, which are
quite interesting.
t=$(mktemp) ; cat nettle-arm-bench.txt | python3 nettle-arm-bench.py >$t ; gnuplot -e "set term pngcairo font 'sans,9' size 960, 540; set style data histograms; set ylabel 'cycles/block'; set xtics rotate out; set style fill solid border; set style histogram clustered; plot for [COL=2:5] '$t' using COL:xticlabels(1) title columnheader;" >nettle-arm-bench-chart.png ; rm -f "$t"
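For readers without the attachment, here is a minimal sketch of the kind of table the gnuplot command above expects: a header row for `title columnheader`, the board name in column 1 for `xticlabels(1)`, and one column per implementation. It is not the actual nettle-arm-bench.py. The input format (`<board> <implementation> <cycles/block>` per line, `#` for comments), the function name `notes_to_table` and the `NaN` placeholder are all assumptions made for illustration.

```python
def notes_to_table(lines):
    """Turn '<board> <impl> <cycles>' lines into a whitespace-separated table.

    Hypothetical input format; the real notes file may look different.
    """
    impls = []   # column order, in order of first appearance
    rows = {}    # board -> {impl: cycles per block}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        board, impl, cycles = line.split()
        if impl not in impls:
            impls.append(impl)
        rows.setdefault(board, {})[impl] = float(cycles)

    # Header row first, then one row per board.
    table = ["board " + " ".join(impls)]
    for board, values in rows.items():
        cells = [str(values[i]) if i in values else "NaN" for i in impls]
        table.append(board + " " + " ".join(cells))
    return table
```

Printing `"\n".join(notes_to_table(sys.stdin))` would give output in the shape the one-liner pipes into gnuplot. Boards with no measurement for an implementation (such as NEON on the rpi1b) get the placeholder instead of a number.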
...
...
From ChangeLog comments, it seems I got 45% speedup for Salsa20,
compared to the C implementation, when I wrote the original neon
assembly code. At the time, benchmarked on a pandaboard (cortex a9), if
I remember correctly.
I've disassembled an example of what the C compiler produces (I think
chacha-core-internal.o) and there were no NEON instructions in there. At
first glance it looked very similar to the armv6 assembler code.
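A check like this can be roughly automated over the text output of `objdump -d`. The sketch below is only a heuristic, and the name `neon_lines` is hypothetical: it flags lines that mention a q register or a vldN/vstN element load/store, so it would miss NEON instructions that only touch d registers.

```python
import re

# Heuristic only: q0..q15 operands, or vld1..vld4 / vst1..vst4 mnemonics
# (optionally with a size suffix such as ".8"). vstm/vldm do not match.
NEON_HINT = re.compile(r"\b(?:q(?:1[0-5]|[0-9])|v(?:ld|st)[1-4](?:\.\d+)?)\b")

def neon_lines(disassembly):
    """Return the disassembly lines that look like NEON instructions."""
    return [line for line in disassembly.splitlines() if NEON_HINT.search(line)]
```

An empty result for a given object file would be consistent with the compiler having emitted no NEON code, within the limits of the heuristic.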
BTW: The compilers default to their respective architecture, so would
produce armv5 code on the rpi1b and armv7 on tinkerboard/wandboard/
cubieboard2/rpi4.
If these numbers are correct, it would seem that gcc got a *lot* better
in optimising for ARM in recent versions. And ARM seems to have
continuously improved native ARM instruction performance but NEON has
been stagnant.
What confuses me is that the arm, armv6 and neon routines all give
approximately the same speed. I'd have expected some visible difference
there. Maybe I'm still just doing something wrong here?
At least the numbers rule out some peculiarity of the Cubieboards or my
Gentoo installation, IMO.
...
Is it for a fat build? If so, it's possible that the fat setup logic
selects the C implementation in this hacked setup (but on the other
hand, I'd guess a fat build may just fail at link time if these files
are removed).
I did not enable fat for nettle 3.6 and explicitly disabled it for
master. I forced selection of specific routines using configure options.
-- 
Thanks,
Michael
