Hello Niels,
On Fri, Jan 01, 2021 at 06:07:14PM +0100, Niels Möller wrote:
>> With the help of Jeff I've gone on a bit of a benchmark binge using a:
>> - Raspberry Pi 1B (Broadcom BCM2835, arm11),
>> - Cubieboard2 (Allwinner A20, Cortex-A7),
>> - Wandboard (Freescale i.MX6 DualLite, Cortex-A9),
>> - Tinkerboard (Rockchip RK3288, Cortex-A17) and
>> - Raspberry Pi 4 (Broadcom BCM2711, Cortex-A72).
> Thanks for investigating. So from these charts, it looks like the single-block Neon code is of no benefit on any of the test systems. And even significantly slower on the tinkerboard and rpi4.
Attached is the new patch that unconditionally switches the loads from vldm to vld1.32 but, for the stores on little-endian, keeps vstm instead of switching to vst1.8.
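To make the load-side change concrete for anyone reading along without the patch: at the C/NEON-intrinsics level it amounts to roughly the following. This is only a sketch, not the actual Nettle assembly, and the function and variable names are made up.

  #include <arm_neon.h>
  #include <stdint.h>

  /* Illustrative sketch only, not the Nettle sources.  vld1q_u32 is the
     intrinsic spelling of vld1.32 {..}, [r]: it loads 32-bit elements
     one at a time, so each lane holds the correct word value on both
     little- and big-endian.  vldm is not element-aware in that sense,
     which is why the unmodified fast paths are broken on big-endian. */
  static inline void
  load_state_words(uint32x4_t x[4], const uint32_t *state)
  {
    x[0] = vld1q_u32(state + 0);
    x[1] = vld1q_u32(state + 4);
    x[2] = vld1q_u32(state + 8);
    x[3] = vld1q_u32(state + 12);
  }

The store side is the analogous trade-off, which is why the attached patch keeps vstm only on LE, where it already produces the expected byte order.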
I've done some additional benchmarks to verify the impact on performance. As before I've used the wandboard, tinkerboard and rpi, plus cubieboard2s in little- and big-endian mode. This time I switched the cpufreq governor of the first three to "performance" to get more stable numbers, which helped noticeably (@Jeff: I switched your boxes back to ondemand afterwards). I also did ten consecutive runs of the benchmark and naively averaged the numbers (see the attached raw data document). With another python script (attached) I then created another chart using gnuplot [1].
This time I've normalised the numbers to percentages, with unmodified master as the reference, to give a clearer indication of very small changes. The first and fourth bar of each group (master and master-no23core) therefore represent 100 percent for the two bars following each of them. The second bar (-unified) shows the values for the attached patch, and the third bar (-unified-full) those for the previous patch which unconditionally used vst1. -no23core again shows performance with chacha-2core and salsa-3core disabled. Since the underlying numbers are cycles per block, values above 100 percent mean slower.
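In case the post-processing is unclear from the description, it really is just averaging and division. Here's a toy sketch in C; the real thing is the attached python script, and the figures below are made up, not taken from any of the boards.

  #include <stdio.h>

  #define RUNS 10

  /* Toy sketch of the post-processing described above: average ten
     benchmark runs and express the result as a percentage of the
     matching baseline (master, or master-no23core for the -no23core
     bars).  All numbers are made up. */
  static double
  average(const double v[RUNS])
  {
    double sum = 0.0;
    for (int i = 0; i < RUNS; i++)
      sum += v[i];
    return sum / RUNS;
  }

  int
  main(void)
  {
    double master[RUNS]  = {1001, 999, 1000, 1002, 998,
                            1000, 1001, 999, 1000, 1000};
    double unified[RUNS] = {1024, 1022, 1023, 1025, 1021,
                            1023, 1024, 1022, 1023, 1023};

    /* Cycles/block relative to master; >100 means slower. */
    printf("master   = 100.0%%\n");
    printf("-unified = %.1f%%\n",
           100.0 * average(unified) / average(master));
    return 0;
  }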
The graph shows the expected slowdown when using vst1 on the cubieboard and wandboard. The slowdown for the big-endian cubieboard (second cluster) can be ignored, because the faster routines on unmodified master are broken there. So the second and third bar just show the performance that has to be sacrificed to get them working, compared to LE.
On the cubieboard, wandboard and tinkerboard there's still a small overhead from the switch to vld1.32, which was not reliably visible in my earlier benchmarks.
What's interesting is that on both the tinkerboard and rpi4 there are also speedups from the switch to vld1.32 and even vst1.8 (the latter also on the wandboard, but only for the likely irrelevant single-core routines). So it seems the performance penalty isn't set in stone and might differ between generations and implementations.
From that point of view, accepting the slight performance hit of vld1.32 while keeping vstm on LE seems the best compromise, at least for the benchmarked set of machines.
Do you have any idea why the wandboard, tinkerboard and rpi4 show speedups with vst1.8 for one algorithm but slowdowns for the other, and even contradict each other on which is which? Does it make sense to dig into that some more, or should we leave it be for now?
[1] t=$(mktemp)
    cat nettle-arm-bench-2.txt | python3 nettle-arm-bench-2.py >$t
    gnuplot -e "set term pngcairo font 'sans,9' size 960, 540; \
        set style data histograms; set ylabel 'cycles/block'; set yrange [98:]; \
        set xtics rotate out; set style fill solid border; \
        set style histogram clustered; \
        plot for [COL=2:7] '$t' using COL:xticlabels(1) title columnheader;" \
      >nettle-arm-bench-chart-2.png
    rm -f "$t"
>> What confuses me is that the arm, armv6 and neon routines all give approximately the same speed. I'd have expected some visible difference.
> If you look specifically at salsa20 and chacha performance, there's no arm or armv6 assembly, so arm, armv6 and noasm should all use the C implementation. While neon will run different code (unless something is
Duh. So the slight differences were most likely due to the native arm assembly memxor routines.