Re: Release of Nettle-3.7?

30 Dec 2020


Michael Weiser <michael.weiser@gmx.de> writes:
...
> It comes out at around seven cycles per block slowdown for chacha-3core
> and five for salsa20-2core. I trace this to vst1.8. It's just slower
> than vstm (in contrast to vldm vs. vld1.32). I managed to save a
> cumulative two cycles by rescheduling instructions so that there's no
> two consecutive vst1.8s which seems to avoid stalls in the pipeline or
> bus access waits (at least on my machine). Element width (8 vs. 32 vs.
> 64) doesn't seem to play into it.

Thanks for investigating. Maybe keep some IF_BE / IF_LE just for the
store instructions, to stay with vstm on little-endian?
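As a rough analogy in C (not the NEON code under discussion), the trade-off can be sketched as a byte-wise store that gives the same memory layout on any host, next to a native-order store that is only correct on little-endian. All function names here are hypothetical, and the mapping to vst1.8 and vstm is an assumption drawn from the thread, not a description of the actual assembly.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Store a 32-bit word least significant byte first. The memory layout
   is the same on little- and big-endian hosts. */
static void
store_le32_bytewise(uint8_t *dst, uint32_t w)
{
  dst[0] = (uint8_t) (w & 0xff);
  dst[1] = (uint8_t) ((w >> 8) & 0xff);
  dst[2] = (uint8_t) ((w >> 16) & 0xff);
  dst[3] = (uint8_t) ((w >> 24) & 0xff);
}

/* Store the word in host byte order. This matches the byte-wise store
   only on a little-endian host. */
static void
store_native32(uint8_t *dst, uint32_t w)
{
  memcpy(dst, &w, sizeof(w));
}

/* Returns 1 on a little-endian host, 0 otherwise. */
static int
host_is_little_endian(void)
{
  const uint16_t probe = 1;
  uint8_t first;
  memcpy(&first, &probe, 1);
  return first == 1;
}

/* Keep the native store where it is correct and fall back to the
   endian-neutral store elsewhere (the idea behind an IF_LE / IF_BE
   split, here decided at run time instead of at build time). */
static void
store_le32(uint8_t *dst, uint32_t w)
{
  assert(dst != NULL);
  if (host_is_little_endian())
    store_native32(dst, w);
  else
    store_le32_bytewise(dst, w);
}
```

In the assembly the choice would be made when the file is preprocessed, so the little-endian build would pay no run-time cost for it.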
...
> (BTW: Am I using the benchmark correctly, particularly the frequency
> parameter?)

I think it's right. But it's a floating point number, so -f 1e9 for 1
GHz should work too.
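To illustrate why a floating-point frequency accepts both spellings, here is a minimal sketch of parsing such a value and turning a timing into cycles per block. The helpers parse_frequency and cycles_per_block are hypothetical and are not taken from the benchmark program; the sketch only assumes the frequency is given in Hz.

```c
#include <assert.h>
#include <stdlib.h>

/* Parse a clock frequency in Hz from text such as "1e9" or
   "1000000000". Returns 0.0 if the text is not a positive number. */
static double
parse_frequency(const char *s)
{
  char *end;
  double f;

  assert(s != NULL);
  f = strtod(s, &end);
  if (end == s || *end != '\0' || f <= 0.0)
    return 0.0;
  return f;
}

/* Convert a measured run time into cycles per block, given how many
   blocks were processed and the clock frequency in Hz. */
static double
cycles_per_block(double seconds, double blocks, double frequency_hz)
{
  return seconds * frequency_hz / blocks;
}
```

With these definitions, "1e9" and "1000000000" parse to the same value, and one second spent on 1e7 blocks at 1 GHz works out to 100 cycles per block.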
...
> Alignment can't be used to improve
> performance: The tests immediately bus error when giving a :64
> alignment hint to vst1.8.

Unfortunately, I'm not aware of any nice and portable way to enforce
alignment from the calling C code.
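Short of enforcing alignment, code can at least test for it at run time and pick a path accordingly. The sketch below shows the usual check; is_aligned is a hypothetical helper, and the conversion from pointer to uintptr_t is implementation-defined, which is part of why this does not count as a nice and portable solution.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns nonzero if p is a multiple of `alignment` bytes.
   `alignment` must be a power of two. Relies on the common, but
   implementation-defined, mapping of pointers to integers. */
static int
is_aligned(const void *p, size_t alignment)
{
  assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
  return ((uintptr_t) p & (alignment - 1)) == 0;
}
```

A caller that owns its buffer could also request alignment with C11 `_Alignas(8)`, but a library function that receives an arbitrary pointer cannot rely on its callers doing so.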
...
> Do you (or anybody else) have a hardware arm board for testing, possibly
> with a Cortex A8 or A9 implementation to see how it behaves there?

I have access to the GMP test systems on
https://gmplib.org/devel/testsystems, but little time to benchmark
things in the near future.
...
> I've got one side-track which might point to some peculiarity of my
> machine: The unmodified assembler code *without* chacha-3core and
> salsa20-2core (files moved out of the way before configure) is no faster
> or even slower than what the C compiler produces:
> [...]
...
> Does this seem reasonable or does it point to some flaw in my
> benchmarking or system software/hardware?

That's unexpected. In principle I guess it's possible for the C compiler
to generate great vectorized code, but that seems a bit unlikely. Do you
get the same results if you build Nettle-3.6?
...
From ChangeLog comments, it seems I got 45% speedup for Salsa20,
compared to the C implementation, when I wrote the original neon
assembly code. At the time, benchmarked on a pandaboard (cortex a9), if
I remember correctly.
Is it for a fat build? If so, it's possible that the fat setup logic
selects the C implementation in this hacked setup (but on the other
hand, I'd guess a fat build may just fail at link time if these files
are removed).
Regards,
/Niels
-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
