Re: Release of Nettle-3.7?

1 Jan 2021


      nisse@lysator.liu.se (Niels Möller) writes:
...
Thanks for investigating. So from these charts, it looks like the
single-block Neon code is of no benefit on any of the test systems. And
even significantly slower on the tinkerboard and rpi4.

If that's right, the code should probably just be deleted. But I'll have
to do a little benchmarking on my own before doing that.

I've done a benchmark run of nettle-3.6 on the GMP "nanot2" system, with
a Cortex-A9 processor. The installed compiler is gcc-5.4 (a few years
old). This is what I get:
nisse@nanot2:~/build$ nettle-nanot2-noasm/config.status --version
nettle config.status 3.6
configured by /home/nisse/hack/nettle-3.6/configure, generated by GNU Autoconf 2.69,
  with options "'--disable-shared' '--disable-assembler'"
nisse@nanot2:~/build$ nettle-nanot2-noasm/examples/nettle-benchmark -f 1.4e9 salsa20
benchmark call overhead: 0.006500 us   9.10 cycles
         Algorithm         mode Mbyte/s cycles/byte cycles/block
           salsa20      encrypt   78.52       17.00      1088.22
           salsa20      decrypt   78.52       17.00      1088.22
        salsa20r12      encrypt  111.62       11.96       765.57
        salsa20r12      decrypt  111.62       11.96       765.57
nisse@nanot2:~/build$ nettle-nanot2-noasm/examples/nettle-benchmark -f 1.4e9 chacha
benchmark call overhead: 0.006500 us   9.10 cycles
         Algorithm         mode Mbyte/s cycles/byte cycles/block
            chacha      encrypt   66.21       20.17      1290.57
            chacha      decrypt   66.21       20.17      1290.57
-------------
nisse@nanot2:~/build$ nettle-nanot2-neon/config.status --version
nettle config.status 3.6
configured by /home/nisse/hack/nettle-3.6/configure, generated by GNU Autoconf 2.69,
  with options "'--disable-shared' '--enable-arm-neon'"
nisse@nanot2:~/build$ nettle-nanot2-neon/examples/nettle-benchmark -f 1.4e9 salsa20
benchmark call overhead: 0.006450 us   9.03 cycles
         Algorithm         mode Mbyte/s cycles/byte cycles/block
           salsa20      encrypt   74.41       17.94      1148.38
           salsa20      decrypt   74.41       17.94      1148.38
        salsa20r12      encrypt  113.56       11.76       752.44
        salsa20r12      decrypt  113.56       11.76       752.44
nisse@nanot2:~/build$ nettle-nanot2-neon/examples/nettle-benchmark -f 1.4e9 chacha
benchmark call overhead: 0.006438 us   9.01 cycles
         Algorithm         mode Mbyte/s cycles/byte cycles/block
            chacha      encrypt   75.12       17.77      1137.44
            chacha      decrypt   75.12       17.77      1137.44

So no big differences: the neon code improves performance for chacha
(about 13%) and slightly for salsa20r12 (about 2%), and degrades
performance slightly for salsa20 (about 5%).
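
For anyone wanting to redo the arithmetic, here is a small sketch that recomputes the relative change from the Mbyte/s columns above and cross-checks the cycles/byte column against the 1.4e9 clock passed with -f. The helper names are made up for illustration, and treating 1 Mbyte as 2**20 bytes is an assumption (it is the value that reproduces the reported cycles/byte figures to within rounding).

```python
# Throughput figures (Mbyte/s) copied from the nettle-benchmark runs above.
CLOCK_HZ = 1.4e9    # value passed to nettle-benchmark -f
MBYTE = 2 ** 20     # assumption: 1 Mbyte = 2**20 bytes
BLOCK_BYTES = 64    # Salsa20 and ChaCha both use 64-byte blocks

noasm = {"salsa20": 78.52, "salsa20r12": 111.62, "chacha": 66.21}
neon = {"salsa20": 74.41, "salsa20r12": 113.56, "chacha": 75.12}


def cycles_per_byte(mbyte_per_s, clock_hz=CLOCK_HZ):
    """Convert a throughput in Mbyte/s to cycles per byte."""
    return clock_hz / (mbyte_per_s * MBYTE)


def relative_change(before, after):
    """Percentage change in throughput going from 'before' to 'after'."""
    return (after - before) / before * 100.0


for name in noasm:
    cpb = cycles_per_byte(noasm[name])
    change = relative_change(noasm[name], neon[name])
    print(f"{name:>10}: C {cpb:6.2f} cycles/byte, "
          f"{cpb * BLOCK_BYTES:8.2f} cycles/block, neon {change:+.1f}%")
```

A positive percentage means the neon build is faster than the plain C build for that algorithm.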

I had a quick look at the disassembly of the C implementations, and they
use a fair amount of loads and stores to the stack in the loop (since
there are too few general purpose registers for the state to fit). But
maybe the code is scheduled well enough that many instructions can be
executed in parallel. Compare that to the neon code, which does more
work per instruction, but with dependencies forcing sequential execution
of the instructions.
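
To make the trade-off concrete, here is a rough sketch using the ChaCha quarter-round as specified in RFC 8439. It is plain Python for illustration only, not Nettle's C or Neon code, and all function names are hypothetical. The scalar version works on four words of the 16-word state at a time, so the four column quarter-rounds are independent of each other. The lane version mimics a SIMD formulation: each operation handles four columns at once, but every step needs the result of the step before it.

```python
MASK32 = 0xFFFFFFFF


def rotl32(x, n):
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) | (x >> (32 - n))) & MASK32


def quarter_round(a, b, c, d):
    """ChaCha quarter-round on four 32-bit words (RFC 8439, section 2.1)."""
    a = (a + b) & MASK32
    d = rotl32(d ^ a, 16)
    c = (c + d) & MASK32
    b = rotl32(b ^ c, 12)
    a = (a + b) & MASK32
    d = rotl32(d ^ a, 8)
    c = (c + d) & MASK32
    b = rotl32(b ^ c, 7)
    return a, b, c, d


def column_round_scalar(state):
    """Four independent quarter-rounds, one per column of the 4x4 state."""
    s = list(state)
    for i in range(4):
        s[i], s[4 + i], s[8 + i], s[12 + i] = quarter_round(
            s[i], s[4 + i], s[8 + i], s[12 + i])
    return s


def add_lanes(x, y):
    """Lane-wise 32-bit addition of two 4-word rows."""
    return [(p + q) & MASK32 for p, q in zip(x, y)]


def xor_rot_lanes(x, y, n):
    """Lane-wise xor followed by a left rotation by n bits."""
    return [rotl32(p ^ q, n) for p, q in zip(x, y)]


def column_round_lanes(state):
    """Same column round, SIMD style: four lanes per operation, but the
    operations form a single chain where each one waits for the previous."""
    a, b, c, d = (list(state[i:i + 4]) for i in range(0, 16, 4))
    a = add_lanes(a, b)
    d = xor_rot_lanes(d, a, 16)
    c = add_lanes(c, d)
    b = xor_rot_lanes(b, c, 12)
    a = add_lanes(a, b)
    d = xor_rot_lanes(d, a, 8)
    c = add_lanes(c, d)
    b = xor_rot_lanes(b, c, 7)
    return a + b + c + d
```

Both functions compute the same result. In the scalar form a compiler has four independent chains to interleave, at the cost of keeping 16 state words live, which is more than the general purpose registers available on 32-bit ARM. In the lane form the whole state fits in four vector registers, but there is only one chain to schedule.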
Regards,
/Niels
-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
