Hello Niels,
On Fri, Dec 25, 2020 at 10:48:19PM +0100, Niels Möller wrote:
> Since we have plenty of registers available, (including r3 which seems unused and free to clobber), I'd suggest using
>   define(`SRCp32', `r3')
> and an
>   add SRCp32, SRC, #32
> in function entry, and then leave both SRC and SRCp32 unmodified for the rest of the function.
I've done that and according to nettle-benchmark it saves one to two cycles per block compared to the mov+postincrement approach.
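For illustration, the resulting pattern looks roughly like this in nettle's m4 assembly style (register assignments and load widths are my own illustrative choices, not taken verbatim from the patch):

```
define(`SRC', `r1')	C source pointer argument (illustrative)
define(`SRCp32', `r3')	C r3 is free to clobber per the AAPCS

	C On function entry, set up the second pointer once:
	add	SRCp32, SRC, #32

	C Both pointers then stay unmodified for the whole function:
	vld1.32	{q0,q1}, [SRC]		C words 0..7 of the 64-byte block
	vld1.32	{q2,q3}, [SRCp32]	C words 8..15
	C ...
	vld1.32	{q12,q13}, [SRC]	C later reload of the same data,
	vld1.32	{q14,q15}, [SRCp32]	C no post-increment bookkeeping
```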
As expected, all the special treatment of transposed operands can simply go away because the transposition no longer happens. Also, vld1.32 (for sequential loads of 32-bit operands in host endianness) and vst1.8 (for sequential stores of register contents, giving an implicit little-endian store without any vrev32.u8) work the same on LE as well as BE.
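Spelled out for the store side (DST is an illustrative name; this mirrors what the patch relies on rather than being a verbatim excerpt):

```
	C q0 holds 32-bit words produced by host-endian arithmetic.
	C A BE build would otherwise need an explicit byte swap:
	C   vrev32.u8  q0, q0
	C   vst1.32    {q0}, [DST]
	C vst1.8 writes the 16 bytes in little-endian element order
	C on both LE and BE, so one instruction serves both:
	vst1.8	{q0}, [DST]
```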
> Neat. Use of vst1.8 is worth a comment in the code (and/or arm/README).
I added those where it seemed to make sense. It was already in the README but I've extended it a bit with the new findings.
>> Option 2: By coincidence I found that vldm/vstm can work with s registers originally intended for use with VFP. They're just a different
> That sounds a bit complicated, and since there's no great benefit over vld1, maybe best to stay away from that?
Also, interestingly, when I use vldm to s regs wherever possible (see second attached patch), it doesn't give any speedup. It saves the scratch register in all routines I've touched, though. In general, it seems that add+2*vld1.32 is exactly the same number of cycles as the equivalent vldm.
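As a sketch of the equivalence (register numbers illustrative; s0-s7 alias d0-d3, i.e. q0-q1):

```
	C Variant 1: scratch pointer plus two element loads:
	add	r3, SRC, #16
	vld1.32	{q0}, [SRC]
	vld1.32	{q1}, [r3]

	C Variant 2: a single vldm to the aliased VFP s registers
	C loads the same eight 32-bit words host-endian, without the
	C scratch register, at the same observed cycle count:
	vldm	SRC, {s0-s7}
```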
>> Switch arm neon assembler routines to endianness-agnostic loads and stores where possible to avoid modifications to the rest of the code. This involves switching to vld1.32 for loading consecutive 32-bit words in host endianness as well as vst1.8 for storing back to memory in little-endian order as required by the caller.
> I like this approach. It would be nice if you could benchmark it on little-endian, to verify that there's no unexpectedly large speed regression (a regression of just a cycle or two per block, if that's at all measurable, is ok, I think).
It comes out at around seven cycles per block slowdown for chacha-3core and five for salsa20-2core. I trace this to vst1.8: it's simply slower than vstm (in contrast to vldm vs. vld1.32). I managed to save a cumulative two cycles by rescheduling instructions so that no two vst1.8s are consecutive, which seems to avoid pipeline stalls or bus-access waits (at least on my machine). Element width (8 vs. 32 vs. 64) doesn't seem to factor into it. Alignment hints can't be used to improve performance: the testsuite immediately bus-errors when a :64 alignment hint is given to vst1.8.
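The rescheduling amounts to separating the stores with independent work, roughly like this (the surrounding instructions and registers are illustrative, not taken from the actual routines):

```
	C Slower: back-to-back byte stores appear to stall:
	vst1.8	{q0}, [DST]!
	vst1.8	{q1}, [DST]!

	C Faster: an unrelated instruction between the stores:
	vst1.8	{q0}, [DST]!
	vadd.u32	q2, q2, q8	C independent computation
	vst1.8	{q1}, [DST]!
```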
Baseline with --disable-assembler comes in with these numbers on my Cubieboard2 with a 1 GHz Allwinner A20, which is a Cortex-A7 implementation:
[michael@c2-le:~/nettle/build-noasm/examples] LD_LIBRARY_PATH=../.lib ./nettle-benchmark -f 1000000000 chacha salsa20
       Algorithm     mode Mbyte/s cycles/byte cycles/block
          chacha  encrypt   30.43       31.34      2005.82
          chacha  decrypt   30.41       31.36      2006.89
 chacha_poly1305  encrypt   23.57       40.47      2589.77
 chacha_poly1305  decrypt   23.55       40.50      2592.15
 chacha_poly1305   update  104.42        9.13       584.51
         salsa20  encrypt   35.10       27.17      1738.73
         salsa20  decrypt   35.10       27.17      1738.75
      salsa20r12  encrypt   50.12       19.03      1217.75
      salsa20r12  decrypt   50.15       19.01      1216.93
(BTW: Am I using the benchmark correctly, particularly the frequency parameter?)
Baseline unmodified assembler routines (without --enable-fat) come in at:
[michael@c2-le:~/nettle/build-orig/examples] LD_LIBRARY_PATH=../.lib ./nettle-benchmark -f 1000000000 chacha salsa20
       Algorithm     mode Mbyte/s cycles/byte cycles/block
          chacha  encrypt   63.06       15.12       967.83
          chacha  decrypt   63.06       15.12       967.82
 chacha_poly1305  encrypt   39.18       24.34      1557.72
 chacha_poly1305  decrypt   39.18       24.34      1557.96
 chacha_poly1305   update  104.38        9.14       584.75
         salsa20  encrypt   62.15       15.34       982.04
         salsa20  decrypt   62.07       15.36       983.33
      salsa20r12  encrypt   92.69       10.29       658.48
      salsa20r12  decrypt   92.70       10.29       658.43
Attached unified code (patch 0001) comes in like this:
[michael@c2-le:~/nettle/build-unified-add/examples] LD_LIBRARY_PATH=../.lib ./nettle-benchmark -f 1000000000 chacha salsa20
       Algorithm     mode Mbyte/s cycles/byte cycles/block
          chacha  encrypt   62.61       15.23       974.79
          chacha  decrypt   62.62       15.23       974.72
 chacha_poly1305  encrypt   39.14       24.36      1559.28
 chacha_poly1305  decrypt   39.18       24.34      1558.00
 chacha_poly1305   update  103.65        9.20       588.88
         salsa20  encrypt   61.80       15.43       987.65
         salsa20  decrypt   61.81       15.43       987.51
      salsa20r12  encrypt   91.88       10.38       664.30
      salsa20r12  decrypt   91.91       10.38       664.07
What's nice is that the same code gives very consistent numbers on BE (no idea what's going on with poly1305 though):
[michael@c2-be:~/nettle/build-unified-add/examples] LD_LIBRARY_PATH=../.lib ./nettle-benchmark -f 1000000000 chacha salsa20
       Algorithm     mode Mbyte/s cycles/byte cycles/block
          chacha  encrypt   62.56       15.25       975.69
          chacha  decrypt   62.62       15.23       974.68
 chacha_poly1305  encrypt   38.40       24.83      1589.32
 chacha_poly1305  decrypt   38.40       24.83      1589.38
 chacha_poly1305   update   99.92        9.54       610.86
         salsa20  encrypt   61.80       15.43       987.58
         salsa20  decrypt   61.81       15.43       987.41
      salsa20r12  encrypt   91.90       10.38       664.14
      salsa20r12  decrypt   91.93       10.37       663.92
As mentioned, the second patch (switching back to vldm via s registers where possible) doesn't change these numbers at all (but saves a register).
(What's nice about my boards is that, due to missing power-saving and frequency-scaling functionality, they give very, very consistent numbers across multiple runs.)
My first reflex is that 400 Kbyte/s for chacha and 350 Kbyte/s for salsa20 is relevant enough to either keep separate implementations for LE and BE in the code *or* dig deeper into why vst1.8 is so much slower.
Do you (or anybody else) have a hardware arm board for testing, possibly with a Cortex A8 or A9 implementation to see how it behaves there?
I have a couple of RasPis and little- and big-endian pine64s (aarch64) gathering dust in a box which I could fire up for some testing (not sure about 32-bit support on the pine64s, though).
> and reuse SRCp32 for the second load of the same data, further down (assuming r3 really is free to use for this purpose; if we have to save
I read the AAPCS as saying that r3 can be used as a scratch register between subroutine calls. Since we don't make any subroutine calls, its use should be fine.
I've got one side-track which might point to some peculiarity of my machine: the unmodified assembler code *without* chacha-3core and salsa20-2core (files moved out of the way before configure) is no faster than, and sometimes even slower than, what the C compiler produces:
[michael@c2-le:~/nettle/build-no23core/examples] LD_LIBRARY_PATH=../.lib ./nettle-benchmark -f 1000000000 chacha salsa20
       Algorithm     mode Mbyte/s cycles/byte cycles/block
          chacha  encrypt   31.35       30.42      1946.66
          chacha  decrypt   31.34       30.43      1947.30
 chacha_poly1305  encrypt   24.10       39.57      2532.24
 chacha_poly1305  decrypt   24.10       39.57      2532.21
 chacha_poly1305   update  104.42        9.13       584.53
         salsa20  encrypt   30.38       31.39      2008.96
         salsa20  decrypt   30.39       31.38      2008.34
      salsa20r12  encrypt   47.00       20.29      1298.56
      salsa20r12  decrypt   47.01       20.29      1298.25
Does this seem reasonable, or does it point to some flaw in my benchmarking or system software/hardware? (I've done my best using gdb to verify that the asm routines are in use. Unfortunately, nettle-benchmark resists attempts to ltrace or gdb-debug it, so I diagnosed the testsuite tests instead.)