Re: Release of Nettle-3.7?

21 Dec 2020


Hello Niels,
On Sat, Dec 19, 2020 at 09:51:45AM +0100, Niels Möller wrote:
...
...
Porting over the basic
IF_[LB]E mechanism from chacha-core-internal was easy and fixed up the
first of the three interleaved blocks right away. For the other two I am
still in the process of wrapping my head around how the interleaving
works and how it would need some adjustment for BE.
The 3-way functions don't do anything fancy, just each of the three
blocks represented in separate registers, and same instruction sequence
as for the 1-way version, duplicated three times and interleaved.
I've got the tests passing for chacha now. Apart from the
straightforward porting-over of the BE shift and reverse-on-store logic
from chacha-core-internal.asm, special treatment is necessary for the
part of the state that's treated as a 64-bit counter. The two 32-bit
words it consists of are in host endianness but consecutive order. So
they get reversed by the BE load. This is actually the case for all
32-bit operands throughout the routine on BE (and for
chacha-core-internal also) and cancels itself out on the final store.
But for the 64-bit counter it needs to be taken into account for the
addition to produce correct results.
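
To make the counter issue concrete, here is a small Python model. It is not Nettle code: the helper names are made up for illustration, a d register is treated as a plain 64-bit integer, and the counter is assumed to sit in memory as two host-endian 32-bit words with the low word first, as described above.

```python
import struct

MASK64 = (1 << 64) - 1

def store_words(lo, hi, byteorder):
    """Lay out the counter as two host-endian 32-bit words, low word first."""
    return struct.pack("<II" if byteorder == "little" else ">II", lo, hi)

def load_dword(buf, byteorder):
    """Model a 64-bit register load of those 8 bytes."""
    return int.from_bytes(buf, byteorder)

def swap_words(x):
    """Exchange the two 32-bit halves of a 64-bit value."""
    return ((x << 32) | (x >> 32)) & MASK64

def increment(lo, hi, byteorder, fix_be=True):
    """Add 1 to the counter via a 64-bit add, optionally fixing up BE."""
    reg = load_dword(store_words(lo, hi, byteorder), byteorder)
    if byteorder == "big" and fix_be:
        reg = swap_words(reg)      # move the low word into the low half
    reg = (reg + 1) & MASK64
    if byteorder == "big" and fix_be:
        reg = swap_words(reg)      # back to the layout of the other operands
    out = reg.to_bytes(8, byteorder)
    return struct.unpack("<II" if byteorder == "little" else ">II", out)

print(increment(0xFFFFFFFF, 0, "little"))
print(increment(0xFFFFFFFF, 0, "big"))
print(increment(0xFFFFFFFF, 0, "big", fix_be=False))
```

Without the two swaps the big-endian case increments the wrong half, so the carry out of the low word is lost.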
See the attached patch for my current approach to fixing it, which is
explicit transposing, adding and then transposing again to be as
transposed as the other operands. I wonder if the surrounding C code
could be changed to supply that part of the state as a 64-bit doubleword
in host endianness to the assembler routine to cut down on adjustment.
Alternatively, could the 64-bit operation be broken down into two 32-bit
operations which implicitly adjust to the transposed 32-bit words on BE?
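
As a rough sketch of the second idea in scalar terms only: the function names are hypothetical, the increment is assumed to fit in 32 bits, and whether this maps to an efficient NEON sequence is left open. The 64-bit addition is replaced by a 32-bit add on the low word plus a carry into the high word, which works no matter which half of a register each word ends up in.

```python
MASK32 = 0xFFFFFFFF

def add_counter_2x32(lo, hi, n):
    """Add n (0 <= n < 2**32) to a 64-bit counter held as two 32-bit words."""
    new_lo = (lo + n) & MASK32
    carry = 1 if new_lo < lo else 0    # the low word wrapped around
    new_hi = (hi + carry) & MASK32
    return new_lo, new_hi

def add_counter_64(lo, hi, n):
    """Reference: the same addition done on a single 64-bit value."""
    total = (((hi << 32) | lo) + n) & ((1 << 64) - 1)
    return total & MASK32, total >> 32

for lo, hi, n in [(0, 0, 1), (0xFFFFFFFE, 0, 3), (0xFFFFFFFF, 0xFFFFFFFF, 1)]:
    assert add_counter_2x32(lo, hi, n) == add_counter_64(lo, hi, n)
```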
...
The 2-way version (for ARM, that's salsa only) tries to be a bit more
clever, with registers representing either odd or even words from both
blocks.
For a start this also needs adjustment for the 64-bit counter treatment.
...
Not sure how endianness affects the code to move words around.
The routine "suffers" from the same effect as chacha: The 32-bit input
operands are in host order in memory and their individual values end up
correctly in the registers. But since vldm loads consecutive 64-bit
values, it ends up transposing 32-bit words that comprise the 64-bit
register value. After the initial swap and transpose operations, the X
and Y matrices are basically correctly filled but flipped two ways.
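
The load effect on its own can be modelled in a few lines of Python. This is a simplified model, not the actual routine: `vldm_q` is a made-up helper, lane 0 is taken to be the least significant 32 bits of a d register, and the later swap and transpose steps are not modelled.

```python
import struct

def vldm_q(words, byteorder):
    """Model loading four host-endian 32-bit words as two 64-bit d registers.

    Returns the four 32-bit lanes of the resulting q register, where lane 0
    is the least significant 32 bits of the first d register.
    """
    fmt = "<4I" if byteorder == "little" else ">4I"
    mem = struct.pack(fmt, *words)
    lanes = []
    for off in (0, 8):
        d = int.from_bytes(mem[off:off + 8], byteorder)
        lanes += [d & 0xFFFFFFFF, d >> 32]
    return lanes

print(vldm_q([0, 1, 2, 3], "little"))   # lanes follow memory order
print(vldm_q([0, 1, 2, 3], "big"))      # words swapped within each doubleword
```

Each word keeps its value, but on BE the two words of every doubleword trade lanes.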
I've tried to document what I see in the registers on armeb to get a
handle on how to proceed:
    vtrn.32	X0, Y3		C X0:  0  0  2  2  Y3:  1  1  3  3
    vtrn.32	X1, Y0		C X1:  4  4  6  6  Y0:  5  5  7  7
-	vtrn.32	X2, Y1		C X2:  8  8 10 10  Y1:  9  9  1  1 <- typo?
+	vtrn.32	X2, Y1		C X2:  8  8 10 10  Y1:  9  9 11 11
    vtrn.32	X3, Y2		C X3: 12 12 14 14  Y2: 13 13 15 15
+				C BE:
+				C X0:  3  3  1  1  Y3:  2  2  0  0
+				C X1:  7  7  5  5  Y0:  6  6  4  4
+				C X2: 11 11  9  9  Y1: 10 10  8  8
+				C X3: 15 15 13 13  Y2: 14 14 12 12
        C Swap, to get
        C X0:  0 10  Y0:  5 15
        C X1:  4 14  Y1:  9  3
        C X2:  8  2  Y2: 13  7
        C X3: 12  6  Y3:  1 11
        vswp    D1REG(X0), D1REG(X2)
        vswp    D1REG(X1), D1REG(X3)
        vswp    D1REG(Y0), D1REG(Y2)
        vswp    D1REG(Y1), D1REG(Y3)
+	C BE:
+	C X0: 11  1  Y0: 14  4
+	C X1: 15  5  Y1:  2  8
+	C X2:  3  9  Y2:  6 12
+	C X3:  7 13  Y3: 10  0
I wonder if the code working on them contains some symmetry that could
be exploited to (with minimal changes) get correct results on these
transposed matrices.
Otherwise I wonder if it would be possible for both chacha and salsa to
change the actual loading and storing so there's no transposing of
32-bit operands. I looked at vld4.32 but that does some fancy
de-interleaving and needs two operations to load four q registers.
Otherwise we'd need a lot of vrev64.u32s to basically revert the 32-bit
transposition happening upon load and save to end up with identical
matrices to LE.
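
For reference, the effect of such a reversal on lane order can be sketched as below. This is a lane-level model only, with a hypothetical helper name, and it assumes one reversal per q register after each load and before each store, which is the cost mentioned above.

```python
def vrev64_32(lanes):
    """Model a 32-bit reversal within each 64-bit doubleword of a register."""
    out = []
    for i in range(0, len(lanes), 2):
        out += [lanes[i + 1], lanes[i]]
    return out

be_loaded = [1, 0, 3, 2]            # lane order after a doubleword load on BE
le_like = vrev64_32(be_loaded)      # same lane order as on LE
print(le_like)
print(vrev64_32(le_like))           # applying it again restores the BE layout for the store
```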
-- 
Michael
