Re: Release of Nettle-3.7?

22 Dec 2020


Michael Weiser <michael.weiser@gmx.de> writes:
...
See the attached patch for my current approach to fixing it, which is
explicit transposing, adding and then transposing again to be as
transposed as the other operands.
I haven't yet read the code, but I have some comments based on your
description only.
...
I wonder if the surrounding C code
could be changed to supply that part of the state as a 64-bit doubleword
in host endianness to the assembler routine to cut down on adjustment.
I think it will be a bit cumbersome to change the interface to the C
code.
...
Alternatively, could the 64-bit operation be broken down into two 32-bit
operations which implicitly adjust to the transposed 32-bit words on BE?
Maybe. But we still need to propagate the carry, can that be done in a
better way than transpose, 64-bit add, transpose?
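To make the carry question concrete, here is a small Python model of a
64-bit counter increment done as two 32-bit adds with an explicit carry,
checked against a plain 64-bit add. This is only a sketch, not Nettle
code: the names add64_as_two_32 and add64 are made up for illustration,
and the increment is assumed to fit in 32 bits.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def add64_as_two_32(lo, hi, inc):
    """Add inc (0 <= inc < 2**32) to the counter hi:lo using 32-bit adds only."""
    new_lo = (lo + inc) & MASK32
    # Unsigned overflow of the low word shows up as a result smaller
    # than the original low word.
    carry = 1 if new_lo < lo else 0
    new_hi = (hi + carry) & MASK32
    return new_lo, new_hi

def add64(lo, hi, inc):
    """Reference: one 64-bit add on the combined value."""
    v = (((hi << 32) | lo) + inc) & MASK64
    return v & MASK32, v >> 32

# Compare the two on a few values, including the wrap-around cases.
cases = [(0, 0, 1), (0xFFFFFFFF, 0, 1), (0xFFFFFFFE, 7, 2),
         (0xFFFFFFFF, 0xFFFFFFFF, 1), (5, 7, 0)]
results_match = all(add64_as_two_32(*c) == add64(*c) for c in cases)
```

The model needs a compare and a second add to carry into the high word,
so it is not obvious that this would be cheaper in assembly than
transpose, 64-bit add, transpose.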
...
I've tried to document what I see in the registers on armeb to get a
handle on how to proceed:
vtrn.32	X0, Y3		C X0:  0  0  2  2  Y3:  1  1  3  3
vtrn.32	X1, Y0		C X1:  4  4  6  6  Y0:  5  5  7  7
vtrn.32	X2, Y1		C X2:  8  8 10 10  Y1:  9  9  1  1 <- typo?
vtrn.32	X2, Y1		C X2:  8  8 10 10  Y1:  9  9 11 11

Indeed a typo. I just checked in the fix, thanks!
...
vtrn.32	X3, Y2		C X3: 12 12 14 14  Y2: 13 13 15 15
		C BE:
		C X0:  3  3  1  1  Y3:  2  2  0  0
		C X1:  7  7  5  5  Y0:  6  6  4  4
		C X2: 11 11  9  9  Y1: 10 10  8  8
		C X3: 15 15 13 13  Y2: 14 14 12 12

Also, it's somewhat important to keep track of which block a word
belongs to. In the LE code, X0 really is A0 B0 A2 B2, where A refers to
the first block, and B to the second.
What's the layout before the transpose, immediately after load? I'd
guess you get X0: 1 0 3 2?
For the little endian code, the transpose can be viewed as
  X0:  A0 A1 A2 A3
         /     /    denotes elements swapped.
  Y3:  B0 B1 B2 B3
If instead we start with the order 1 0 3 2, we get the same result (but
with registers swapped) if we do
  Y3:  B1 B0 B3 B2
         \     \
  X0:  A1 A0 A3 A2
So I would expect there's some clever way to get the BE case to work
with about the same number of transpose instructions, even if initial
word order is somewhat different.
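The two diagrams above can be checked with a small Python model of
vtrn.32 on abstract lanes. This is only a sketch: vtrn32 is a made-up
helper, lanes are plain list positions, and the "1 0 3 2" starting order
is my guess from above, not something verified on armeb.

```python
def vtrn32(d, m):
    """Model of vtrn.32 Qd, Qm on two 4-element vectors.

    Swaps d[1] with m[0] and d[3] with m[2]; returns the new (d, m).
    """
    d, m = list(d), list(m)
    d[1], m[0] = m[0], d[1]
    d[3], m[2] = m[2], d[3]
    return d, m

# Little endian: words in natural order right after the load.
x0_le, y3_le = vtrn32(["A0", "A1", "A2", "A3"],
                      ["B0", "B1", "B2", "B3"])

# Assumed big-endian order after the load: adjacent words swapped.
x0_be, y3_be = vtrn32(["A1", "A0", "A3", "A2"],
                      ["B1", "B0", "B3", "B2"])

# Same contents as the LE case, but the two registers have traded places.
registers_swapped = (x0_be == y3_le) and (y3_be == x0_le)
```

In this model X0 ends up as A0 B0 A2 B2 in the LE case, and the same
vector lands in Y3 when starting from the swapped word order.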
...
I wonder if the code working on them contains some symmetry that could
be exploited to (with minimal changes) get correct results on these
transposed matrices.
At least, both blocks are treated equally (except that the initial
counter addition is done to only the second block, and that the final result
is written in the right order). So it doesn't matter if X0 contains A0 B0
A2 B2 or B0 A0 B2 A2. And unlike the one-way code, we only use
vext.32 ... #2
to rotate data between rounds, never #1 or #3.
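Why the rotation count matters can be seen in a small Python model.
This is a sketch on abstract lanes; vext32 and swap_pairs are made-up
helper names, and vext32 models only the case where both source
operands are the same register.

```python
def vext32(v, n):
    """Model of vext.32 with both sources equal: rotate 4 lanes by n."""
    return v[n:] + v[:n]

def swap_pairs(v):
    """Exchange adjacent lanes: [a, b, c, d] -> [b, a, d, c]."""
    return [v[1], v[0], v[3], v[2]]

x = ["A0", "B0", "A2", "B2"]

# Rotating by two lanes commutes with swapping the two blocks in each pair,
# so A0 B0 A2 B2 and B0 A0 B2 A2 stay equivalent across the rotation.
rot2_commutes = vext32(swap_pairs(x), 2) == swap_pairs(vext32(x, 2))

# Rotating by one lane does not commute with that swap.
rot1_commutes = vext32(swap_pairs(x), 1) == swap_pairs(vext32(x, 1))
```

Here rot2_commutes is true and rot1_commutes is false, which is the
property the two-way code relies on.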
...
Otherwise I wonder if it would be possible for both chacha and salsa to
change the actual loading and storing so there's no transposing of
32-bit operands. I looked at vld4.32 but that does some fancy
de-interleaving and needs two operations to load four q registers.
The new powerpc code uses load and store instructions that behave the
same in this respect, for both BE and LE. But not sure if there's any
easy way on ARM. I'm not that familiar with the more special load and
store instructions. Would vst2.32 be useful in some way for the final
store (and vst3.32 for chacha-3core)?
...
Otherwise we'd need a lot of vrev64.u32s to basically revert the 32-bit
transposition happening upon load and save to end up with identical
matrices to LE.
If that's an easier way to get it working, I think it's a good start.
I'd expect that would still give a reasonable speedup over the 1-way
version.
Regards,
/Niels
-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
