Hello Niels,
On Mon, Dec 21, 2020 at 09:16:25PM +0100, Niels Möller wrote:
> What's the layout before the transpose, immediately after load? I'd guess you get X1: 1 0 3 2?
TL;DR: Yes, it is. I abandoned this approach for now though, since I found some options to eliminate the word transposition effect of vldm/vstm in the first place (see below).
Longer story for completeness: It seems I ran afoul of gdb's way of displaying registers in memory endianness again. I knew all this once already.[1] I should likely do this more often than every couple of years. ;)
[1] https://marc.info/?l=nettle-bugs&m=152436948907236&w=2
On LE I get after the initial load:
Breakpoint 1, _nettle_salsa20_2core () at salsa20-2core.s:39
39		vldm	r1, {q0,q1,q2,q3}
(gdb) s
40		adr	r12, .Lcount1
(gdb) i r q0 q1 q2 q3
q0   {u8 = {0x0, 0x0, 0x0, 0x0, 0x1, 0x0, 0x0, 0x0, 0x2, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0}, u16 = {0x0, 0x0, 0x1, 0x0, 0x2, 0x0, 0x3, 0x0}, u32 = {0x0, 0x1, 0x2, 0x3}, u64 = {0x100000000, 0x300000002}, f32 = {0x0, 0x0, 0x0, 0x0}, f64 = {0x0, 0x0}}
q1   {u8 = {0x4, 0x0, 0x0, 0x0, 0x5, 0x0, 0x0, 0x0, 0x6, 0x0, 0x0, 0x0, 0x7, 0x0, 0x0, 0x0}, u16 = {0x4, 0x0, 0x5, 0x0, 0x6, 0x0, 0x7, 0x0}, u32 = {0x4, 0x5, 0x6, 0x7}, u64 = {0x500000004, 0x700000006}, f32 = {0x0, 0x0, 0x0, 0x0}, f64 = {0x0, 0x0}}
q2   {u8 = {0xff, 0xff, 0xff, 0xff, 0x9, 0x0, 0x0, 0x0, 0xa, 0x0, 0x0, 0x0, 0xb, 0x0, 0x0, 0x0}, u16 = {0xffff, 0xffff, 0x9, 0x0, 0xa, 0x0, 0xb, 0x0}, u32 = {0xffffffff, 0x9, 0xa, 0xb}, u64 = {0x9ffffffff, 0xb0000000a}, f32 = {0x0, 0x0, 0x0, 0x0}, f64 = {0x0, 0x0}}
q3   {u8 = {0xc, 0x0, 0x0, 0x0, 0xd, 0x0, 0x0, 0x0, 0xe, 0x0, 0x0, 0x0, 0xf, 0x0, 0x0, 0x0}, u16 = {0xc, 0x0, 0xd, 0x0, 0xe, 0x0, 0xf, 0x0}, u32 = {0xc, 0xd, 0xe, 0xf}, u64 = {0xd0000000c, 0xf0000000e}, f32 = {0x0, 0x0, 0x0, 0x0}, f64 = {0x0, 0x0}}
From the u8 representation we can see that gdb prints the registers as if they were stored little-endian in memory. That's why the u32 representation actually matches our expectations.
On BE I get:
Breakpoint 1, _nettle_salsa20_2core () at salsa20-2core.s:39
39		vldm	r1, {q0,q1,q2,q3}
(gdb) s
40		adr	r12, .Lcount1
(gdb) i r q0 q1 q2 q3
q0   {u8 = {0x0, 0x0, 0x0, 0x2, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x1}, u16 = {0x0, 0x2, 0x0, 0x3, 0x0, 0x0, 0x0, 0x1}, u32 = {0x2, 0x3, 0x0, 0x1}, u64 = {0x200000003, 0x1}, f32 = {0x0, 0x0, 0x0, 0x0}, f64 = {0x0, 0x0}}
q1   {u8 = {0x0, 0x0, 0x0, 0x6, 0x0, 0x0, 0x0, 0x7, 0x0, 0x0, 0x0, 0x4, 0x0, 0x0, 0x0, 0x5}, u16 = {0x0, 0x6, 0x0, 0x7, 0x0, 0x4, 0x0, 0x5}, u32 = {0x6, 0x7, 0x4, 0x5}, u64 = {0x600000007, 0x400000005}, f32 = {0x0, 0x0, 0x0, 0x0}, f64 = {0x0, 0x0}}
q2   {u8 = {0x0, 0x0, 0x0, 0xa, 0x0, 0x0, 0x0, 0xb, 0xff, 0xff, 0xff, 0xff, 0x0, 0x0, 0x0, 0x9}, u16 = {0x0, 0xa, 0x0, 0xb, 0xffff, 0xffff, 0x0, 0x9}, u32 = {0xa, 0xb, 0xffffffff, 0x9}, u64 = {0xa0000000b, 0xffffffff00000009}, f32 = {0x0, 0x0, 0x0, 0x0}, f64 = {0x0, 0x0}}
q3   {u8 = {0x0, 0x0, 0x0, 0xe, 0x0, 0x0, 0x0, 0xf, 0x0, 0x0, 0x0, 0xc, 0x0, 0x0, 0x0, 0xd}, u16 = {0x0, 0xe, 0x0, 0xf, 0x0, 0xc, 0x0, 0xd}, u32 = {0xe, 0xf, 0xc, 0xd}, u64 = {0xe0000000f, 0xc0000000d}, f32 = {0x0, 0x0, 0x0, 0x0}, f64 = {0x0, 0x0}}
Here gdb prints them as if they were stored in big-endian order in memory. So we have to read them in reverse to compare them to the LE output.
That means that if we read the LE output for q0.u32 as 0 1 2 3, the equivalent BE output reads 1 0 3 2, just as you guessed. So you're right, and all my notes in the code are likely wrong because I read the output the wrong way around (again).
>> Otherwise I wonder if it would be possible for both chacha and salsa to change the actual loading and storing so there's no transposing of 32-bit operands. I looked at vld4.32 but that does some fancy de-interleaving and needs two operations to load four q registers.
> The new powerpc code uses load and store instructions that behave the same in this respect, for both BE and LE. But not sure if there's any easy way on ARM. I'm not that familiar with the more special load and store instructions. Would vst2.32 be useful in some way for the final store (and vst3.32 for chacha-3core)?
For this I have found two candidates, once I had wrapped my head around the (de-)interleaving part of VLDn/VSTn:
Option 1: VLDn.dt/VSTn.dt [2, C.13.5, page C-63]: It turns out that the n in VLDn/VSTn is the number of interleaved elements and .dt is the width/datatype of those elements. So vld2.32 loads 32-bit operands from memory which it assumes to hold two interleaved vectors: it sends odd-numbered elements to one register and even-numbered ones to another. We neither need nor want that. That's where vld1/vst1 come in: they do no (de-)interleaving, just sequential loading or storing of elements. (It's already in use in umac-nh.asm but I didn't remember.)
The number of elements it loads depends only on the number of registers given. So vld1.64 {q0, q1}, [r1] does not mean "load one 64-bit operand into some part of q0 or q1" but "load 64-bit operands sequentially without deinterleaving until q0 and q1 are 'full'", i.e. four of them.
So for our case, where we have a matrix of 32-bit words in host endianness that we need to load sequentially into q registers without any transposing, we can use vld1.32 {q0, q1}, [r1].
This is also a drop-in fix for the 64-bit counter addition.
The drawback compared to vldm is that we need to issue two operations to load four q registers, because each vld1/vst1 can only work with up to four d (i.e. two q) registers. This also means that we need to increment the base address for the second load, which requires a scratch register if we want to keep the original value for later reference.
Regarding performance I found a document from ARM for the Cortex-A8 which has some cycle numbers[3]. According to it, two vld1s should take (at worst, i.e. without alignment) six cycles where a vldm would run five cycles for the same number of registers. This doesn't include any mov necessary to initialise the base address scratch register. The element size (e.g. .8 vs. .64) doesn't seem to factor into it at all, and it gets faster with better alignment. Here's a quick calculation with a bit of code for illustration:
C vst1.8 because caller expects results little-endian
C speed:
C 1 q register == 2 d registers, doc talks d registers
C vstm: (number of registers/2) + mod(number of registers, 2) + 1
C   == (8/2) + mod(8, 2) + 1 == 4 + 0 + 1 = 5 cycles
C vst1.8: 2 ops each 4-reg unaligned: 2*3 == 6 cycles
C   (plus potentially mov to set up address counter)
IF_LE(`	vstm	DST, {X0,X1,X2,X3}')
IF_BE(`	vst1.8	{X0,X1}, [DST]!
	vst1.8	{X2,X3}, [DST]')
My feeling is that it doesn't matter much because it happens outside the main loop.
Attached are two patches, be-neon-asm-2.diff and 0002-arm-Unify-neon-asm-for-big-and-little-endian-modes.patch, to illustrate what using those instructions would look like. An armeb CI run is at https://gitlab.com/michaelweiser/nettle/-/jobs/932123909.
As expected, all the special treatment of transposed operands can just go away because the transposition doesn't happen any more. Also, vld1.32 (for sequential loads of 32-bit operands in host endianness) and vst1.8 (for sequential stores of register contents, which gives an implicit little-endian store without any vrev32.u8s) work the same on LE as well as BE. So we could use those as a separate BE implementation and leave the LE code conditionalized but otherwise intact, or we could unify the code to work for both cases without difference.
Option 2: By coincidence I found that vldm/vstm can work with s registers, originally intended for use with VFP. They're just a different view of the d0-d15 or q0-q7 registers. When given s registers as arguments, vldm/vstm behave identically to vld1.32/vst1.32, i.e. they load/store 32-bit words sequentially.
The drawback is that only q0 through q7 are mapped as s0 through s31. So we cannot use that mechanism to load directly into the higher eight q registers. The attached patch be-neon-asm-1.diff showcases what using those would look like. Where necessary, I loaded or stored via s0-s15 (i.e. q0-q3) using vmov or the already present vrev32s. Since that routinely clobbers those lower registers, I needed to add a second reload of the original context into T0-T3 in salsa20-2core.asm.
Also, it's not entirely clear to me from the documentation whether this will work on every ARM core that supports NEON. The NEON programmer's guide[2] states that VLDM/VSTM is a shared VFP/NEON instruction and that s registers *can* be specified. I read that to mean that it will work on every NEON core. It appears that every core that has NEON also has at least VFPv3, but I've found no definite statement to that effect. Some sources speak of NEON as an extension to VFP but I've found no confirmation by ARM.
Also, it does not get rid of all those vrev32.u8s before the stores on BE. All in all, option 1 (vld1/vst1) seems more straightforward and elegant to me. We could also opportunistically use both approaches where they fit best, i.e. vldm/vstm when working with q0-q7 and vld1.{8,32} for q8-q15.
[2] https://static.docs.arm.com/den0018/a/DEN0018A_neon_programmers_guide_en.pdf
[3] https://developer.arm.com/documentation/ddi0344/b/instruction-cycle-timing/i...