Re: Release of Nettle-3.7?

25 Dec 2020


Michael Weiser <michael.weiser@gmx.de> writes:
...
Longer story for completeness: It seems I ran afoul of gdb's way of
displaying registers in memory endianness again. I knew all this once
already.[1] I should likely do this more often than every couple of
years. ;)
I'm always confused by the conventions for ordering of the components of
vector registers. When I write out values in code comments, I try to use
the order in which the elements appeared in memory.
...
So for our case where we have a matrix of 32-bit words in host
endianness that we need to load sequentially into q registers without
any transposing we can use vld1.32 {q0, q1}, [r1].
This is also a drop-in fix for the 64-bit counter addition.
Sounds good.
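
To illustrate why the element size of the load matters on big-endian, here is a small C model (not NEON code; all function names are made up for this sketch). It compares loading the same memory as one 64-bit doubleword, roughly what vldm does with d/q registers, against loading it as two 32-bit elements, as vld1.32 does. It assumes that lane 0 of a d register is its least significant 32 bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Write a 32-bit word to simulated memory in the given byte order. */
static void
put_word32(uint8_t *mem, uint32_t w, int big_endian)
{
  for (int i = 0; i < 4; i++)
    {
      int shift = big_endian ? 24 - 8 * i : 8 * i;
      mem[i] = (uint8_t) (w >> shift);
    }
}

/* Read n bytes (n <= 8) from simulated memory as one integer. */
static uint64_t
get_uint(const uint8_t *mem, size_t n, int big_endian)
{
  uint64_t v = 0;
  assert(n <= 8);
  for (size_t i = 0; i < n; i++)
    {
      size_t shift = big_endian ? 8 * (n - 1 - i) : 8 * i;
      v |= (uint64_t) mem[i] << shift;
    }
  return v;
}

/* Model of a vldm-style load of one d register: a single 64-bit access.
   lane[0] is the low half of the register, lane[1] the high half. */
static void
model_load_d64(const uint8_t *mem, int big_endian, uint32_t lane[2])
{
  uint64_t d = get_uint(mem, 8, big_endian);
  lane[0] = (uint32_t) d;
  lane[1] = (uint32_t) (d >> 32);
}

/* Model of a vld1.32-style load of one d register: two 32-bit element
   accesses, element i taken from memory word i. */
static void
model_load_2x32(const uint8_t *mem, int big_endian, uint32_t lane[2])
{
  lane[0] = (uint32_t) get_uint(mem, 4, big_endian);
  lane[1] = (uint32_t) get_uint(mem + 4, 4, big_endian);
}
```

In this model the two loads agree on little-endian, while on big-endian the 64-bit load swaps the two words within each d register and the 32-bit element load keeps them in memory order. With the low word in lane 0 and the high word in lane 1, the d register also reads as the expected 64-bit value, which is presumably why the same load fixes the counter addition.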
...
The drawback compared to vldm is that we need to issue two operations to
load four q registers because each vld1/vst1 can only work with up to
four d (i.e. two q) registers. This also means that we need to increment
the base address for the second load which requires a scratch register
if we want to keep the original value for later reference.
Since we have plenty of registers available (including r3, which seems
unused and free to clobber), I'd suggest using
    define(`SRCp32', `r3')
and an
    add SRCp32, SRC, #32
at function entry, and then leaving both SRC and SRCp32 unmodified for the
rest of the function.
...
Regarding performance I found a document from ARM for the Cortex-A8
which had some cycle numbers[2]. According to it, two vld1's should take
(at worst/no alignment) six cycles where vldm would run five cycles for the
same amount of registers. [...]
...
My feeling is that it doesn't matter much because it happens outside the
main loop.
If it's just a cycle or two per call, I think it's ok.
...
As expected, all the special treatment of transposed operands can just
go away because it doesn't happen any more. Also, vld1.32 (for
sequential loads of 32-bit operands in host-endianness) and vst1.8 (for
sequential store of register contents to get an implicit little-endian
store without any vrev32.u8s) work the same on LE as well as BE.
Neat. Use of vst1.8 is worth a comment in the code (and/or arm/README).
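
As a sketch of what such a comment could explain, the following C model (again not NEON code, with invented function names) contrasts a byte-wise store such as vst1.8 with a word-wise store such as vst1.32. It assumes that byte element 0 of a register is the least significant byte of lane 0, so a byte-wise store never depends on the memory byte order.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of a vst1.8-style store of 32-bit lanes: the register is written
   one byte element at a time, least significant byte of each lane first.
   There is no byte-order parameter because single bytes are never
   swapped. */
static void
model_store_bytes(uint8_t *mem, const uint32_t *lane, size_t nlanes)
{
  for (size_t i = 0; i < nlanes; i++)
    for (size_t j = 0; j < 4; j++)
      mem[4 * i + j] = (uint8_t) (lane[i] >> (8 * j));
}

/* Model of a vst1.32-style store: each lane is written as one 32-bit
   access, so the byte layout follows the memory byte order. */
static void
model_store_words(uint8_t *mem, const uint32_t *lane, size_t nlanes,
                  int big_endian)
{
  for (size_t i = 0; i < nlanes; i++)
    for (size_t j = 0; j < 4; j++)
      {
        size_t shift = big_endian ? 24 - 8 * j : 8 * j;
        mem[4 * i + j] = (uint8_t) (lane[i] >> shift);
      }
}
```

In this model the byte-wise store always produces little-endian output, whereas the word-wise store matches it only on little-endian; on big-endian it would need a byte reversal first.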
...
Option 2: By coincidence I found that vldm/vstm can work with s
registers originally intended for use with VFP. They're just a different
view of the d0-d15 or q0-q7 registers. When giving s registers as
arguments to vldm/vstm they start to behave identically to vld1.32/vst1.32, i.e.
load/save 32-bit words sequentially.
[...]
...
Also, it's not entirely clear to me from the documentation if this will
work on every ARM core that supports NEON. The NEON programmer's
guide[3] states that VLDM/VSTM is a shared VFP/NEON instruction and s
registers *can* be specified. I read that to mean that it will work on
every NEON core. It appears that every core that has NEON also has at
least VFP3 but I've found no definite statement to that effect.  Some
sources speak of NEON as an extension to VFP but I've found no
confirmation by ARM.
That sounds a bit complicated, and since there's no great benefit over
vld1, maybe best to stay away from that?
...
All in all, option 1 (vld1/vst1) seems more straightforward and
elegant to me.
Sounds good to me too.
...
From 07c7ea6d62b33aa0c3e176c0e54ffc409fd78516 Mon Sep 17 00:00:00 2001
From: Michael Weiser <michael.weiser@gmx.de>
Date: Fri, 25 Dec 2020 17:13:52 +0100
Subject: [PATCH 2/2] arm: Unify neon asm for big- and little-endian modes
Switch arm neon assembler routines to endianness-agnostic loads and
stores where possible to avoid modifications to the rest of the code.
This involves switching to vld1.32 for loading consecutive 32-bit words
in host endianness as well as vst1.8 for storing back to memory in
little-endian order as required by the caller.
I like this approach. It would be nice if you could benchmark it on
little-endian, to verify that there's no unexpectedly large speed
regression (a regression of just a cycle or two per block, if that's at
all measurable, is ok, I think).
...
PROLOGUE(_nettle_chacha_3core)

-	vldm	SRC, {X0,X1,X2,X3}
+	mov	r12, SRC
+	vld1.32	{X0,X1}, [r12]!
+	vld1.32	{X2,X3}, [r12]

My suggestion is to do this as
    add	SRCp32, SRC, #32
    vld1.32	{X0,X1}, [SRC]
    vld1.32	{X2,X3}, [SRCp32]
and reuse SRCp32 for the second load of the same data, further down
(assuming r3 really is free to use for this purpose; if we have to save
and restore a register to do this, your approach with temporary use of
r12 seems better). Another option, with no need for an extra register,
is to just use post-increment, modifying SRC here. And either explicitly
subtract 32, or use opposite load order and pre-decrement for the second
load.
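
For comparison, the first two addressing options can be written as plain C pointer arithmetic. This is only an analogy for the register usage: the function names are invented, memcpy stands in for a 32-byte vld1.32 load of two q registers, and src_p32 plays the role of SRCp32.

```c
#include <stdint.h>
#include <string.h>

/* Option with an extra register: compute src + 32 bytes once and leave
   both pointers unmodified afterwards. */
static void
load_with_second_pointer(const uint32_t *src, uint32_t x[16])
{
  const uint32_t *src_p32 = src + 8;   /* 8 words = 32 bytes */
  memcpy(x, src, 32);                  /* first two q registers */
  memcpy(x + 8, src_p32, 32);          /* last two q registers */
}

/* Option without an extra register: post-increment src for the second
   load, then subtract 32 bytes again. Returns the final pointer so the
   caller can see it is back at its original value. */
static const uint32_t *
load_with_post_increment(const uint32_t *src, uint32_t x[16])
{
  memcpy(x, src, 32);
  src += 8;                            /* post-increment by 32 bytes */
  memcpy(x + 8, src, 32);
  src -= 8;                            /* explicit subtract of 32 */
  return src;
}
```

The first variant costs one add at function entry and one register; the second costs an extra subtract each time the data is reloaded.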
Regards,
/Niels
-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
