Michael Weiser michael.weiser@gmx.de writes:
Longer story for completeness: It seems I ran afoul of gdb's way of displaying registers in memory endianness again. I knew all this once already.[1] I should likely do this more often than every couple of years. ;)
I'm always confused by the conventions for ordering of the components of vector registers. When I write out values in code comments, I try to use the order in which the elements appeared in memory.
So for our case where we have a matrix of 32-bit words in host endianness that we need to load sequentially into q registers without any transposing we can use vld1.32 {q0, q1}, [r1].
This is also a drop-in fix for the 64-bit counter addition.
Sounds good.
The drawback compared to vldm is that we need to issue two operations to load four q registers because each vld1/vst1 can only work with up to four d (i.e. two q) registers. This also means that we need to increment the base address for the second load which requires a scratch register if we want to keep the original value for later reference.
Since we have plenty of registers available (including r3, which seems unused and free to clobber), I'd suggest using
define(`SRCp32', `r3')
and an
add SRCp32, SRC, #32
in function entry, and then leave both SRC and SRCp32 unmodified for the rest of the function.
Regarding performance I found a document from ARM for the Cortex-A8 which had some cycle numbers[2]. According to it, two vld1's should take (at worst/no alignment) six cycles where vldm would run five cycles for the same amount of registers. [...]
My feeling is that it doesn't matter much because it happens outside the main loop.
If it's just a cycle or two per call, I think it's ok.
As expected, all the special treatment of transposed operands can just go away because it doesn't happen any more. Also, vld1.32 (for sequential loads of 32-bit operands in host-endianness) and vst1.8 (for sequential store of register contents to get an implicit little-endian store without any vrev32.u8s) work the same on LE as well as BE.
Neat. Use of vst1.8 is worth a comment in the code (and/or arm/README).
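To illustrate why the byte-wise store yields little-endian output on either kind of host, here is a rough C model of the two instructions. It is only a sketch: the type qreg and the functions model_vld1_32, model_vst1_8 and check_little_endian_store are hypothetical names made up for this illustration, not anything from nettle, and the input words are simply the well-known ChaCha constants. The model assumes that byte lane 4*i + j of a q register holds bits 8*j to 8*j+7 of 32-bit lane i.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of one q register: four 32-bit lanes, held as values. */
typedef struct { uint32_t lane[4]; } qreg;

/* Models vld1.32 {q}, [p]: each lane is read as a 32-bit word in host
   byte order, so lane i ends up with the value of the i-th word. */
qreg model_vld1_32(const void *p)
{
  qreg q;
  memcpy(q.lane, p, sizeof q.lane);
  return q;
}

/* Models vst1.8 {q}, [p]: byte lanes are written in ascending lane
   order. Byte lane 4*i + j is taken to be bits 8*j .. 8*j+7 of 32-bit
   lane i (an assumption of this sketch), so the least significant
   byte of each word is written first. */
void model_vst1_8(uint8_t *p, const qreg *q)
{
  for (int i = 0; i < 4; i++)
    for (int j = 0; j < 4; j++)
      p[4*i + j] = (uint8_t) (q->lane[i] >> (8*j));
}

/* Loads four host-endian words and stores them byte-wise; the result
   is the little-endian serialization on any host. */
void check_little_endian_store(void)
{
  static const uint32_t words[4] =
    { 0x61707865, 0x3320646e, 0x79622d32, 0x6b206574 };
  uint8_t out[16];
  qreg q = model_vld1_32(words);

  model_vst1_8(out, &q);
  assert(memcmp(out, "expand 32-byte k", 16) == 0);
}
```

Nothing in the model depends on the byte order of the machine running it, which is the same property that lets the asm drop the vrev32.u8 fixups.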
Option 2: By coincidence I found that vldm/vstm can work with s registers originally intended for use with VFP. They're just a different view of the d0-d15 or q0-q7 registers. When giving s registers as arguments to vldm/vstm they start to behave identically to vst1.32, i.e. load/save 32-bit words sequentially.
[...]
Also, it's not entirely clear to me from the documentation if this will work on every ARM core that supports NEON. The NEON programmer's guide[3] states that VLDM/VSTM is a shared VFP/NEON instruction and s registers *can* be specified. I read that to mean that it will work on every NEON core. It appears that every core that has NEON also has at least VFP3 but I've found no definite statement to that effect. Some sources speak of NEON as an extension to VFP but I've found no confirmation by ARM.
That sounds a bit complicated, and since there's no great benefit over vld1, maybe best to stay away from that?
All in all, option 1 (vld1/vst1) seems more straightforward and elegant to me.
Sounds good to me too.
From 07c7ea6d62b33aa0c3e176c0e54ffc409fd78516 Mon Sep 17 00:00:00 2001
From: Michael Weiser michael.weiser@gmx.de
Date: Fri, 25 Dec 2020 17:13:52 +0100
Subject: [PATCH 2/2] arm: Unify neon asm for big- and little-endian modes
Switch arm neon assembler routines to endianness-agnostic loads and stores where possible to avoid modifications to the rest of the code. This involves switching to vld1.32 for loading consecutive 32-bit words in host endianness as well as vst1.8 for storing back to memory in little-endian order as required by the caller.
I like this approach. It would be nice if you could benchmark it on little-endian, to verify that there's no unexpectedly large speed regression (a regression of just a cycle or two per block, if that's at all measurable, is ok, I think).
PROLOGUE(_nettle_chacha_3core)
-	vldm	SRC, {X0,X1,X2,X3}
+	mov	r12, SRC
+	vld1.32	{X0,X1}, [r12]!
+	vld1.32	{X2,X3}, [r12]
My suggestion is to do this as
	add	SRCp32, SRC, #32
	vld1.32	{X0,X1}, [SRC]
	vld1.32	{X2,X3}, [SRCp32]
and reuse SRCp32 for the second load of the same data, further down (assuming r3 really is free to use for this purpose; if we have to save and restore a register to do this, your approach with temporary use of r12 seems better). Another option, with no need for an extra register, is to just use post-increment, modifying SRC here. And either explicitly subtract 32, or use opposite load order and pre-decrement for the second load.
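For comparison, the first two addressing variants can be modelled with plain pointer arithmetic in C. This is only a sketch of the bookkeeping, not of the asm itself: load_two_pointers, load_postinc, check_addressing_variants and HALF are hypothetical names for this illustration, and each memcpy stands in for one vld1.32 of two q registers (32 bytes).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { HALF = 32 };  /* one vld1.32 of two q registers moves 32 bytes */

/* Variant 1: a second base pointer computed once at function entry.
   Neither pointer is modified, so both can be reused further down. */
void load_two_pointers(uint8_t dst[64],
                       const uint8_t *src, const uint8_t *srcp32)
{
  memcpy(dst, src, HALF);            /* vld1.32 {X0,X1}, [SRC]    */
  memcpy(dst + HALF, srcp32, HALF);  /* vld1.32 {X2,X3}, [SRCp32] */
}

/* Variant 2: post-increment writeback on the base register, followed
   by an explicit subtract. Returns the restored base pointer. */
const uint8_t *load_postinc(uint8_t dst[64], const uint8_t *src)
{
  memcpy(dst, src, HALF);            /* vld1.32 {X0,X1}, [SRC]!   */
  src += HALF;                       /* writeback: SRC += 32      */
  memcpy(dst + HALF, src, HALF);     /* vld1.32 {X2,X3}, [SRC]    */
  return src - HALF;                 /* sub SRC, SRC, #32         */
}

/* Both variants load the same 64 bytes and leave the base usable. */
void check_addressing_variants(void)
{
  uint8_t state[64], a[64], b[64];

  for (int i = 0; i < 64; i++)
    state[i] = (uint8_t) i;

  load_two_pointers(a, state, state + HALF);
  assert(load_postinc(b, state) == state);
  assert(memcmp(a, state, 64) == 0);
  assert(memcmp(a, b, 64) == 0);
}
```

The trade-off is the one described above: variant 1 spends one extra register and one add at entry, variant 2 spends one extra instruction at each place the data is loaded.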
Regards, /Niels