Hello Niels,
On Mon, Dec 21, 2020 at 09:16:25PM +0100, Niels Möller wrote:
> What's the layout before the transpose, immediately after load? I'd guess you get X1: 1 0 3 2?
TL;DR: Yes, it is. I abandoned this approach for now though, since I found some options to eliminate the word transposition effect of vldm/vstm in the first place (see below).
Longer story for completeness: It seems I ran afoul of gdb's way of displaying registers in memory endianness again. I knew all this once already.[1] I should likely do this more often than every couple of years. ;)
[1] https://marc.info/?l=nettle-bugs&m=152436948907236&w=2
On LE I get after the initial load:
Breakpoint 1, _nettle_salsa20_2core () at salsa20-2core.s:39
39		vldm	r1, {q0,q1,q2,q3}
(gdb) s
40		adr	r12, .Lcount1
(gdb) i r q0 q1 q2 q3
q0   {u8 = {0x0, 0x0, 0x0, 0x0, 0x1, 0x0, 0x0, 0x0, 0x2, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0}, u16 = {0x0, 0x0, 0x1, 0x0, 0x2, 0x0, 0x3, 0x0}, u32 = {0x0, 0x1, 0x2, 0x3}, u64 = {0x100000000, 0x300000002}, f32 = {0x0, 0x0, 0x0, 0x0}, f64 = {0x0, 0x0}}
q1   {u8 = {0x4, 0x0, 0x0, 0x0, 0x5, 0x0, 0x0, 0x0, 0x6, 0x0, 0x0, 0x0, 0x7, 0x0, 0x0, 0x0}, u16 = {0x4, 0x0, 0x5, 0x0, 0x6, 0x0, 0x7, 0x0}, u32 = {0x4, 0x5, 0x6, 0x7}, u64 = {0x500000004, 0x700000006}, f32 = {0x0, 0x0, 0x0, 0x0}, f64 = {0x0, 0x0}}
q2   {u8 = {0xff, 0xff, 0xff, 0xff, 0x9, 0x0, 0x0, 0x0, 0xa, 0x0, 0x0, 0x0, 0xb, 0x0, 0x0, 0x0}, u16 = {0xffff, 0xffff, 0x9, 0x0, 0xa, 0x0, 0xb, 0x0}, u32 = {0xffffffff, 0x9, 0xa, 0xb}, u64 = {0x9ffffffff, 0xb0000000a}, f32 = {0x0, 0x0, 0x0, 0x0}, f64 = {0x0, 0x0}}
q3   {u8 = {0xc, 0x0, 0x0, 0x0, 0xd, 0x0, 0x0, 0x0, 0xe, 0x0, 0x0, 0x0, 0xf, 0x0, 0x0, 0x0}, u16 = {0xc, 0x0, 0xd, 0x0, 0xe, 0x0, 0xf, 0x0}, u32 = {0xc, 0xd, 0xe, 0xf}, u64 = {0xd0000000c, 0xf0000000e}, f32 = {0x0, 0x0, 0x0, 0x0}, f64 = {0x0, 0x0}}
From the u8 representation we can see that gdb prints the registers as if they were stored little-endian in memory. That's why the u32 representation actually matches our expectations.
On BE I get:
Breakpoint 1, _nettle_salsa20_2core () at salsa20-2core.s:39
39		vldm	r1, {q0,q1,q2,q3}
(gdb) s
40		adr	r12, .Lcount1
(gdb) i r q0 q1 q2 q3
q0   {u8 = {0x0, 0x0, 0x0, 0x2, 0x0, 0x0, 0x0, 0x3, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x1}, u16 = {0x0, 0x2, 0x0, 0x3, 0x0, 0x0, 0x0, 0x1}, u32 = {0x2, 0x3, 0x0, 0x1}, u64 = {0x200000003, 0x1}, f32 = {0x0, 0x0, 0x0, 0x0}, f64 = {0x0, 0x0}}
q1   {u8 = {0x0, 0x0, 0x0, 0x6, 0x0, 0x0, 0x0, 0x7, 0x0, 0x0, 0x0, 0x4, 0x0, 0x0, 0x0, 0x5}, u16 = {0x0, 0x6, 0x0, 0x7, 0x0, 0x4, 0x0, 0x5}, u32 = {0x6, 0x7, 0x4, 0x5}, u64 = {0x600000007, 0x400000005}, f32 = {0x0, 0x0, 0x0, 0x0}, f64 = {0x0, 0x0}}
q2   {u8 = {0x0, 0x0, 0x0, 0xa, 0x0, 0x0, 0x0, 0xb, 0xff, 0xff, 0xff, 0xff, 0x0, 0x0, 0x0, 0x9}, u16 = {0x0, 0xa, 0x0, 0xb, 0xffff, 0xffff, 0x0, 0x9}, u32 = {0xa, 0xb, 0xffffffff, 0x9}, u64 = {0xa0000000b, 0xffffffff00000009}, f32 = {0x0, 0x0, 0x0, 0x0}, f64 = {0x0, 0x0}}
q3   {u8 = {0x0, 0x0, 0x0, 0xe, 0x0, 0x0, 0x0, 0xf, 0x0, 0x0, 0x0, 0xc, 0x0, 0x0, 0x0, 0xd}, u16 = {0x0, 0xe, 0x0, 0xf, 0x0, 0xc, 0x0, 0xd}, u32 = {0xe, 0xf, 0xc, 0xd}, u64 = {0xe0000000f, 0xc0000000d}, f32 = {0x0, 0x0, 0x0, 0x0}, f64 = {0x0, 0x0}}
Here gdb prints them as if they were stored in big-endian order in memory. So we have to read them in reverse to compare them to the LE output.
That means that if we read the LE output for q0.u32 as 0 1 2 3, the equivalent BE output reads 1 0 3 2, just as you guessed. So you're right, and all my notes in the code are likely wrong because I read the output the wrong way around (again).
>> Otherwise I wonder if it would be possible for both chacha and salsa to change the actual loading and storing so there's no transposing of 32-bit operands. I looked at vld4.32 but that does some fancy de-interleaving and needs two operations to load four q registers.
> The new powerpc code uses load and store instructions that behave the same in this respect, for both BE and LE. But not sure if there's any easy way on ARM. I'm not that familiar with the more special load and store instructions. Would vst2.32 be useful in some way for the final store (and vst3.32 for chacha-3core)?
For this I have found two candidates, once I had wrapped my head around the (de-)interleaving part of VLDn/VSTn:
Option 1: VLDn.dt/VSTn.dt [2, C.13.5, page C-63]: It turns out that the n in VLDn/VSTn is the number of interleaved elements and .dt is the width/datatype of those elements. So vld2.32 loads 32-bit operands from memory which it assumes to hold two interleaved vectors: it sends odd-numbered elements to one register and even-numbered ones to another. We neither need nor want that. That's where vld1/vst1 come in: they do no (de-)interleaving, just sequential loading or storing of elements. (It's already in use in umac-nh.asm but I didn't remember.)
The number of elements it loads depends only on the number of registers given. So vld1.64 {q0, q1}, [r1] does not mean "load one 64-bit operand into some part of q0 or q1" but "load 64-bit operands sequentially without deinterleaving until q0 and q1 are 'full'", i.e. four of them.
So for our case, where we have a matrix of 32-bit words in host endianness that we need to load sequentially into q registers without any transposing, we can use vld1.32 {q0, q1}, [r1].
This is also a drop-in fix for the 64-bit counter addition.
The drawback compared to vldm is that we need to issue two operations to load four q registers, because each vld1/vst1 can only work with up to four d (i.e. two q) registers. This also means that we need to increment the base address for the second load, which requires a scratch register if we want to keep the original value for later reference.
Regarding performance I found a document from ARM for the Cortex-A8 which has some cycle numbers[3]. According to it, two vld1s should take (at worst, i.e. without alignment) six cycles where a vldm would run five cycles for the same number of registers. This doesn't include any mov necessary to initialise the base address scratch register. The element size (e.g. .8 vs. .64) doesn't seem to factor into it at all, and it gets faster with better alignment. Here's a quick calculation with a bit of code for illustration:
C vst1.8 because caller expects results little-endian
C speed:
C 1 q register == 2 d registers, doc talks d registers
C vstm: (number of registers/2) + mod(number of registers, 2) + 1
C   == (8/2) + mod(8, 2) + 1 == 4 + 0 + 1 = 5 cycles
C vst1.8: 2 ops each 4-reg unaligned: 2*3 == 6 cycles
C   (plus potentially mov to set up address counter)
IF_LE(`	vstm	DST, {X0,X1,X2,X3}')
IF_BE(`	vst1.8	{X0,X1}, [DST]!
	vst1.8	{X2,X3}, [DST]')
My feeling is that it doesn't matter much because it happens outside the main loop.
Attached are two patches, be-neon-asm-2.diff and 0002-arm-Unify-neon-asm-for-big-and-little-endian-modes.patch, to illustrate what using those instructions would look like. An armeb CI run is at https://gitlab.com/michaelweiser/nettle/-/jobs/932123909.
As expected, all the special treatment of transposed operands can just go away because the transposition doesn't happen any more. Also, vld1.32 (for sequential loads of 32-bit operands in host endianness) and vst1.8 (for sequential stores of register contents, which gives an implicit little-endian store without any vrev32.u8s) work the same on LE as well as BE. So we could use those as a separate BE implementation and leave the LE code conditionalized but otherwise intact, or we could unify the code to work for both cases without difference.
Option 2: By coincidence I found that vldm/vstm can work with s registers, originally intended for use with VFP. They're just a different view of the d0-d15 or q0-q7 registers. When given s registers as arguments, vldm/vstm behave identically to vld1.32/vst1.32, i.e. they load/store 32-bit words sequentially.
The drawback is that only q0 through q7 are mapped as s0 through s31. So we cannot use that mechanism to load directly into the higher eight q registers. The attached patch be-neon-asm-1.diff showcases what using those would look like. Where necessary, I loaded or stored via s0-s15 (i.e. q0-q3) using vmov or the already present vrev32s. Since that routinely clobbers those lower registers, I needed to add a second reload of the original context into T0-T3 in salsa20-2core.asm.
Also, it's not entirely clear to me from the documentation whether this will work on every ARM core that supports NEON. The NEON programmer's guide[2] states that VLDM/VSTM is a shared VFP/NEON instruction and that s registers *can* be specified. I read that to mean that it will work on every NEON core. It appears that every core that has NEON also has at least VFPv3, but I've found no definite statement to that effect. Some sources speak of NEON as an extension to VFP but I've found no confirmation by ARM.
Also, it does not get rid of all those vrev32.u8s before the stores on BE. All in all, option 1 (vld1/vst1) seems more straightforward and elegant to me. We could also opportunistically use both approaches where they fit best, i.e. vldm/vstm when working with q0-q7 and vld1.{8,32} for q8-q15.
[2] https://static.docs.arm.com/den0018/a/DEN0018A_neon_programmers_guide_en.pdf
[3] https://developer.arm.com/documentation/ddi0344/b/instruction-cycle-timing/i...