Hello Niels,
On Sat, Dec 19, 2020 at 09:51:45AM +0100, Niels Möller wrote:
Porting over the basic IF_[LB]E mechanism from chacha-core-internal was easy and fixed up the first of the three interleaved blocks right away. For the other two I am still in the process of wrapping my head around how the interleaving works and how it would need some adjustment for BE.
The 3-way functions don't do anything fancy: each of the three blocks is represented in separate registers, and the same instruction sequence as for the 1-way version is duplicated three times and interleaved.
I've got the tests passing for chacha now. Apart from the straightforward porting-over of the BE shift and reverse-on-store logic from chacha20-core-internal.asm, special treatment is necessary for the part of the state that's treated as a 64-bit counter. The two 32-bit words it consists of are in host endianness but in consecutive order, so they get swapped by the BE load. This is actually the case for all 32-bit operands throughout the routine on BE (and for chacha-core-internal as well) and cancels itself out on the final store. But for the 64-bit counter it needs to be taken into account for the addition to produce correct results.
See the attached patch for my current approach to fixing it, which is explicit transposing, adding and then transposing again to be as transposed as the other operands. I wonder if the surrounding C code could be changed to supply that part of the state as a 64-bit doubleword in host endianness to the assembler routine to cut down on adjustment.
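To make the transpose, add, transpose idea concrete, here is a rough C model of a single 64-bit lane. It is only a sketch and not the code from the patch: swap_halves and counter_add_swapped are names I made up, and I'm assuming the lane holds the counter with its two 32-bit halves exchanged, as happens after the BE load.

```c
#include <stdint.h>

/* Exchange the two 32-bit halves of one 64-bit lane. This models what
   reversing the 32-bit elements within a doubleword does to that lane. */
static uint64_t swap_halves(uint64_t lane)
{
  return (lane << 32) | (lane >> 32);
}

/* Hypothetical model: 'lane' holds the counter with its halves swapped.
   Undo the swap, add as a normal 64-bit integer, then swap again so the
   result is as transposed as all the other operands. */
static uint64_t counter_add_swapped(uint64_t lane, uint64_t n)
{
  return swap_halves(swap_halves(lane) + n);
}
```

For a counter value of 0xffffffff the swapped lane is 0xffffffff00000000, and adding one has to carry into the word that sits in the low half of the lane.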
Alternatively, could the 64-bit operation be broken down into two 32-bit operations which implicitly adjust to the transposed 32-bit words on BE?
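What I have in mind with two 32-bit operations would look roughly like this in C. Again just a sketch under the same assumption (low counter word in the upper half of the lane, high counter word in the lower half); counter_add_split is a made-up name and I have not worked out what the equivalent NEON sequence would cost.

```c
#include <stdint.h>

/* Hypothetical model: add n to a 64-bit counter whose low word sits in
   the upper half of the lane and whose high word sits in the lower half.
   The carry is propagated by hand, so no 64-bit add and no swap is needed. */
static uint64_t counter_add_split(uint64_t lane, uint32_t n)
{
  uint32_t lo = (uint32_t)(lane >> 32);  /* low counter word */
  uint32_t hi = (uint32_t)lane;          /* high counter word */
  uint32_t sum = lo + n;                 /* wraps modulo 2^32 */

  hi += (sum < lo);                      /* carry out of the low word */
  return ((uint64_t)sum << 32) | hi;
}
```

The carry detection (sum < lo) is the part that would need an extra compare and add in the assembly, so it's not obvious this would be any cheaper than the explicit transposing.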
The 2-way version (for ARM, that's salsa only) tries to be a bit more clever, with registers representing either odd or even words from both blocks.
For a start this also needs adjustment for the 64-bit counter treatment.
Not sure how endianness affects the code to move words around.
The routine "suffers" from the same effect as chacha: The 32-bit input operands are in host order in memory and their individual values end up correctly in the registers. But since vldm loads consecutive 64-bit values, it ends up transposing 32-bit words that comprise the 64-bit register value. After the initial swap and transpose operations, the X and Y matrices are basically correctly filled but flipped two ways.
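As a simplified picture of that load effect, here is a C model of one 64-bit register filled from two consecutive 32-bit words in memory. It is an illustration only: load64_le, load64_be and lane_elem are hypothetical helpers, and I'm assuming that 32-bit element 0 of a register is its least significant half.

```c
#include <stdint.h>

/* Two consecutive 32-bit words w0, w1 (w0 at the lower address) loaded
   as one 64-bit value on a little-endian host: w0 is the low half. */
static uint64_t load64_le(uint32_t w0, uint32_t w1)
{
  return ((uint64_t)w1 << 32) | w0;
}

/* The same load on a big-endian host: w0 becomes the high half. */
static uint64_t load64_be(uint32_t w0, uint32_t w1)
{
  return ((uint64_t)w0 << 32) | w1;
}

/* 32-bit element i (0 or 1) of a lane; element 0 is the low half. */
static uint32_t lane_elem(uint64_t lane, unsigned i)
{
  return (uint32_t)(lane >> (32 * (i & 1)));
}
```

Each word keeps its correct value, but on BE the word from the lower address ends up in element 1 instead of element 0, which is the transposition I'm seeing in the registers.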
I've tried to document what I see in the registers on armeb to get a handle on how to proceed:
 	vtrn.32	X0, Y3		C X0:  0  0  2  2  Y3:  1  1  3  3
 	vtrn.32	X1, Y0		C X1:  4  4  6  6  Y0:  5  5  7  7
-	vtrn.32	X2, Y1		C X2:  8  8 10 10  Y1:  9  9  1  1 <- typo?
+	vtrn.32	X2, Y1		C X2:  8  8 10 10  Y1:  9  9 11 11
 	vtrn.32	X3, Y2		C X3: 12 12 14 14  Y2: 13 13 15 15
+	C BE:
+	C X0:  3  3  1  1  Y3:  2  2  0  0
+	C X1:  7  7  5  5  Y0:  6  6  4  4
+	C X2: 11 11  9  9  Y1: 10 10  8  8
+	C X3: 15 15 13 13  Y2: 14 14 12 12

 	C Swap, to get
 	C X0:  0 10  Y0:  5 15
 	C X1:  4 14  Y1:  9  3
 	C X2:  8  2  Y2: 13  7
 	C X3: 12  6  Y3:  1 11
 	vswp	D1REG(X0), D1REG(X2)
 	vswp	D1REG(X1), D1REG(X3)
 	vswp	D1REG(Y0), D1REG(Y2)
 	vswp	D1REG(Y1), D1REG(Y3)
+	C BE:
+	C X0: 11  1  Y0: 14  4
+	C X1: 15  5  Y1:  2  8
+	C X2:  3  9  Y2:  6 12
+	C X3:  7 13  Y3: 10  0
I wonder if the code working on them contains some symmetry that could be exploited to (with minimal changes) get correct results on these transposed matrices.
Otherwise I wonder if it would be possible for both chacha and salsa to change the actual loading and storing so there's no transposing of 32-bit operands. I looked at vld4.32 but that does some fancy de-interleaving and needs two operations to load four q registers.
Otherwise we'd need a lot of vrev64.u32s to basically revert the 32-bit transposition happening upon load and save to end up with identical matrices to LE.