Michael Weiser <michael.weiser@gmx.de> writes:
See the attached patch for my current approach to fixing it, which is explicit transposing, adding and then transposing again to be as transposed as the other operands.
I haven't yet read the code, but I have some comments based on your description only.
I wonder if the surrounding C code could be changed to supply that part of the state as a 64-bit doubleword in host endianness to the assembler routine to cut down on adjustment.
I think it will be a bit cumbersome to change the interface to the C code.
Alternatively, could the 64-bit operation be broken down into two 32-bit operations which implicitly adjust to the transposed 32-bit words on BE?
Maybe. But we still need to propagate the carry. Can that be done in a better way than transpose, 64-bit add, transpose?
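To make the carry question concrete, here is a small Python model of the arithmetic only. It is not NEON code and says nothing about which variant is cheaper on ARM. The function names are made up for illustration, and the increment is assumed to fit in 32 bits.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def add64_direct(lo, hi, inc):
    """Reference: treat (hi, lo) as one 64-bit counter and add inc."""
    total = (((hi << 32) | lo) + inc) & MASK64
    return total & MASK32, total >> 32

def add64_two_halves(lo, hi, inc):
    """Two 32-bit additions with an explicit carry (inc must fit in 32 bits)."""
    new_lo = (lo + inc) & MASK32
    # The low word wrapped around exactly when it ended up smaller than inc.
    carry = 1 if new_lo < inc else 0
    return new_lo, (hi + carry) & MASK32

if __name__ == "__main__":
    for lo, hi, inc in [(0, 0, 1), (0xFFFFFFFF, 0, 1), (0xFFFFFFFE, 5, 3)]:
        print(add64_direct(lo, hi, inc), add64_two_halves(lo, hi, inc))
```

The carry here comes from an unsigned comparison of the new low word against the increment. Whether such a compare-and-adjust sequence beats transpose, 64-bit add, transpose in the actual assembly would have to be measured.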
I've tried to document what I see in the registers on armeb to get a handle on how to proceed:
vtrn.32 X0, Y3 C X0: 0 0 2 2 Y3: 1 1 3 3
vtrn.32 X1, Y0 C X1: 4 4 6 6 Y0: 5 5 7 7
- vtrn.32 X2, Y1 C X2: 8 8 10 10 Y1: 9 9 1 1 <- typo?
+ vtrn.32 X2, Y1 C X2: 8 8 10 10 Y1: 9 9 11 11
Indeed a typo. I just checked in the fix, thanks!
vtrn.32 X3, Y2 C X3: 12 12 14 14 Y2: 13 13 15 15
C BE:
C X0: 3 3 1 1 Y3: 2 2 0 0
C X1: 7 7 5 5 Y0: 6 6 4 4
C X2: 11 11 9 9 Y1: 10 10 8 8
C X3: 15 15 13 13 Y2: 14 14 12 12
Also, it's somewhat important to keep track of which block a word belongs to. In the LE code, X0 really is A0 B0 A2 B2, where A refers to the first block, and B to the second.
What's the layout before the transpose, immediately after load? I'd guess you get X0: 1 0 3 2?
For the little endian code, the transpose can be viewed as
  X0: A0 A1 A2 A3
        /     /        / denotes elements swapped.
  Y3: B0 B1 B2 B3
If instead we start with the order 1 0 3 2, we get the same result (but with registers swapped) if we do
  Y3: B1 B0 B3 B2
        \     \
  X0: A1 A0 A3 A2
So I would expect there's some clever way to get the BE case to work with about the same number of transpose instructions, even if initial word order is somewhat different.
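The transpose argument can be checked with a small lane-level model in Python. This is only a sketch: vtrn32 is a made-up helper that models vtrn.32 on two 4-lane registers as swapping lane 1 of the first operand with lane 0 of the second, and lane 3 with lane 2. The 1 0 3 2 starting order is the guess from above, not something read off real hardware.

```python
def vtrn32(d, m):
    """Model of vtrn.32 d, m: swap d[1] with m[0] and d[3] with m[2]."""
    d, m = list(d), list(m)
    d[1], m[0] = m[0], d[1]
    d[3], m[2] = m[2], d[3]
    return d, m

# Little-endian picture: words in natural order after load.
x0_le, y3_le = vtrn32(["A0", "A1", "A2", "A3"], ["B0", "B1", "B2", "B3"])

# Assumed starting order 1 0 3 2 in each register (the guessed BE layout).
x0_be, y3_be = vtrn32(["A1", "A0", "A3", "A2"], ["B1", "B0", "B3", "B2"])

if __name__ == "__main__":
    print("LE:", x0_le, y3_le)
    print("BE:", x0_be, y3_be)
```

In this model the same instruction applied to the word-swapped inputs yields the same two rows, only with X0 and Y3 exchanged, which is the "same result but with registers swapped" observation.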
I wonder if the code working on them contains some symmetry that could be exploited to (with minimal changes) get correct results on these transposed matrices.
At least, both blocks are treated equally (except that the initial counter addition is done to only the second block, and that the final result is written in the right order). So it doesn't matter if X0 contains A0 B0 A2 B2 or B0 A0 B2 A2. And unlike the one-way code, we only use
vext.32 ... #2
to rotate data between rounds, never #1 or #3.
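A quick way to see why rotating by #2 is harmless here, again as a Python lane model rather than real NEON code: rotating by two lanes commutes with exchanging the two words inside each 64-bit half, while rotating by one or three lanes does not. The helpers vext32 and swap_pairs are illustrative names, and vext32 models only the case where both source operands are the same register.

```python
def vext32(v, k):
    """Model of vext.32 with the same register as both sources: rotate lanes by k."""
    return v[k:] + v[:k]

def swap_pairs(v):
    """Exchange the two words in each 64-bit half: [a, b, c, d] -> [b, a, d, c]."""
    return [v[1], v[0], v[3], v[2]]

def commutes(v, k):
    """True if rotating by k then swapping pairs equals swapping pairs then rotating."""
    return swap_pairs(vext32(v, k)) == vext32(swap_pairs(v), k)

row = ["A0", "B0", "A2", "B2"]

if __name__ == "__main__":
    for k in (1, 2, 3):
        print(k, commutes(row, k))
```

So a row held as B0 A0 B2 A2 instead of A0 B0 A2 B2 stays consistently swapped through every #2 rotation, which would not hold if #1 or #3 were used.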
Otherwise I wonder if it would be possible for both chacha and salsa to change the actual loading and storing so there's no transposing of 32-bit operands. I looked at vld4.32 but that does some fancy de-interleaving and needs two operations to load four q registers.
The new powerpc code uses load and store instructions that behave the same in this respect, for both BE and LE. But not sure if there's any easy way on ARM. I'm not that familiar with the more special load and store instructions. Would vst2.32 be useful in some way for the final store (and vst3.32 for chacha-3core)?
Otherwise we'd need a lot of vrev64.u32s to basically revert the 32-bit transposition happening upon load and save to end up with identical matrices to LE.
If that's an easier way to get it working, I think it's a good start. I'd expect that would still give a reasonable speedup over the 1-way version.
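For reference, the brute-force variant can be modelled the same way. The sketch below assumes that a BE load leaves each register in 1 0 3 2 word order (the guess made earlier, not a verified fact) and models vrev64.32 as exchanging the two 32-bit words in each 64-bit half. The function name vrev64_32 and the sample matrix are illustrative only.

```python
def vrev64_32(q):
    """Model of vrev64.32 on a 4-lane register: [a, b, c, d] -> [b, a, d, c]."""
    return [q[1], q[0], q[3], q[2]]

# Word indices of one 4x4 state matrix as the LE code sees them after load.
le_matrix = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]

# Assumed BE view of the same data: words swapped within each 64-bit half.
be_matrix = [vrev64_32(row) for row in le_matrix]

# One vrev64.32 per register brings the BE view back to the LE layout.
fixed = [vrev64_32(row) for row in be_matrix]

if __name__ == "__main__":
    print(be_matrix)
    print(fixed == le_matrix)
```

Since the operation is its own inverse, the same instruction would serve before the final store, at a cost of one extra instruction per register on each side.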
Regards, /Niels