nisse@lysator.liu.se (Niels Möller) writes:
So now when I understand the algorithm better, I guess I'll have to rewrite the x86_64 implementation to reduce the number of passes there too.
I've rewritten it now. It gives only a slight speedup over the C code, and runs at roughly 16.5 cycles per byte, compared to 8 for sha1, 18 for sha256, and 12 for sha512. There surely are micro-optimizations left to do with all the juggling of halves of xmm registers, but I doubt one can get down below 10 cycles per byte with the current organization of the code.
The steps which seem hardest to fit with 128-bit instructions are the permutation and rotation steps, which appear to want to treat 64-bit words individually.
For rotations, the problem is that we need to rotate the two halves of an xmm register with different rotation counts. And then there are no rotate instructions, only shifts. So to rotate the two halves of %xmm0 by one and two bits (and not even trying to combine them back into the same register), one needs to do something like
	movdqa	%xmm0, %xmm1
	movdqa	%xmm0, %xmm2
	movdqa	%xmm0, %xmm3
	psllq	$1, %xmm0	# left-shifted by 1
	psrlq	$63, %xmm1	# right-shifted by 64-1
	psllq	$2, %xmm2	# left-shifted by 2
	psrlq	$62, %xmm3	# right-shifted by 64-2
	por	%xmm1, %xmm0	# both halves rotated left by 1
	por	%xmm3, %xmm2	# both halves rotated left by 2
which is an awful lot of instructions. Is there any simpler way to do it?
And the permutations are also quite cumbersome with 128-bit registers, using punpcklqdq and punpckhqdq to move 64-bit quantities between registers, and pshufd for moving data around within a register.
Maybe it would be better to copy data back and forth to 64-bit registers, but I seem to vaguely recall that moves between regular registers and xmm registers are slow; is that so? Copy + rotate + copy would be
	movq	%xmm0, %rax
	psrldq	$8, %xmm0	# bring the high half down
	rolq	$1, %rax
	movq	%xmm0, %rdx
	rolq	$2, %rdx
	movq	%rax, %xmm0
	movq	%rdx, %xmm1
	punpcklqdq %xmm1, %xmm0
for the same two rotations. One instruction less, but then the permutation can be done for free if applied before copying the data back to xmm registers. Is there some more efficient way to copy data back and forth between one xmm register and a pair of regular 64-bit registers?
One non-obvious trick in the current code which I believe is useful is to also transpose the matrix while permuting it. There are two benefits:
+ The resulting permutation (sha3 "pi" step + transpose) has several short cycles rather than a single long cycle, so it can be done in-place with fewer temporaries.
+ The next step, the nonlinear "chi" bit fiddling, gets the data in suitable form for SIMD operation, since it operates independently on each matrix row.
And one obvious drawback:
- The matrix has to be transposed again at the end of the iteration.
Regards, /Niels