I've spent the evening optimizing sha3. I've written a first basic x86_64 implementation of sha3_permute, using 128-bit sse2 operations where practical, which gave a 30% speedup. And after working with the algorithm for a bit, I realized that it's fairly easy to reduce the number of passes each round makes over the data. I tried that in the C implementation, and it gave a 20% speedup there.
So now that I understand the algorithm better, I guess I'll have to rewrite the x86_64 implementation to reduce the number of passes there too.
Still pretty slow, with sha3-256 at 32 cycles per byte at best, which is far behind the claimed performance of 6-7 cycles/byte on x86_64 (close to the performance of sha1). I guess I should also have a look at that code (or does anybody on the list know which tricks are relevant?).
The internal state is a 5x5 matrix of 64-bit words. If represented using 128-bit xmm registers, each row fits in two xmm registers and one regular 64-bit register. The steps which seem hardest to fit with 128-bit instructions are the permutation and rotation steps, which appear to want to treat 64-bit words individually.
The current code (both C and assembly) keeps the state in memory, using double buffering, and doesn't try to fit more than one row at a time in registers. Keeping more of the state in registers is highly desirable.
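For concreteness, here is a minimal C sketch of the row layout described above (struct and field names are just illustrative, not from the actual code):

  #include <emmintrin.h>
  #include <stdint.h>

  /* One row of the 5x5 state: words x = 0..3 packed pairwise into
     two 128-bit values, and word x = 4 in a plain 64-bit word. */
  struct sha3_row {
    __m128i w01;     /* words 0 and 1 */
    __m128i w23;     /* words 2 and 3 */
    uint64_t w4;     /* word 4 */
  };

  /* The full state: five such rows. */
  struct sha3_state {
    struct sha3_row row[5];
  };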
See http://git.lysator.liu.se/nettle/nettle/blobs/master/sha3-permute.c and http://git.lysator.liu.se/nettle/nettle/blobs/master/x86_64/sha3-permute.asm
Regards, /Niels
nisse@lysator.liu.se (Niels Möller) writes:
> So now that I understand the algorithm better, I guess I'll have to
> rewrite the x86_64 implementation to reduce the number of passes
> there too.
I've rewritten it now. It gives only a slight speedup over the C code, and runs at roughly 16.5 cycles per byte, compared to 8 for sha1, 18 for sha256, and 12 for sha512. There surely are micro-optimizations left to do with all the juggling around of halves of xmm registers, but I doubt one can get down below 10 cycles per byte with the current organization of the code.
> The steps which seem hardest to fit with 128-bit instructions are the
> permutation and rotation steps, which appear to want to treat 64-bit
> words individually.
For rotations, the problem is that we need to rotate the two halves of an xmm register with different rotation counts. And then there are no rotate instructions, only shifts. So to rotate the two halves of %xmm0 by one and two bits (and not even trying to combine them back into the same register), one needs to do something like
  movdqa  %xmm0, %xmm1
  movdqa  %xmm0, %xmm2
  movdqa  %xmm0, %xmm3
  psllq   $1, %xmm0
  psrlq   $63, %xmm1
  psllq   $2, %xmm2
  psrlq   $62, %xmm3
  por     %xmm1, %xmm0
  por     %xmm3, %xmm2
which is an awful lot of instructions. Is there any simpler way to do it?
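For reference, the same thing expressed with C intrinsics (just a sketch mirroring the assembly above, not benchmarked):

  #include <emmintrin.h>

  /* Rotate both 64-bit halves of x left by 1 bit (into *r1) and by
     2 bits (into *r2).  SSE2 has no packed 64-bit rotate, so each
     rotation costs a pair of shifts plus an or. */
  static inline void
  rotl_1_2(__m128i x, __m128i *r1, __m128i *r2)
  {
    *r1 = _mm_or_si128(_mm_slli_epi64(x, 1), _mm_srli_epi64(x, 63));
    *r2 = _mm_or_si128(_mm_slli_epi64(x, 2), _mm_srli_epi64(x, 62));
  }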
And the permutations are also quite cumbersome with 128-bit registers, using punpcklqdq and punpckhqdq to move 64-bit quantities between registers, and pshufd for moving data around within a register.
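In intrinsics terms, the available primitives look something like this (helper names are mine):

  #include <emmintrin.h>

  /* Swap the two 64-bit halves of a register (pshufd $0x4e). */
  static inline __m128i
  swap_halves(__m128i x)
  {
    return _mm_shuffle_epi32(x, 0x4e);
  }

  /* Pack the low halves of a and b into one register (punpcklqdq). */
  static inline __m128i
  pack_low(__m128i a, __m128i b)
  {
    return _mm_unpacklo_epi64(a, b);
  }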
Maybe it would be better to copy data back and forth to 64-bit registers, but I vaguely recall moves between regular registers and xmm registers being slow; is that so? Copy + rotate + copy would be
  movq       %xmm0, %rax
  psrldq     $8, %xmm0
  rolq       $1, %rax
  movq       %xmm0, %rdx
  rolq       $2, %rdx
  movq       %rax, %xmm0
  movq       %rdx, %xmm1
  punpcklqdq %xmm1, %xmm0
for the same two rotations. One instruction less, but then the permutation can be done for free if applied before copying the data back to xmm registers. Is there some more efficient way to copy data back and forth between one xmm register and a pair of regular 64-bit registers?
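Expressed with intrinsics (again just a sketch, not benchmarked), the round trip would be something like:

  #include <emmintrin.h>
  #include <stdint.h>

  static inline uint64_t
  rotl64(uint64_t x, unsigned c)
  {
    return (x << c) | (x >> (64 - c));
  }

  /* Pull both halves out to integer registers, rotate there, and
     repack.  The permutation could be folded into the repacking. */
  static inline __m128i
  rotl_1_2_via_gpr(__m128i x)
  {
    uint64_t lo = (uint64_t) _mm_cvtsi128_si64(x);
    uint64_t hi = (uint64_t) _mm_cvtsi128_si64(_mm_srli_si128(x, 8));
    return _mm_unpacklo_epi64(_mm_cvtsi64_si128((int64_t) rotl64(lo, 1)),
                              _mm_cvtsi64_si128((int64_t) rotl64(hi, 2)));
  }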
One non-obvious trick in the current code which I believe is useful is to also transpose the matrix while permuting it. There are two benefits:
+ The resulting permutation (sha3 "pi" step + transpose) has several short cycles rather than a single long cycle, so it can be done in-place with fewer temporaries (a small sketch after this list makes the cycle structure concrete).
+ The next step, the nonlinear "chi" bit fiddling, gets the data in suitable form for SIMD operation, since it operates independently on each matrix row.
And one obvious drawback:
- The matrix has to be transposed again at the end of the iteration.
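To make the cycle claim concrete, here is a throwaway C program (not from the actual code). It assumes the convention where pi maps A[x][y] to A'[y][(2x+3y) mod 5], so pi followed by a transpose sends position (x,y) to ((2x+3y) mod 5, y). It prints the cycle decomposition, which comes out as 5 fixed points plus 5 cycles of length 4, whereas pi alone moves the 24 positions outside (0,0) in a single long cycle.

  #include <stdio.h>

  /* Print the cycle decomposition of pi composed with a transpose,
     i.e. (x,y) -> ((2*x + 3*y) % 5, y).  Note that y is invariant,
     so each cycle stays within one value of y. */
  int
  main(void)
  {
    int seen[5][5] = {{0}};
    int x, y;
    for (y = 0; y < 5; y++)
      for (x = 0; x < 5; x++)
        {
          int cx;
          if (seen[x][y])
            continue;
          printf("(%d,%d)", x, y);
          seen[x][y] = 1;
          cx = (2*x + 3*y) % 5;
          while (cx != x)
            {
              printf(" -> (%d,%d)", cx, y);
              seen[cx][y] = 1;
              cx = (2*cx + 3*y) % 5;
            }
          printf("\n");
        }
    return 0;
  }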
Regards, /Niels
On 12/13/2012 10:31 AM, Niels Möller wrote:
> Maybe it would be better to copy data back and forth to 64-bit
> registers, but I vaguely recall moves between regular registers and
> xmm registers being slow; is that so?
I don't think you can ever know, given the large number of architectures and designs. If you remember, in memxor, using the SSE2 registers gave a 10x boost on Intel processors but ran at 0.9x the original speed on the AMD ones.
I'd say just try it on some stock processors and see what's best on average (I can test code on an i7 if you need).
regards, Nikos