I've spent the evening on optimizing sha3. I've written a first basic x86_64 implementation of sha3_permute, using 128-bit sse2 operations where practical, which gave a 30% speedup. And after working with the algorithm for a bit, I realized that it's fairly easy to reduce the number of passes each round makes over the data. So I tried that in the C implementation, and it gave a 20% speedup there.
So now when I understand the algorithm better, I guess I'll have to rewrite the x86_64 implementation to reduce the number of passes there too.
Still pretty slow, with sha3-256 at 32 cycles per byte at best. Which is far behind the claimed performance of 6-7 cycles/byte on x86_64 (close to the performance of sha1). I guess I should also have a look at that code (or does aybody on the list know which tricks are relevant?).
The internal state is a 5x5 matrix of 64-bit words. If represented using 128-bit xmm registers, each row fits in two xmm registers and one regular 64-bit register. The steps which seem hardest to fit with 128-bit instructions is the permutation and rotation steps, which appear to want to treat 64-bit words individually.
The current code (both C and assembly) keeps the state in memory, using double buffering, and doesn't try to fit more than one row at a time in registers. Keeping more of the state in registers is highly desirable.
See http://git.lysator.liu.se/nettle/nettle/blobs/master/sha3-permute.c and http://git.lysator.liu.se/nettle/nettle/blobs/master/x86_64/sha3-permute.asm
Regards, /Niels