I'm looking at the skein code I wrote a while ago, and I might merge it pretty soon; it just needs a little cleanup. I'm doing skein256 and skein512 (other variants, in particular skein512-256, may be of interest).
For skein256, it works quite well with two-way unrolling (by which I mean that each loop iteration performs 8 mixing rounds, adding in subkeys twice).
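To make the loop structure concrete, here is a minimal sketch of the skein256 block cipher core (threefish-256) with that kind of two-way unrolling. It is an illustration under assumptions, not the code to be merged: the function names are made up, the rotation counts are the ones from version 1.3 of the skein specification and should be checked against it, and the indexing uses plain % for clarity rather than speed. The caller is assumed to have filled in k[4] (the xor of the key words and the constant) and t[2] (the xor of the two tweak words). The second function does one round at a time with the explicit word permutation, so the two can be compared.

```c
#include <stdint.h>

#define ROTL64(x, n) (((x) << (n)) | ((x) >> (64 - (n))))

/* MIX: a += b, then b = rotl(b, r) ^ a. */
#define MIX(a, b, r) do { (a) += (b); (b) = ROTL64((b), (r)) ^ (a); } while (0)

/* Add subkey number s. k holds the 4 key words plus the extra xor word,
   t holds the 2 tweak words plus their xor. */
static void
add_subkey(uint64_t x[4], const uint64_t k[5], const uint64_t t[3], unsigned s)
{
  x[0] += k[s % 5];
  x[1] += k[(s + 1) % 5] + t[s % 3];
  x[2] += k[(s + 2) % 5] + t[(s + 1) % 3];
  x[3] += k[(s + 3) % 5] + s;
}

/* Two-way unrolled: each iteration does 8 rounds and adds two subkeys.
   The word permutation is folded into the choice of operands. */
void
threefish256_unrolled(uint64_t x[4], const uint64_t k[5], const uint64_t t[3])
{
  unsigned s;
  for (s = 0; s < 18; s += 2)
    {
      add_subkey(x, k, t, s);
      MIX(x[0], x[1], 14); MIX(x[2], x[3], 16);
      MIX(x[0], x[3], 52); MIX(x[2], x[1], 57);
      MIX(x[0], x[1], 23); MIX(x[2], x[3], 40);
      MIX(x[0], x[3],  5); MIX(x[2], x[1], 37);

      add_subkey(x, k, t, s + 1);
      MIX(x[0], x[1], 25); MIX(x[2], x[3], 33);
      MIX(x[0], x[3], 46); MIX(x[2], x[1], 12);
      MIX(x[0], x[1], 58); MIX(x[2], x[3], 22);
      MIX(x[0], x[3], 32); MIX(x[2], x[1], 32);
    }
  add_subkey(x, k, t, 18);  /* final subkey after round 72 */
}

/* Plain version: one round per iteration, swapping words 1 and 3
   (the permutation 0, 3, 2, 1) after each round. */
void
threefish256_simple(uint64_t x[4], const uint64_t k[5], const uint64_t t[3])
{
  static const unsigned char rot[8][2] = {
    {14, 16}, {52, 57}, {23, 40}, { 5, 37},
    {25, 33}, {46, 12}, {58, 22}, {32, 32},
  };
  unsigned d;
  for (d = 0; d < 72; d++)
    {
      uint64_t tmp;
      if (d % 4 == 0)
        add_subkey(x, k, t, d / 4);
      MIX(x[0], x[1], rot[d % 8][0]);
      MIX(x[2], x[3], rot[d % 8][1]);
      tmp = x[1]; x[1] = x[3]; x[3] = tmp;
    }
  add_subkey(x, k, t, 18);
}
```

Since the rotation counts repeat with period 8 and subkeys are added every 4 rounds, 8 rounds is the smallest loop body where the rotation counts are constants and the word permutation cancels out.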
I did an x86_64 assembly version, but I haven't been able to beat the C code compiled by gcc, so I think I'll scrap that. That isn't so surprising, since skein uses only operations that the C compiler knows well, and I'm not trying anything clever with scheduling or register allocation.
For skein512, subkeys are accessed with mod 9 indexing, which is challenging to do with high performance if indices need to be constructed at runtime. I get pretty good performance with full unrolling (so indices are constant), and I'm afraid we have to do either that, or copy subkeys into an area where they are repeated multiple times.
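As a minimal sketch of the second alternative (hypothetical names, not benchmarked): skein512 does 72 rounds with a subkey added every 4 rounds plus one at the end, so subkey numbers run from 0 to 18, and subkey number s uses key words (s + i) mod 9 for i = 0..7. If the 9 key words are repeated into a 26-word area, those are just words s .. s+7 of that area, with no modulo operation. The tweak words (mod 3 indexing) and the subkey counter are left out here.

```c
#include <stdint.h>

#define SKEIN512_NKEYS 9       /* 8 key words plus the extra xor word */
#define SKEIN512_NSUBKEYS 19   /* subkey numbers 0..18 */
#define SKEIN512_EXPANDED (SKEIN512_NSUBKEYS - 1 + 8)  /* 26 words */

/* Repeat the 9 key words, so that kx[j] == k[j mod 9] for j < 26. */
void
skein512_expand_keys(uint64_t kx[SKEIN512_EXPANDED],
                     const uint64_t k[SKEIN512_NKEYS])
{
  unsigned j;
  for (j = 0; j < SKEIN512_NKEYS; j++)
    kx[j] = k[j];
  for (; j < SKEIN512_EXPANDED; j++)
    kx[j] = kx[j - SKEIN512_NKEYS];
}

/* Add the key words of subkey number s, using linear indexing only. */
void
skein512_add_key_words(uint64_t x[8], const uint64_t *kx, unsigned s)
{
  unsigned i;
  for (i = 0; i < 8; i++)
    x[i] += kx[s + i];
}
```

The cost is the copy, which has to be redone for every block since the key words are the chaining state, and 208 bytes of scratch space instead of 72.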
As I think I wrote earlier, skein looks similar in spirit to salsa20 and chacha, but unlike those, it doesn't fit well with simd instructions. To use simd instructions for skein, one would like to put two 64-bit values in one xmm register, but one then needs a way to rotate the two halves with different shift counts, which I haven't found any good way to do. Also, the odd number of subkeys (five for skein256, and nine for skein512, with the last subkey being the xor of all the other keys and a magic constant), used in a rotating fashion, doesn't fit well with storing keys in simd registers.
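For reference, the extra key word can be sketched as below. The function name is made up for illustration, and the constant is the one given in version 1.3 of the skein specification (earlier versions of the specification used a different value), so check it against the version being implemented.

```c
#include <stddef.h>
#include <stdint.h>

/* Key schedule constant from version 1.3 of the skein specification. */
#define SKEIN_C240 0x1BD11BDAA9FC1A22ULL

/* The extra key word: the xor of the n key words and the constant.
   n is 4 for skein256 and 8 for skein512. */
uint64_t
skein_extra_key_word(const uint64_t *key, size_t n)
{
  uint64_t w = SKEIN_C240;
  size_t i;
  for (i = 0; i < n; i++)
    w ^= key[i];
  return w;
}
```

With this word appended, skein256 has 5 words and skein512 has 9, which is where the mod 5 and mod 9 indexing comes from.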
I'm benchmarking on an intel broadwell cpu (marketing name "core i3-5010U") running at 2.1 GHz.
  Algorithm     mode    Mbyte/s  cycles/byte  cycles/block
  skein256      update   242.45         8.26        264.33
  skein512      update   350.44         5.71        365.75
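The three columns are related by the clock frequency and the block size. The sketch below is inferred from the numbers rather than taken from the benchmark program: it assumes that Mbyte means 2^20 bytes, that the clock is the 2.1 GHz stated above, and that the block size is 32 bytes for skein256 and 64 bytes for skein512. The function names are made up.

```c
/* Cycles per byte, from throughput in Mbyte/s (2^20 bytes) and clock in Hz. */
double
cycles_per_byte(double mbyte_per_s, double clock_hz)
{
  return clock_hz / (mbyte_per_s * 1048576.0);
}

/* Cycles per block, for a block size given in bytes. */
double
cycles_per_block(double mbyte_per_s, double clock_hz, unsigned block_size)
{
  return cycles_per_byte(mbyte_per_s, clock_hz) * block_size;
}
```

For example, 242.45 Mbyte/s at 2.1 GHz gives 2.1e9 / (242.45 * 1048576), about 8.26 cycles/byte, and times 32 that is about 264.3 cycles per skein256 block.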
For comparison, timing for sha1, sha2 and sha3:
  Algorithm     mode    Mbyte/s  cycles/byte  cycles/block
  sha1          update   326.40         6.14        392.69
  openssl sha1  update   560.98         3.57        228.48
  sha256        update   156.07        12.83        821.24
  sha512        update   252.54         7.93       1015.10
  sha3_224      update   161.90        12.37       1781.33
  sha3_256      update   152.83        13.10       1782.18
  sha3_384      update   117.06        17.11       1779.22
  sha3_512      update    80.52        24.87       1790.77
So skein512 is faster than both sha2 and sha3 (and one can also see that for sha1 we currently lose to openssl). skein256 is faster than sha3, but slightly slower than sha512. So maybe we shouldn't do skein256 at all, but skein512-256 instead (skein can be used with arbitrary output size).
Code size is 408 bytes for skein256, and 3992 bytes for skein512 (which is completely unrolled), counting only the main block processing function.
Regards,
/Niels