I'm looking at the skein code I wrote a while ago, and I might merge it
pretty soon, just need a little cleanup. I'm doing skein256 and skein512
(other variants, in particular skein512-256, may be of interest).
For skein256, it works quite well with two-way unrolling (by which I
mean that each loop iteration performs 8 mixing rounds, adding in
subkeys twice).
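
To illustrate the two-way unrolling, here is a minimal C sketch of the
Threefish-256 block function that skein256 is built on. It is not the
code discussed in this mail: the name threefish256_sketch and the macros
are made up for the example, and the rotation counts and the key
schedule constant are believed to match the Skein 1.3 specification but
should be verified against it. The word permutation after each round is
absorbed by swapping the roles of x[1] and x[3] in alternate rounds, so
no data is moved.

```c
#include <stdint.h>

#define ROTL64(n, x) (((x) << (n)) | ((x) >> (64 - (n))))

/* Key schedule parity constant (assumed to be the Skein 1.3 value). */
#define C240 UINT64_C(0x1BD11BDAA9FC1A22)

/* Two MIX operations, i.e. one round on a four-word state. */
#define MIX2(a, b, c, d, r0, r1) do {   \
    a += b; b = ROTL64(r0, b) ^ a;      \
    c += d; d = ROTL64(r1, d) ^ c;      \
  } while (0)

/* Add subkey number s: rotating key words, tweak words and counter. */
#define ADD_SUBKEY(x, k, t, s) do {                 \
    x[0] += k[(s) % 5];                             \
    x[1] += k[((s) + 1) % 5] + t[(s) % 3];          \
    x[2] += k[((s) + 2) % 5] + t[((s) + 1) % 3];    \
    x[3] += k[((s) + 3) % 5] + (s);                 \
  } while (0)

void
threefish256_sketch(uint64_t dst[4], const uint64_t key[4],
                    const uint64_t tweak[2], const uint64_t src[4])
{
  uint64_t k[5], t[3], x[4];
  unsigned i;

  /* Fifth key word: xor of the other four and the constant. */
  k[4] = C240;
  for (i = 0; i < 4; i++)
    {
      k[i] = key[i];
      k[4] ^= key[i];
      x[i] = src[i];
    }
  t[0] = tweak[0];
  t[1] = tweak[1];
  t[2] = tweak[0] ^ tweak[1];

  /* 9 iterations of 8 rounds each, two subkey additions per iteration. */
  for (i = 0; i < 18; i += 2)
    {
      ADD_SUBKEY(x, k, t, i);
      MIX2(x[0], x[1], x[2], x[3], 14, 16);
      MIX2(x[0], x[3], x[2], x[1], 52, 57);
      MIX2(x[0], x[1], x[2], x[3], 23, 40);
      MIX2(x[0], x[3], x[2], x[1],  5, 37);
      ADD_SUBKEY(x, k, t, i + 1);
      MIX2(x[0], x[1], x[2], x[3], 25, 33);
      MIX2(x[0], x[3], x[2], x[1], 46, 12);
      MIX2(x[0], x[1], x[2], x[3], 58, 22);
      MIX2(x[0], x[3], x[2], x[1], 32, 32);
    }
  /* Final subkey after round 72. */
  ADD_SUBKEY(x, k, t, 18);

  for (i = 0; i < 4; i++)
    dst[i] = x[i];
}
```

The rotation schedule repeats with period 8, which is why 8 rounds per
iteration lets all rotation counts be compile-time constants. The mod 5
and mod 3 indexing is left in the loop here for brevity.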
I did an x86_64 assembly version, but I haven't been able to beat C code
compiled by gcc, so I think I'll scrap that. Which isn't so surprising,
since skein uses only operations that the C compiler knows well, and I'm
not trying anything clever with scheduling or register allocation.
For skein512, subkeys are accessed with mod 9 indexing, which is
challenging to do with high performance if indices need to be
constructed at runtime. I get pretty good performance with full
unrolling (so indices are constant), and I'm afraid we have to do either
that, or copy subkeys into an area where they are repeated multiple
times.
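
The second alternative, copying subkeys into an area where they are
repeated, could look roughly like the C sketch below. The names
skein512_expand_key_sketch and skein512_add_key_words are hypothetical,
the constant is assumed to be the Skein 1.3 value, and the tweak and
subkey counter additions are left out to keep the focus on the indexing.
With 19 subkeys of 8 words each, a table of 26 words is enough for word
i of subkey s to sit at index s + i, with no modulo at lookup time.

```c
#include <stdint.h>

/* Key schedule parity constant (assumed to be the Skein 1.3 value). */
#define C240 UINT64_C(0x1BD11BDAA9FC1A22)

/* 72 rounds with a subkey every 4 rounds, plus the final subkey. */
#define SKEIN512_SUBKEYS 19
#define SKEIN512_EXPANDED_SIZE (SKEIN512_SUBKEYS + 7)

/* Fill ek so that ek[s + i] equals key word (s + i) mod 9. The mod 9
   is paid once per expansion instead of once per subkey word. */
void
skein512_expand_key_sketch(uint64_t ek[SKEIN512_EXPANDED_SIZE],
                           const uint64_t key[8])
{
  uint64_t k[9];
  unsigned i;

  /* Ninth key word: xor of the other eight and the constant. */
  k[8] = C240;
  for (i = 0; i < 8; i++)
    {
      k[i] = key[i];
      k[8] ^= key[i];
    }
  for (i = 0; i < SKEIN512_EXPANDED_SIZE; i++)
    ek[i] = k[i % 9];
}

/* Add the key words of subkey s to the state, using plain indexing.
   Tweak words and the subkey counter are not handled here. */
void
skein512_add_key_words(uint64_t x[8],
                       const uint64_t ek[SKEIN512_EXPANDED_SIZE],
                       unsigned s)
{
  unsigned i;
  for (i = 0; i < 8; i++)
    x[i] += ek[s + i];
}
```

The cost is that the expansion has to be redone whenever the key
changes, which for a hash function means once per block.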
As I think I wrote earlier, skein looks similar in spirit to salsa20 and
chacha, but unlike those, it doesn't fit well with simd instructions.
To use simd instructions for skein, one would like to put two 64-bit
values in one xmm register, but one then needs a way to rotate the two
halves with different shift counts, which I haven't found any good way
to do. Also the odd number of subkeys (five for skein256, and nine for
skein512, with the last subkey being the xor of all the other keys and a
magic constant), used in a rotating fashion, doesn't fit well with
storing keys in simd registers.
I'm benchmarking on an intel broadwell cpu (marketing name "core
i3-5010U") running at 2.1 GHz.
Algorithm       mode     Mbyte/s  cycles/byte  cycles/block
skein256        update    242.45         8.26        264.33
skein512        update    350.44         5.71        365.75
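
The three numeric columns appear to be related by simple arithmetic,
assuming a fixed 2.1 GHz clock and that Mbyte means 2^20 bytes (an
assumption, but one that reproduces the figures above). A small C
sketch, with made-up function names:

```c
/* Cycles spent per input byte, given clock frequency in Hz and
   throughput in Mbyte/s (1 Mbyte taken as 2^20 bytes, an assumption). */
double
cycles_per_byte(double clock_hz, double mbyte_per_s)
{
  return clock_hz / (mbyte_per_s * 1048576.0);
}

/* Cycles per block, for a block size in bytes (32 for skein256,
   64 for skein512). */
double
cycles_per_block(double clock_hz, double mbyte_per_s, unsigned block_size)
{
  return cycles_per_byte(clock_hz, mbyte_per_s) * block_size;
}
```

For example, cycles_per_block(2.1e9, 350.44, 64) gives roughly 365.75,
matching the skein512 row.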
For comparison, timing for sha1, sha2 and sha3:
Algorithm       mode     Mbyte/s  cycles/byte  cycles/block
sha1            update    326.40         6.14        392.69
openssl sha1    update    560.98         3.57        228.48
sha256          update    156.07        12.83        821.24
sha512          update    252.54         7.93       1015.10
sha3_224        update    161.90        12.37       1781.33
sha3_256        update    152.83        13.10       1782.18
sha3_384        update    117.06        17.11       1779.22
sha3_512        update     80.52        24.87       1790.77
So skein512 is faster than both sha2 and sha3 (and one can also see that
for sha1 we currently lose to openssl). skein256 is faster than sha3,
but slightly slower than sha512. So maybe we shouldn't do skein256 at
all, but skein512-256 (skein can be used with arbitrary output size).
Code size is 408 bytes for skein256, and 3992 bytes for skein512
(which is completely unrolled), counting only the main block processing
function.
Regards,
/Niels
--
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.