sha3 optimization

3 Dec 2012


      I've spent the evening on optimizing sha3. I've written a first basic
x86_64 implementation of sha3_permute, using 128-bit sse2 operations
where practical, which gave a 30% speedup. And after working with the
algorithm for a bit, I realized that it's fairly easy to reduce the
number of passes each round makes over the data. So I tried that in the
C implementation, and it gave a 20% speedup there.
So now when I understand the algorithm better, I guess I'll have to
rewrite the x86_64 implementation to reduce the number of passes there
too.
Still pretty slow, with sha3-256 at 32 cycles per byte at best. Which is
far behind the claimed performance of 6-7 cycles/byte on x86_64 (close
to the performance of sha1). I guess I should also have a look at that
code (or does aybody on the list know which tricks are relevant?).
The internal state is a 5x5 matrix of 64-bit words. If represented using
128-bit xmm registers, each row fits in two xmm registers and one
regular 64-bit register. The steps which seem hardest to fit with
128-bit instructions is the permutation and rotation steps, which appear
to want to treat 64-bit words individually.
The current code (both C and assembly) keeps the state in memory, using
double buffering, and doesn't try to fit more than one row at a time in
registers. Keeping more of the state in registers is highly desirable.
See http://git.lysator.liu.se/nettle/nettle/blobs/master/sha3-permute.c
and
http://git.lysator.liu.se/nettle/nettle/blobs/master/x86_64/sha3-permute.asm
Regards,
/Niels
-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

2005

2004

2003

2002

sha3 optimization