nisse@lysator.liu.se (Niels Möller) writes:
So now when I understand the algorithm better, I guess I'll have to rewrite the x86_64 implementation to reduce the number of passes there too.
I've rewritten it now. It gives only a slight speedup over the C code, and runs at roughly 16.5 cycles per byte, compared to 8 for sha1, 18 for sha256, and 12 for sha512. There surely are micro-optimizations left to do with all the juggling of halves of xmm registers, but I doubt one can get down below 10 cycles per byte with the current organization of the code.
The steps which seem hardest to fit with 128-bit instructions are the permutation and rotation steps, which appear to want to treat 64-bit words individually.
For rotations, the problem is that we need to rotate the two halves of an xmm register with different rotation counts. And then there are no rotate instructions, only shifts. So to rotate the two halves of %xmm0 by one and two bits (and not even trying to combine them back into the same register), one needs to do something like
	movdqa	%xmm0, %xmm1
	movdqa	%xmm0, %xmm2
	movdqa	%xmm0, %xmm3
	psllq	$1, %xmm0	# left-shifted by 1
	psrlq	$63, %xmm1	# right-shifted by 64-1
	psllq	$2, %xmm2	# left-shifted by 2
	psrlq	$62, %xmm3	# right-shifted by 64-2
	por	%xmm1, %xmm0	# both halves rotated left by 1
	por	%xmm3, %xmm2	# both halves rotated left by 2
which is an awful lot of instructions. Is there any simpler way to do it?
And the permutations are also quite cumbersome with 128-bit registers, using punpcklqdq and punpckhqdq to move 64-bit quantities between registers, and pshufd for moving data around within a register.
Maybe it would be better to copy data back and forth to 64-bit registers, but I seem to vaguely recall that moves between regular registers and xmm registers are slow; is that so? Copy + rotate + copy would be
	movq	%xmm0, %rax
	psrldq	$8, %xmm0	# bring the high half down
	rolq	$1, %rax
	movq	%xmm0, %rdx
	rolq	$2, %rdx
	movq	%rax, %xmm0
	movq	%rdx, %xmm1
	punpcklqdq %xmm1, %xmm0
for the same two rotations. One instruction less, but then the permutation can be done for free if applied before copying the data back to xmm registers. Is there some more efficient way to copy data back and forth between one xmm register and a pair of regular 64-bit registers?
One non-obvious trick in the current code which I believe is useful is to also transpose the matrix while permuting it. There are two benefits:
+ The resulting permutation (sha3 "pi" step + transpose) has several short cycles rather than a single long cycle, so it can be done in-place with fewer temporaries.
+ The next step, the nonlinear "chi" bit fiddling, gets the data in suitable form for SIMD operation, since it operates independently on each matrix row.
And one obvious drawback:
- The matrix has to be transposed again at the end of the iteration.
Regards, /Niels