Re: Salsa20

31 Mar 2012


      nisse@lysator.liu.se (Niels Möller) writes:
...
And now I've added support for salsa20 benchmarking as well. Performance
is pretty good, even for this reference implementation. Seems to run at
12 cycles per byte on my laptop, a little faster than 128-bit aes.
Now I've done some cleanup and reorganization work. Using memxor3 for
the actual xoring saves almost 3 cycles per byte, so it's a considerable
speedup in this case (and then memxor3 can maybe be sped up a bit more
on this platform, from Nikos' work a while ago).
I wonder if there's any potential for simd-style parallel processing
using plain 64-bit operations (no sse2 assembly). If one puts two 32-bit
values in a 64-bit variable, parallel xor is trivial, but we also need
32-bit rotations and adds.
Parallel rotations are possible as two shifts and some extra masking, so
it might be slightly faster than two 32-bit rotations of two shifts
each.
For parallel mod 2^32 additions, one need to cancel one of the carries.
Something like
uint64_t x, y, z;
z = x + y;
  z += (- (uint64_t) ((uint32_t) z < (uint32_t) y)) << 32;
This can hardly be faster than two 32-bit additions (two lea, cmp, cmov
on x86_64, or alternatively, one lea, cmp, sbb, shl, add, depending on
code genereration).
Any speedup depends on the relative speedup or slowdown of each
primitive, the cost of permuting values as needed, and then the effects
of reduced register pressure on top of that.
For x86_64 (and maybe x86_32), assembly implementation also seems
attractive, but I'd like to do any general or 64-bit specific
optimizations in the C code first.
Regards,
/Niels
-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

2005

2004

2003

2002

Re: Salsa20