Re: [Arm64, PowerPC64, S390x] Optimize Poly1305

19 Jan 2022


Maamoun TK maamoun.tk@googlemail.com writes:
...
> The patches give a 41.88% speedup on arm64, a 142.95% speedup on
> powerpc64, and a 382.65% speedup on s390x.
> OpenSSL is still ahead in performance, since it uses 4-way interleaving
> or possibly more.
> Going beyond two interleaving ways has nothing to do with parallelism,
> since the execution units are already saturated by 2-way interleaving on
> all three architectures. The performance improvement comes from running
> the reduction procedure half as often with 4-way interleaving, since the
> products of the state parts and the key can be combined before the
> reduction phase. Let me know if you are interested in doing that in
> nettle!
Interesting. I haven't paid much attention to the poly1305
implementation since it was added back in 2013. The C implementation
doesn't try to use wider multiplication than 32x32 --> 64, which is poor
for 64-bit platforms. Maybe we could use unsigned __int128 if we can
write a configure test to check if it is available and likely to be
efficient?
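
To make the unsigned __int128 idea concrete, here is a minimal sketch of a
64x64 --> 128 multiply with a portable fallback built from 32x32 --> 64
products. The names u128_parts, mul64_portable and mul64_wide are
hypothetical and not taken from nettle, and the check on the
__SIZEOF_INT128__ macro (predefined by GCC and Clang when the type exists,
as far as I know) only stands in for a real configure test. It says
nothing about whether the type is efficient on a given target.

```c
#include <stdint.h>

/* Hypothetical result type: high and low 64-bit halves of a product. */
typedef struct { uint64_t hi, lo; } u128_parts;

/* Portable path: four 32x32 -> 64 partial products, schoolbook style. */
static u128_parts
mul64_portable(uint64_t a, uint64_t b)
{
  uint64_t al = a & 0xffffffff, ah = a >> 32;
  uint64_t bl = b & 0xffffffff, bh = b >> 32;

  uint64_t ll = al * bl;
  uint64_t lh = al * bh;
  uint64_t hl = ah * bl;
  uint64_t hh = ah * bh;

  /* Middle column: three terms below 2^32 each, so no 64-bit overflow. */
  uint64_t mid = (ll >> 32) + (lh & 0xffffffff) + (hl & 0xffffffff);

  u128_parts r;
  r.lo = (mid << 32) | (ll & 0xffffffff);
  r.hi = hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
  return r;
}

#if defined(__SIZEOF_INT128__)
/* Wide path: let the compiler emit the native widening multiply. */
static u128_parts
mul64_wide(uint64_t a, uint64_t b)
{
  unsigned __int128 p = (unsigned __int128) a * b;
  u128_parts r;
  r.hi = (uint64_t) (p >> 64);
  r.lo = (uint64_t) p;
  return r;
}
#else
/* Assumption: without the type, fall back to the portable path. */
#define mul64_wide mul64_portable
#endif
```

Both functions should return identical halves for any inputs, so the
portable one can serve as a reference when testing the wide one.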
For the most efficient interleaving, I take it one should precompute some
powers of the key, similar to how it's done in the recent gcm code?
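
As a rough illustration of why precomputed key powers halve (or better)
the number of reductions, the sketch below evaluates the same polynomial
two ways. It is a toy: it works modulo 2^61 - 1 rather than the real
Poly1305 prime 2^130 - 5, takes blocks below 2^60, requires
unsigned __int128, and all names (TOY_P, toy_reduce, toy_poly_serial,
toy_poly_4way) are made up for this example. The serial loop reduces
after every block; the 4-way loop uses the identity

  ((((h + m0) r + m1) r + m2) r + m3) r
    = (h + m0) r^4 + m1 r^3 + m2 r^2 + m3 r

and sums the four unreduced products before a single reduction.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy modulus 2^61 - 1 (assumption for illustration, NOT 2^130 - 5). */
#define TOY_P ((((uint64_t) 1) << 61) - 1)

/* Reduce x < 2^125 to the range [0, TOY_P), using 2^61 = 1 (mod TOY_P). */
static uint64_t
toy_reduce(unsigned __int128 x)
{
  x = (x & TOY_P) + (x >> 61);  /* now below 2^64 + 2^61 */
  x = (x & TOY_P) + (x >> 61);  /* now below 2^61 + 9 */
  uint64_t r = (uint64_t) x;
  if (r >= TOY_P)
    r -= TOY_P;
  return r;
}

/* Serial Horner evaluation: one reduction per block.
   Requires r < TOY_P and every m[i] < 2^60. */
static uint64_t
toy_poly_serial(uint64_t r, const uint64_t *m, size_t n)
{
  uint64_t h = 0;
  for (size_t i = 0; i < n; i++)
    h = toy_reduce((unsigned __int128) (h + m[i]) * r);
  return h;
}

/* 4-way variant: precompute r^2, r^3, r^4, then one reduction per
   four blocks. The four products sum to less than 2^124. */
static uint64_t
toy_poly_4way(uint64_t r, const uint64_t *m, size_t n)
{
  uint64_t r2 = toy_reduce((unsigned __int128) r * r);
  uint64_t r3 = toy_reduce((unsigned __int128) r2 * r);
  uint64_t r4 = toy_reduce((unsigned __int128) r3 * r);
  uint64_t h = 0;
  size_t i = 0;

  for (; i + 4 <= n; i += 4)
    {
      unsigned __int128 acc;
      acc  = (unsigned __int128) (h + m[i]) * r4;
      acc += (unsigned __int128) m[i + 1] * r3;
      acc += (unsigned __int128) m[i + 2] * r2;
      acc += (unsigned __int128) m[i + 3] * r;
      h = toy_reduce(acc);  /* single reduction for four blocks */
    }

  /* Leftover blocks fall back to the serial step. */
  for (; i < n; i++)
    h = toy_reduce((unsigned __int128) (h + m[i]) * r);
  return h;
}
```

Both functions return the canonical residue, so they should agree for any
block count. A real implementation would also have to handle the Poly1305
padding bit and the final addition of the second key half, which this toy
leaves out.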
...
> It would be nice if the arm64 patch could be tested in big-endian mode,
> since I don't have access to any big-endian variant for testing.
Merged this one too, on a branch for CI testing.
Regards,
/Niels
-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.
