Re: [Arm64, PowerPC64, S390x] Optimize Poly1305

23 Jan 2022


      nisse@lysator.liu.se (Niels Möller) writes:
...
The current C implementation uses radix 26, and 25 multiplies (32x32
--> 64) per block. And quite a lot of shifts. A radix 32 variant
analogous to the above would need 16 long multiplies and 4 short. I'd
expect that to be faster on most machines, but I'd have to try that out.
I've tried this out, see attached file. It has an #if 0/1 to choose
between radix 64 (depending on the non-standard __int128 type for
accumulated products) and radix 32 (portable C).
This is the speed I get for C implementations of poly1305_update on my
x86_64 laptop:
* Radix 26: 1.2 GByte/s (old code)
* Radix 32: 1.3 GByte/s
* Radix 64: 2.2 GByte/s
It would be interesting with benchmarks on actual 32-bit hardware,
32-bit ARM likely being the most relevant arch.
For comparison, the current x86_64 asm version: 2.5 GByte/s.
If I understood correctly, the suggestion to use radix 26 in djb's
original paper was motivated by a high-speed implementation using
floating point arithmetic (possibly in combination with SIMD), where the
product of two 26-bit integers can be represented exactly in an IEEE
double (but it gets a bit subtle if we want to accumulate several
products), I haven't really looked into implementing poly1305 with
either floating point or SIMD.
To improve test coverage, I've also extended poly1305 tests with tests
on random inputs, with results compared to a reference implementation
based on gmp/mini-gmp. I intend to merge those testing changes soon.
See
https://gitlab.com/gnutls/nettle/-/commit/b48217c8058676c8cd2fd12cdeba457755....
Unfortunately, the http interface of the main git repo at Lysator is
inaccessible at the moment due to an expired certificate; should be
fixed in a day or two.
Regards,
/Niels
-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

2005

2004

2003

2002

Re: [Arm64, PowerPC64, S390x] Optimize Poly1305