Re: [Arm64, PowerPC64, S390x] Optimize Poly1305

24 Jan 2022


      nisse@lysator.liu.se (Niels Möller) writes:
...
This is the speed I get for C implementations of poly1305_update on my
x86_64 laptop:

Radix 26: 1.2 GByte/s (old code)

Radix 32: 1.3 GByte/s

Radix 64: 2.2 GByte/s


It would be interesting to see benchmarks on actual 32-bit hardware,
32-bit ARM likely being the most relevant arch.

For comparison, the current x86_64 asm version: 2.5 GByte/s.

I've tried reworking folding, to reduce latency. The idea is to let the
most significant state word be close to a full word, rather than limited
to <= 4 as in the previous version. When multiplying by r, split one of
the multiplies to take out the low 2 bits. For the radix 64 version
(B = 2^64), that term is

  B^2 t_2 * r0

Split t_2 as 4*hi + lo, then this can be reduced to

  B^2 lo * r0 + hi * 5*r0

(using the same old B^2 = 5/4 (mod p) in a slightly different way).

The 5*r0 fits in one word and can be precomputed, so this
multiplication goes in parallel with the other multiplies, and no
multiply is left in the final per-block folding. With this trick I get,
on the same machine:

Radix 32: 1.65 GByte/s

Radix 64: 2.75 GByte/s, i.e., faster than the current x86_64 asm version.
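
The split described above can be checked numerically. The following is
only a sketch in Python (big integers, not the attached C code); the
names split_term and check are made up for illustration, and it assumes
p = 2^130 - 5, B = 2^64, and the standard Poly1305 clamping of r, under
which the low word r0 is below 2^60.

```python
import random

P = (1 << 130) - 5               # Poly1305 prime
B = 1 << 64                      # word base for the radix 64 version
R0_CLAMP = 0x0FFFFFFC0FFFFFFF    # low 64 bits of the standard clamp mask (assumed)


def split_term(t2, r0):
    """Split B^2 * t2 * r0 into (lo * r0, hi * 5*r0), with t2 = 4*hi + lo."""
    hi, lo = t2 >> 2, t2 & 3
    r0_times_5 = 5 * r0          # depends only on the key, so precomputable
    assert r0_times_5 < B        # fits one 64-bit word because r0 < 2^60
    return lo * r0, hi * r0_times_5


def check(t2, r0_raw):
    """True if the split form is congruent to the direct product mod p."""
    r0 = r0_raw & R0_CLAMP
    lo_r0, hi_5r0 = split_term(t2, r0)
    direct = (B * B * t2 * r0) % P
    folded = (B * B * lo_r0 + hi_5r0) % P
    # hi < 2^62 and 5*r0 < 2^63, so hi * 5*r0 fits a 128-bit product.
    return direct == folded and hi_5r0 < (1 << 128)


rng = random.Random(0)
for _ in range(1000):
    assert check(rng.getrandbits(64), rng.getrandbits(64))
```

The point of the split is that hi * 5*r0 no longer carries the factor
B^2, so it can be accumulated together with the other word products
instead of needing a separate multiply during folding.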

I haven't yet done a strict analysis of bounds on the state and
temporaries, but I would expect that it works out with no possibility of
overflow.

See attached file. To fit the precomputed 5*r0 in a nice way I had to
rearrange the unions in struct poly1305_ctx a bit; I also attach the
patch to do this. The size of the struct should be the same, so I think
it can be done without any ABI bump.

Regards,
/Niels
-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.
