Maamoun TK <maamoun.tk@googlemail.com> writes:

> Wider multiplication would improve the performance for 64-bit general registers, but as is the case for the current SIMD implementation, the radix 2^26 fits well there.
If multiply throughput is the bottleneck, it makes sense to do as much work as possible per multiply. So I don't think I understand the benefits of interleaving; can you explain?
Let's consider the 64-bit case, since that's less writing. B = 2^64 as usual. Then the state is
H = h_2 B^2 + h_1 B + h_0
(with h_2 rather small, depending on how far we normalize for each block; let's assume at most 3 bits, or maybe even h_2 <= 4).
R = r_1 B + r_0
By the spec, the high 4 bits of both r_0 and r_1, and the low 2 bits of r_1, are zero, which makes the multiplication R H (mod p) particularly nice.
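For reference, a minimal sketch of how those zero bits come about from the clamping, viewing the first 16 key bytes as two little-endian 64-bit words (function and variable names are just for illustration, not from any particular implementation):

  #include <stdint.h>
  #include <string.h>

  /* Clamp the r half of the key. After this, r0 and r1 both have their
     high 4 bits zero, and r1 also has its low 2 bits zero (the clamp
     clears a few more bits, not needed for the argument below). */
  static void
  poly1305_clamp (const uint8_t key[16], uint64_t *r0, uint64_t *r1)
  {
    uint64_t lo, hi;
    memcpy (&lo, key, 8);     /* assumes a little-endian host, for brevity */
    memcpy (&hi, key + 8, 8);
    *r0 = lo & 0x0ffffffc0fffffffULL;
    *r1 = hi & 0x0ffffffc0ffffffcULL;
  }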
We get
R H = r_0 h_0 + B (r_1 h_0 + r_0 h_1) + B^2 (r_1 h_1 + r_0 h_2) + B^3 r_1 h_2
But then B^2 = 5/4 (mod p), and hence B^2 r_1 = 5 r_1 / 4 (mod p), where the "/ 4" is just shifting out the two low zero bits. So let r_1' = 5 r_1 / 4,
R H = r_0 h_0 + r_1' h_1 + B (r_1 h_0 + r_0 h_1 + r_1' h_2 + B r_0 h_2)
These are 4 long multiplications (64x64 --> 128) and two short (64x64 --> 64) for the products involving h_2. (The 32-bit version would be 16 long multiplications and 4 short.)
From the zero high bits, we also get bounds on these terms,
f_0 = r_0 h_0 + r_1' h_1 < 2^124 + 5*2^122 = 9*2^122
f_1 = r_1 h_0 + r_0 h_1 + r_1' h_2 + B r_0 h_2 < 2^125 + 5*2^61 + 2^127
So these two chains can be added together as 128-bit quantities with no overflow, in any order; there's plenty of parallelism. E.g., the POWER vmsumudm instruction might be useful.
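To make the dataflow concrete, here's a rough C sketch of that accumulation (just illustrating the arithmetic, not any implementation); it assumes uint64_t variables h0, h1, h2 for the state (h2 small) and r0, r1 for the clamped key, and uses gcc's unsigned __int128 in place of explicit 64x64 --> 128 multiply instructions:

  /* r_1' = 5 r_1 / 4; an exact shift, since the low 2 bits of r_1 are zero. */
  uint64_t r1p = 5 * (r1 >> 2);

  /* f_0 = r_0 h_0 + r_1' h_1 < 9*2^122 */
  unsigned __int128 f0 = (unsigned __int128) r0 * h0
    + (unsigned __int128) r1p * h1;

  /* f_1 = r_1 h_0 + r_0 h_1 + r_1' h_2 + B r_0 h_2 < 2^128.
     The products involving h2 are the two short multiplications;
     they fit in 64 bits since h2 is only a few bits. */
  unsigned __int128 f1 = (unsigned __int128) r1 * h0
    + (unsigned __int128) r0 * h1
    + r1p * h2
    + ((unsigned __int128) (r0 * h2) << 64);

Each of the two-product sums here is the shape of thing vmsumudm computes in one instruction (two 64x64 products added into a 128-bit accumulator), if I read the ISA right.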
For the final folding, we need to split f_1 into the top 62 and low 66 bits, multiply the top part by 5 (the product fits in 64 bits), and add it into f_0, which still fits in 128 bits.
And then take the top 64 bits of f_0 and add them into f_1 (the result fits in 67 bits).
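Continuing the same sketch, the folding might look like this:

  /* Bits 130 and up of the result are the top 62 bits of f1 (bit 66 of
     f1 is bit 130 of f0 + B f1); they wrap around to bit 0 multiplied
     by 5, since 2^130 = 5 (mod p). */
  uint64_t t = (uint64_t) (f1 >> 66);      /* at most about 5*2^59, so 5*t fits in 64 bits */
  f1 &= ((unsigned __int128) 1 << 66) - 1; /* keep the low 66 bits */
  f0 += 5 * t;                             /* still fits in 128 bits */

  /* Carry the top half of f0 into f1, and re-split into the state. */
  f1 += f0 >> 64;                          /* fits in 67 bits */
  h0 = (uint64_t) f0;
  h1 = (uint64_t) f1;
  h2 = (uint64_t) (f1 >> 64);              /* at most 3 bits, as assumed above */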
The current C implementation uses radix 26, and 25 multiplies (32x32 --> 64) per block. And quite a lot of shifts. A radix 32 variant analogous to the above would need 16 long multiplies and 4 short. I'd expect that to be faster on most machines, but I'd have to try that out.
In contrast, trying to use a similar scheme for multiplying by (r^2 (mod p)), as needed for an interleaved version, seems more expensive. There are several contributions to the cost:
* First, the accumulation of products by power of B needs to take carries into account, since the result can exceed 2^128, so one would need something closer to general schoolbook multiplication.
* Second, since r^2 (mod p) may exceed 2^128, we need three words rather than two, so three more short multiplications to add in.
* Third, we can't pre-divide key words by 4, since the low bits are no longer guaranteed to be zero. This gives a more expensive reduction, with more multiplies by 5.
The first two points make a smaller radix more attractive; if we need three words for both factors, we can distribute the bits to ensure some of the most significant bits are zero.
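For example (my numbers, nothing from an actual implementation): a value below 2^130, held in words y0, y1 and a few extra bits y2, could be spread over three words of 44, 44 and 42 bits, leaving roughly 20 free high bits per word to absorb accumulation without carries:

  uint64_t m44 = ((uint64_t) 1 << 44) - 1;
  uint64_t w0 = y0 & m44;                         /* bits 0..43 */
  uint64_t w1 = ((y0 >> 44) | (y1 << 20)) & m44;  /* bits 44..87 */
  uint64_t w2 = (y1 >> 24) | (y2 << 40);          /* bits 88..129, 42 bits */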
> Since the loop of block iteration is moved to inside the assembly implementation, computing one multiple of key at the function prologue should be ok.
For large messages, that's fine, but may add a significant cost for messages of just two blocks.
Regards,
/Niels