Re: Latency in polynomial evaluation

7 Feb 2022


      Maamoun TK maamoun.tk@googlemail.com writes:
...
On Sat, Jan 29, 2022 at 4:29 PM Niels Möller nisse@lysator.liu.se wrote:
...
** Interleaving **
The other approach, used in the recent powerpc gcm code, is to
interleave multiple blocks. For simplicity, only consider 2-way
interleaving here. The key thing is that if we expand te recurrency
once, we get
R_j = K (M_j + K (R_{j-2} + M_{j-1}))
      = K M_j + K^2 (R_{j-2} + M_{j-1})
We get two field multiplications, but one of them, K M_j, is completely
independent of previous blocks (R_{j-2}), and can be computed in
parallel. It may add a cycle or so to accumulation latency, but we can
do essentially twice as much work without making the critical path
longer. We get 8 independent multiply instructions, and one dependent
for the final folding.
But in radix 2^64, doesn't K^2 need to be represented in 3 words as
described in a previous note so considering that we get 10 independent
multiply instructions rather than 8?
For poly1305, yes. For gcm, I don't think so.
...
...
** Doing both **
It's possible to combine those two tricks. Processing of two blocks
would then be an operation of the form
Z_2 B^2 + Z_1 B + Z_0
    = (X_2 B^2 + X_1 B + X_0) K^2 + (Y_1 B + Y_0) K
Here, the Xs (three words) represent R_{j-2} + M_{j-1}, the Ys repreent
M_{j-2}, and the Zs represents R_j, as three words (without final folding).
We would need 10 independent multiples, one more than with plain
interleaving, but critical path includes only one multiply latency.
Same matter here, if we consider 3 words for K^3 this sums up the number of
independent multiples to 13.
For poly1305, yes. This seems to be an unavoidable cost of doing
interleaving with poly1305: Structure of K is choosen to make
multiplication particularly convenient, but that doesn't carry over to
K^2. And we'll also have an X_3, of just a few bits. Unclear if it helps
that most significant words are small; clearly multplying top 3 bits of
X by top 3 bits of K^2 can use a short multiply (imul, on x86_64), but
it seems to me that all other products need the more expensive
64x64->128 multiply instruction.
Also, if we have to use an extra word for poly1305, we could perhaps
also explore going to radix 2^44, if hardware multipliers can support
that.
Regards,
/Niels
-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

2005

2004

2003

2002

Re: Latency in polynomial evaluation