Re: Latency in polynomial evaluation

29 Jan 2022


nisse@lysator.liu.se (Niels Möller) writes:
...
> Y_2 B^2 + Y_1 B + Y_0 = (X_2 B^2 + X_1 B + X_0) (K_1 B + K_0)  (mod P)
>
> This can be arranged with 6 independent multiply instructions + cheap
> accumulation. (I haven't worked out the details for the ghash case, but
> I do expect that it's rather practical there too).

I've found a rather straightforward way to express that.

Recall that for ghash, due to the bit-reversal, the multiply operation
of interest is

  M H x^{-128} mod P

where the structure of P means that x^{-64} = x^{64} + P_1, and P_1 is a
single word. Split M and H into halves,

  M = M_1 x^{64} + M_0
  H = H_1 x^{64} + H_0

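As a sanity check of the x^{-64} identity, here is a minimal Python
sketch. It assumes that P is the bit-reversed GCM polynomial
x^128 + x^127 + x^126 + x^121 + 1 (the message does not spell P out), so
that P_1 = x^63 + x^62 + x^57. The helper names clmul, polymod and split
are made up for illustration. Polynomials over GF(2) are stored as
Python integers, with bit i holding the coefficient of x^i.

```python
# Sketch only: GF(2) polynomials as Python ints, bit i = coefficient of x^i.
# ASSUMPTION: P is the bit-reversed GCM polynomial, P = x^128 + P_1 x^64 + 1.
P1 = (1 << 63) | (1 << 62) | (1 << 57)
P = (1 << 128) | (P1 << 64) | 1
MASK64 = (1 << 64) - 1


def clmul(a, b):
    """Carry-less (GF(2)) product of two polynomials."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def polymod(a, p):
    """Remainder of a modulo p over GF(2)."""
    dp = p.bit_length()
    while a.bit_length() >= dp:
        a ^= p << (a.bit_length() - dp)
    return a


def split(v):
    """Split a 128-bit value into (high word, low word)."""
    return v >> 64, v & MASK64


# Candidate for x^{-64}: the polynomial x^64 + P_1.
X_INV64 = (1 << 64) ^ P1

# Multiplying the candidate by x^64 must give 1 modulo P.
assert polymod(clmul(X_INV64, 1 << 64), P) == 1

# Splitting into halves and recombining is lossless.
m = 0x0123456789ABCDEF0011223344556677  # arbitrary value
m1, m0 = split(m)
assert (m1 << 64) ^ m0 == m
```

The check works for any P of the form x^128 + P_1 x^64 + 1, since then
(x^64 + P_1) x^64 = P + 1.
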
The previous notes define the precomputation of

  D_1 x^{64} + D_0 = H_0 x^{64} + H_1 + H_0 P_1

Alternatively, D can be defined as D = x^{-64} H. The accumulation
part can then be written as

  (M_1 x^{64} + M_0) H x^{-128} = (M_1 H + M_0 D) x^{-64}

As before, accumulate this in two 128-bit registers R and F, as

  (M_1 x^{64} + M_0) H x^{-128} = R + F x^{-64}

with

  R = M_1 H_1 + M_0 D_1
  F = M_1 H_0 + M_0 D_0

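The two-register accumulation can be checked with a short Python
sketch. As an assumption not stated in the message, P is taken to be the
bit-reversed GCM polynomial x^128 + x^127 + x^126 + x^121 + 1, giving
P_1 = x^63 + x^62 + x^57. The function names (precompute_d, accumulate,
reference) are hypothetical, and the bit-by-bit clmul stands in for a
hardware carry-less multiply instruction.

```python
# Sketch only: GF(2) polynomials as Python ints, bit i = coefficient of x^i.
# ASSUMPTION: P is the bit-reversed GCM polynomial, P = x^128 + P_1 x^64 + 1.
MASK64 = (1 << 64) - 1
P1 = (1 << 63) | (1 << 62) | (1 << 57)
P = (1 << 128) | (P1 << 64) | 1


def clmul(a, b):
    """Carry-less (GF(2)) product of two polynomials."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def polymod(a, p):
    """Remainder of a modulo p over GF(2)."""
    dp = p.bit_length()
    while a.bit_length() >= dp:
        a ^= p << (a.bit_length() - dp)
    return a


X_INV64 = (1 << 64) ^ P1                         # x^{-64} mod P
X_INV128 = polymod(clmul(X_INV64, X_INV64), P)   # x^{-128} mod P


def precompute_d(h):
    """D = H_0 x^64 + H_1 + H_0 P_1, congruent to H x^{-64} mod P."""
    h1, h0 = h >> 64, h & MASK64
    return (h0 << 64) ^ h1 ^ clmul(h0, P1)


def accumulate(m, h, d):
    """Return the unreduced pair (R, F), using four word products."""
    m1, m0 = m >> 64, m & MASK64
    h1, h0 = h >> 64, h & MASK64
    d1, d0 = d >> 64, d & MASK64
    r = clmul(m1, h1) ^ clmul(m0, d1)
    f = clmul(m1, h0) ^ clmul(m0, d0)
    return r, f


def reference(m, h):
    """M H x^{-128} mod P, computed directly for comparison."""
    return polymod(clmul(clmul(m, h), X_INV128), P)


# Arbitrary 128-bit inputs, chosen only for illustration.
h = 0x0123456789ABCDEF0011223344556677
m = 0xFEDCBA98765432108899AABBCCDDEEFF
d = precompute_d(h)
r, f = accumulate(m, h, d)

# D agrees with H x^{-64}, and R + F x^{-64} agrees with M H x^{-128}.
assert polymod(d, P) == polymod(clmul(h, X_INV64), P)
assert polymod(r ^ clmul(f, X_INV64), P) == reference(m, h)
```

Note that R and F stay unreduced here. The sketch reduces only to
compare against the direct computation.
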
If we add one more unreduced word to M,

  M = M_1 x^{64} + M_0 + M_{-1} x^{-64}

all we need is to precompute one more constant E = H x^{-128} = D x^{-64},
in the same way

  E_1 x^{64} + E_0 = D x^{-64} = D_0 x^{64} + D_1 + D_0 P_1

and we get one more term each for R and F,

  R = M_1 H_1 + M_0 D_1 + M_{-1} E_1
  F = M_1 H_0 + M_0 D_0 + M_{-1} E_0

At the end of the iteration, just add the high half of F into R, but
keep F_0 as an input (in the place of M_{-1}) for the next iteration.

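Here is a rough Python sketch of the resulting iteration, under some
assumptions of mine: P is taken to be the bit-reversed GCM polynomial
x^128 + x^127 + x^126 + x^121 + 1 (so P_1 = x^63 + x^62 + x^57), the
state is kept as a 128-bit value s plus the carried word c = F_0, and
each new block is xored into s before the multiply. The names fold,
hash_blocks and hash_reference are hypothetical, and the byte and bit
ordering of real GHASH is ignored.

```python
# Sketch only: GF(2) polynomials as Python ints, bit i = coefficient of x^i.
# ASSUMPTION: P is the bit-reversed GCM polynomial, P = x^128 + P_1 x^64 + 1.
MASK64 = (1 << 64) - 1
P1 = (1 << 63) | (1 << 62) | (1 << 57)
P = (1 << 128) | (P1 << 64) | 1


def clmul(a, b):
    """Carry-less (GF(2)) product of two polynomials."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def polymod(a, p):
    """Remainder of a modulo p over GF(2)."""
    dp = p.bit_length()
    while a.bit_length() >= dp:
        a ^= p << (a.bit_length() - dp)
    return a


X_INV64 = (1 << 64) ^ P1                         # x^{-64} mod P
X_INV128 = polymod(clmul(X_INV64, X_INV64), P)   # x^{-128} mod P


def fold(v):
    """Return V_0 x^64 + V_1 + V_0 P_1, congruent to V x^{-64} mod P."""
    v1, v0 = v >> 64, v & MASK64
    return (v0 << 64) ^ v1 ^ clmul(v0, P1)


def hash_blocks(blocks, h):
    """Horner-style evaluation keeping one unreduced extra word c = F_0."""
    d = fold(h)   # D = H x^{-64}
    e = fold(d)   # E = H x^{-128}
    h1, h0 = h >> 64, h & MASK64
    d1, d0 = d >> 64, d & MASK64
    e1, e0 = e >> 64, e & MASK64
    s, c = 0, 0   # state represents s + c x^{-64}
    for b in blocks:
        m = s ^ b
        m1, m0 = m >> 64, m & MASK64
        # Six independent word products per block.
        r = clmul(m1, h1) ^ clmul(m0, d1) ^ clmul(c, e1)
        f = clmul(m1, h0) ^ clmul(m0, d0) ^ clmul(c, e0)
        s = r ^ (f >> 64)   # add the high half of F into R
        c = f & MASK64      # F_0 takes the place of M_{-1}
    # Final reduction, done once, only to get a canonical result.
    return polymod(s ^ clmul(c, X_INV64), P)


def hash_reference(blocks, h):
    """Same recurrence with a full reduction after every block."""
    x = 0
    for b in blocks:
        x = polymod(clmul(clmul(x ^ b, h), X_INV128), P)
    return x


# Arbitrary inputs, chosen only for illustration.
h = 0x0123456789ABCDEF0011223344556677
blocks = [0xFEDCBA98765432108899AABBCCDDEEFF, 0x1, 0x80000000000000000000000000000000]
assert hash_blocks(blocks, h) == hash_reference(blocks, h)
```

In this sketch s always fits in 128 bits, since R and F each have degree
at most 126, so only F_0 has to be carried over as the extra word.
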
Regards,
/Niels
-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.
