nisse@lysator.liu.se (Niels Möller) writes:
Y_2 B^2 + Y_1 B + Y_0 = (X_2 B^2 + X_1 B + X_0) (K_1 B + K_0) (mod P)
This can be arranged with 6 independent multiply instructions + cheap accumulation. (I haven't worked out the details for the ghash case, but I do expect that it's rather practical there too).
I've found a rather straightforward way to express that.
Recall that for ghash, due to the bit-reversal, the multiply operation of interest is
M H x^{-128} mod P
where the structure of P means that x^{-64} = x^{64} + P_1, and P_1 is a single word. Split M and H into halves,
M = M_1 x^{64} + M_0
H = H_1 x^{64} + H_0
The previous notes define the precomputation of
D_1 x^{64} + D_0 = H_0 x^{64} + H_1 + H_0 P_1
Alternatively, D can be defined as D = x^{-64} H. And the accumulation part can then be written as
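A minimal Python sketch to sanity-check the definition of D. It is not from the notes: the helper names (clmul, reduce_p, precompute_d) are made up for illustration, and it assumes that P is the bit-reversed GHASH polynomial x^128 + x^127 + x^126 + x^121 + 1, which gives P_1 = x^63 + x^62 + x^57. Polynomials over GF(2) are stored as Python integers, with bit i holding the coefficient of x^i.

```python
import random

# Assumption (not stated in the notes): P is the bit-reversed GHASH
# polynomial, so that x^{-64} = x^{64} + P_1 with a single-word P_1.
P = (1 << 128) | (1 << 127) | (1 << 126) | (1 << 121) | 1
P1 = (1 << 63) | (1 << 62) | (1 << 57)
MASK64 = (1 << 64) - 1


def clmul(a, b):
    """Carry-less product of two GF(2) polynomials stored as ints."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def reduce_p(a):
    """Canonical reduction modulo P (result has degree < 128)."""
    while a.bit_length() > 128:
        a ^= P << (a.bit_length() - 129)
    return a


def precompute_d(h):
    """D = H_0 x^64 + H_1 + H_0 P_1, for a 128-bit H."""
    h1, h0 = h >> 64, h & MASK64
    return (h0 << 64) ^ h1 ^ clmul(h0, P1)


# x^64 (x^64 + P_1) = 1 (mod P), i.e. x^{-64} = x^64 + P_1.
assert reduce_p(clmul(1 << 64, (1 << 64) | P1)) == 1

rng = random.Random(1)
for _ in range(100):
    h = rng.getrandbits(128)
    d = precompute_d(h)
    assert d < (1 << 128)            # D fits in two words without reduction
    assert reduce_p(d << 64) == h    # D x^64 = H (mod P), so D = x^{-64} H
```

Under this assumption, H_0 P_1 has degree at most 126, so D fits in 128 bits without any further reduction.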
(M_1 x^{64} + M_0) H x^{-128} = (M_1 H + M_0 D) x^{-64}
As before, accumulate this in two 128-bit registers R and F, as
(M_1 x^{64} + M_0) H x^{-128} = R + F x^{-64}
with
R = M_1 H_1 + M_0 D_1
F = M_1 H_0 + M_0 D_0
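The R and F formulas can be checked numerically with a short Python sketch. As an assumption not taken from the notes, P is the bit-reversed GHASH polynomial x^128 + x^127 + x^126 + x^121 + 1; the function names are hypothetical. To avoid computing negative powers of x, the check multiplies both sides by x^{128} and compares M H with R x^{128} + F x^{64} modulo P.

```python
import random

# Assumption: bit-reversed GHASH polynomial, P_1 = x^63 + x^62 + x^57.
P = (1 << 128) | (1 << 127) | (1 << 126) | (1 << 121) | 1
P1 = (1 << 63) | (1 << 62) | (1 << 57)
MASK64 = (1 << 64) - 1


def clmul(a, b):
    """Carry-less product of two GF(2) polynomials stored as ints."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def reduce_p(a):
    """Canonical reduction modulo P (result has degree < 128)."""
    while a.bit_length() > 128:
        a ^= P << (a.bit_length() - 129)
    return a


def accumulate(m, h):
    """Return (R, F) with M H x^{-128} = R + F x^{-64} (mod P)."""
    m1, m0 = m >> 64, m & MASK64
    h1, h0 = h >> 64, h & MASK64
    d = (h0 << 64) ^ h1 ^ clmul(h0, P1)   # D = x^{-64} H
    d1, d0 = d >> 64, d & MASK64
    r = clmul(m1, h1) ^ clmul(m0, d1)     # four 64x64 multiplies in total
    f = clmul(m1, h0) ^ clmul(m0, d0)
    return r, f


rng = random.Random(2)
for _ in range(100):
    m, h = rng.getrandbits(128), rng.getrandbits(128)
    r, f = accumulate(m, h)
    assert r < (1 << 128) and f < (1 << 128)   # both fit in 128 bits
    # (R + F x^{-64}) x^{128} = M H (mod P)
    assert reduce_p((r << 128) ^ (f << 64)) == reduce_p(clmul(m, h))
```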
If we add one more unreduced word to M,
M = M_1 x^{64} + M_0 + M_{-1} x^{-64}
all we need is to precompute one more constant E = H x^{-128} = D x^{-64}, in the same way
E_1 x^{64} + E_0 = D x^{-64} = D_0 x^{64} + D_1 + D_0 P_1
and we get one more term each for R and F,
R = M_1 H_1 + M_0 D_1 + M_{-1} E_1
F = M_1 H_0 + M_0 D_0 + M_{-1} E_0
At the end of the iteration, just add the high half of F, F_1, into R, but keep the low half F_0 as an input (in the place of M_{-1}) for the next iteration.
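A rough Python sketch of the whole loop, comparing the six-multiply update with a plain reference computation of X <- (X + block) H x^{-128} mod P. The assumptions are mine, not from the notes: P is taken to be the bit-reversed GHASH polynomial x^128 + x^127 + x^126 + x^121 + 1, all function names are hypothetical, and the final fold of the leftover F_0 (using x^{-64} = x^{64} + P_1) is one possible way to finish.

```python
import random

# Assumption: bit-reversed GHASH polynomial, P_1 = x^63 + x^62 + x^57.
P = (1 << 128) | (1 << 127) | (1 << 126) | (1 << 121) | 1
P1 = (1 << 63) | (1 << 62) | (1 << 57)
MASK64 = (1 << 64) - 1


def clmul(a, b):
    """Carry-less product of two GF(2) polynomials stored as ints."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def reduce_p(a):
    """Canonical reduction modulo P (result has degree < 128)."""
    while a.bit_length() > 128:
        a ^= P << (a.bit_length() - 129)
    return a


def times_x_inv64(v):
    """v x^{-64} for a 128-bit v; the result again fits in 128 bits."""
    v1, v0 = v >> 64, v & MASK64
    return (v0 << 64) ^ v1 ^ clmul(v0, P1)


def hash_lazy(h, blocks):
    """Accumulate with state R + F_0 x^{-64}, six multiplies per block."""
    d = times_x_inv64(h)          # D = H x^{-64}
    e = times_x_inv64(d)          # E = H x^{-128}
    h1, h0 = h >> 64, h & MASK64
    d1, d0 = d >> 64, d & MASK64
    e1, e0 = e >> 64, e & MASK64
    r, f0 = 0, 0
    for block in blocks:
        m = r ^ block
        m1, m0, mm1 = m >> 64, m & MASK64, f0   # F_0 plays the role of M_{-1}
        r = clmul(m1, h1) ^ clmul(m0, d1) ^ clmul(mm1, e1)
        f = clmul(m1, h0) ^ clmul(m0, d0) ^ clmul(mm1, e0)
        r ^= f >> 64              # add the high half F_1 into R
        f0 = f & MASK64           # keep F_0 for the next iteration
    # Final fold: R + F_0 x^{-64} = R + F_0 x^64 + F_0 P_1, degree < 128.
    return r ^ (f0 << 64) ^ clmul(f0, P1)


def hash_reference(h, blocks):
    """Direct computation of X <- (X + block) H x^{-128} mod P."""
    x_inv64 = (1 << 64) | P1
    x_inv128 = reduce_p(clmul(x_inv64, x_inv64))
    x = 0
    for block in blocks:
        x = reduce_p(clmul(reduce_p(clmul(x ^ block, h)), x_inv128))
    return x


rng = random.Random(3)
for _ in range(20):
    h = rng.getrandbits(128)
    blocks = [rng.getrandbits(128) for _ in range(5)]
    assert hash_lazy(h, blocks) == hash_reference(h, blocks)
```

Since each product has degree at most 126 and F_1 is a single word, R stays within 128 bits from one iteration to the next, so no reduction is needed inside the loop in this sketch.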
Regards, /Niels