Maamoun TK maamoun.tk@googlemail.com writes:
On Sat, Jan 29, 2022 at 4:29 PM Niels Möller nisse@lysator.liu.se wrote:
** Interleaving **
The other approach, used in the recent powerpc gcm code, is to interleave multiple blocks. For simplicity, only consider 2-way interleaving here. The key thing is that if we expand te recurrency once, we get
R_j = K (M_j + K (R_{j-2} + M_{j-1})) = K M_j + K^2 (R_{j-2} + M_{j-1})
We get two field multiplications, but one of them, K M_j, is completely independent of previous blocks (R_{j-2}), and can be computed in parallel. It may add a cycle or so to accumulation latency, but we can do essentially twice as much work without making the critical path longer. We get 8 independent multiply instructions, and one dependent for the final folding.
But in radix 2^64, doesn't K^2 need to be represented in 3 words as described in a previous note so considering that we get 10 independent multiply instructions rather than 8?
For poly1305, yes. For gcm, I don't think so.
** Doing both **
It's possible to combine those two tricks. Processing of two blocks would then be an operation of the form
Z_2 B^2 + Z_1 B + Z_0 = (X_2 B^2 + X_1 B + X_0) K^2 + (Y_1 B + Y_0) K
Here, the Xs (three words) represent R_{j-2} + M_{j-1}, the Ys repreent M_{j-2}, and the Zs represents R_j, as three words (without final folding). We would need 10 independent multiples, one more than with plain interleaving, but critical path includes only one multiply latency.
Same matter here, if we consider 3 words for K^3 this sums up the number of independent multiples to 13.
For poly1305, yes. This seems to be an unavoidable cost of doing interleaving with poly1305: Structure of K is choosen to make multiplication particularly convenient, but that doesn't carry over to K^2. And we'll also have an X_3, of just a few bits. Unclear if it helps that most significant words are small; clearly multplying top 3 bits of X by top 3 bits of K^2 can use a short multiply (imul, on x86_64), but it seems to me that all other products need the more expensive 64x64->128 multiply instruction.
Also, if we have to use an extra word for poly1305, we could perhaps also explore going to radix 2^44, if hardware multipliers can support that.
Regards, /Niels