nisse@lysator.liu.se (Niels Möller) writes:
So if we have the input in register A (loaded from memory with no processing besides ensuring proper *byte* order), and precompute two values, M representing b_1(x) x^64 + c_1(x), and L representing b_0(x) x^64 + d_1(x), then we get the two halves above with two vpmsumd:
  vpmsumd R, M, A
  vpmsumd F, L, A
When doing more than one block at a time, I think it's easiest to accumulate the R and F values separately.
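As a rough illustration of the two vpmsumd above, here is a minimal Python model, treating 128-bit registers as integers. It relies on vpmsumd computing the xor of the two 64x64 carry-less products of corresponding doublewords. The helper names (clmul, two_halves) are made up for this sketch, and passing a separate (M, L) pair per block is an assumption about how the multi-block case would be organized, not something stated above.

```python
MASK64 = (1 << 64) - 1

def clmul(a, b):
    """Carry-less product of two non-negative integers (polynomials over GF(2))."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def vpmsumd(a, b):
    """Model of vpmsumd: xor of the two 64x64 carry-less doubleword products."""
    a_hi, a_lo = a >> 64, a & MASK64
    b_hi, b_lo = b >> 64, b & MASK64
    return clmul(a_hi, b_hi) ^ clmul(a_lo, b_lo)

def two_halves(blocks):
    """Accumulate R and F separately over (M, L, A) triples (hypothetical layout)."""
    R = F = 0
    for M, L, A in blocks:
        R ^= vpmsumd(M, A)   # vpmsumd R, M, A
        F ^= vpmsumd(L, A)   # vpmsumd F, L, A
    return R, F
```

Since everything is linear over GF(2), xoring the per-block results into R and F is all the accumulation needs.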
BTW, I wonder if a similar organization would make sense for Arm Neon. Now, Neon doesn't have vpmsumd; the widest carryless multiplication available is vmull.p8, which is an 8-bit to 15-bit multiply, 8 in parallel.
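For reference, a small Python model of what vmull.p8 computes, with lanes represented as a list of 8 byte values. The function names are chosen for this sketch only; the real instruction takes two 64-bit registers and writes a 128-bit register of eight 16-bit lanes.

```python
def pmul8(a, b):
    """8-bit by 8-bit carry-less multiply; the result fits in 15 bits."""
    r = 0
    for i in range(8):
        if (b >> i) & 1:
            r ^= a << i
    return r

def vmull_p8(a, b):
    """Lane-wise model: eight independent 8x8 polynomial multiplies."""
    assert len(a) == 8 and len(b) == 8
    return [pmul8(x, y) for x, y in zip(a, b)]
```

The largest possible lane result is pmul8(0xff, 0xff), which has degree 14, hence "8-bit to 15-bit".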
I'm sketching an instruction sequence doing the equivalent of two vpmsumd using 32 vmull.p8, with good parallelism and not too many instructions to shuffle around data to the right places. Is that a good idea? It should be compared to what the C code does: a loop of 16 iterations, each doing some table lookup, shifting and xoring.
With this large number of multiply instructions, it might pay off to use Karatsuba, which could reduce it to 24 multiplies (one level) or 18 (two levels), at the cost of more xors and data movement instructions, and lots of complexity.
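The counts above can be checked with a small Python sketch of Karatsuba over GF(2). It assumes that two vpmsumd amount to four 64x64 carry-less products, that each product is built from 8x8 multiplies, and that every vmull.p8 is perfectly packed with 8 useful lanes. It ignores the extra xors and shuffles entirely, and the function names are made up for the illustration.

```python
def clmul(a, b):
    """Carry-less product of two non-negative integers."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def karatsuba(a, b, bits, levels, counter):
    """Carry-less product of two bits-wide values.

    Applies `levels` Karatsuba splits, then schoolbook multiplication on
    8-bit limbs; counter[0] counts the 8x8 multiplies used.
    """
    if levels == 0:
        r = 0
        for i in range(0, bits, 8):
            for j in range(0, bits, 8):
                r ^= clmul((a >> i) & 0xff, (b >> j) & 0xff) << (i + j)
                counter[0] += 1
        return r
    h = bits // 2
    mask = (1 << h) - 1
    a0, a1 = a & mask, a >> h
    b0, b1 = b & mask, b >> h
    lo = karatsuba(a0, b0, h, levels - 1, counter)
    hi = karatsuba(a1, b1, h, levels - 1, counter)
    # Over GF(2), addition is xor, so a0 ^ a1 still fits in h bits.
    mid = karatsuba(a0 ^ a1, b0 ^ b1, h, levels - 1, counter)
    return lo ^ ((mid ^ lo ^ hi) << h) ^ (hi << bits)

def vmull_count(levels):
    """vmull.p8 instructions for four 64x64 products, 8 lanes each."""
    counter = [0]
    karatsuba(0x0123456789abcdef, 0xfedcba9876543210, 64, levels, counter)
    return 4 * counter[0] // 8
```

With these assumptions, vmull_count(0), vmull_count(1) and vmull_count(2) give 32, 24 and 18.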
(There has been ARM Neon code for gcm posted to the list earlier, but if I remember correctly, that code didn't work in the bit-reversed representation, but used a bunch of explicit reversal operations).
Regards,
/Niels