On Sun, Oct 11, 2020 at 1:42 PM Niels Möller <nisse@lysator.liu.se> wrote:
nisse@lysator.liu.se (Niels Möller) writes:
So if we have the input in register A (loaded from memory with no processing besides ensuring proper *byte* order), and precompute two values, M representing b_1(x) x^64 + c_1(x), and L representing b_0(x) x^64 + d_1(x), then we get the two halves above with two vpmsumd,
  vpmsumd R, M, A
  vpmsumd F, L, A
When doing more than one block at a time, I think it's easiest to accumulate the R and F values separately.
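To make the two-instruction step and the separate accumulation concrete, here is a rough Python model. It treats 128-bit registers as plain integers and ignores GCM's bit-reflection and the final reduction, so it is only a sketch of the data flow. The helper names (clmul64, accumulate) are made up for illustration, and passing one precomputed (M, L) pair per block position is an assumption, not something stated above.

```python
MASK64 = (1 << 64) - 1

def clmul64(a, b):
    """Carry-less (GF(2)[x]) product of two 64-bit values."""
    r = 0
    for i in range(64):
        if (b >> i) & 1:
            r ^= a << i
    return r

def vpmsumd(x, y):
    """Model of vpmsumd on 128-bit integers: multiply the high halves and
    the low halves carry-lessly, then XOR the two 128-bit products."""
    x_hi, x_lo = x >> 64, x & MASK64
    y_hi, y_lo = y >> 64, y & MASK64
    return clmul64(x_hi, y_hi) ^ clmul64(x_lo, y_lo)

def accumulate(blocks, m_consts, l_consts):
    """Accumulate R and F separately over several blocks.
    Assumption: m_consts[i] and l_consts[i] are the precomputed M and L
    values that belong to block i."""
    r = f = 0
    for a, m, l in zip(blocks, m_consts, l_consts):
        r ^= vpmsumd(m, a)  # vpmsumd R, M, A
        f ^= vpmsumd(l, a)  # vpmsumd F, L, A
    return r, f
```

Since everything is XOR-linear, R and F can be combined and reduced once after the loop rather than once per block.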
BTW, I wonder if similar organization would make sense for Arm Neon. Now, Neon doesn't have vpmsumd, the widest carryless multiplication available is vmull.p8, which is an 8-bit to 15-bit multiply, 8 in parallel...
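For a sense of the cost of having only 8-bit lanes, here is a naive Python model of vmull.p8 and a schoolbook 64x64 carry-less multiply built from it. It is only a sketch: the helper names are made up, and real Neon code would arrange the partial products with vector shuffles rather than the scalar loop shown here.

```python
def pmul8(a, b):
    """Carry-less product of two 8-bit values (one vmull.p8 lane);
    the result has at most 15 bits."""
    r = 0
    for i in range(8):
        if (b >> i) & 1:
            r ^= a << i
    return r

def vmull_p8(xs, ys):
    """Model of vmull.p8: eight independent 8x8 carry-less products."""
    return [pmul8(x, y) for x, y in zip(xs, ys)]

def to_bytes_le(v):
    """Split a 64-bit value into eight bytes, least significant first."""
    return [(v >> (8 * i)) & 0xFF for i in range(8)]

def clmul64_from_p8(a, b):
    """64x64 carry-less product assembled from 8-bit partial products."""
    a_bytes = to_bytes_le(a)
    r = 0
    for j, bj in enumerate(to_bytes_le(b)):
        # One vmull.p8 per byte of b: all bytes of a times the broadcast byte.
        lanes = vmull_p8(a_bytes, [bj] * 8)
        for i, p in enumerate(lanes):
            r ^= p << (8 * (i + j))  # place each product at its bit offset
    return r
```

In this naive form a single 64-bit product takes eight vmull.p8 operations plus the shifting and XORing needed to recombine them.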
I may be mistaken, but I believe 64-bit poly multiplies are available. At least, they are available on AArch64 with the Crypto extensions.
I'm not aware of 64-bit poly multiplies on other ARM architectures, such as ARMv6 or ARMv7 with NEON.
Jeff