GCM with ARM Neon (was: Re: [PATCH] "PowerPC64" GCM support)

11 Oct 2020


      nisse@lysator.liu.se (Niels Möller) writes:
...
So if we have the input in register A (loaded from memory with no
processing besides ensuring proper *byte* order), and precompute two
values, M representing b_1(x) x^64 + c_1(x), and L representing b_0(x)
x^64 + d_1(x), then we get the two halves above with two vpmsumd,

  vpmsumd R, M, A
  vpmsumd F, L, A

When doing more than one block at a time, I think it's easiest to
accumulate the R and F values separately.
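
To make the two-instruction scheme concrete, here is a small Python model
of vpmsumd on 128-bit integers, with R and F accumulated separately over
several blocks. It is only a sketch: the helper names (clmul, vpmsumd,
accumulate) are made up for illustration, and giving each block its own
precomputed (M, L) pair is an assumption, not something stated above.

```python
MASK64 = (1 << 64) - 1

def clmul(a, b):
    """Carryless (GF(2)[x]) product of two non-negative integers."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def vpmsumd(x, y):
    """Model of vpmsumd: carryless-multiply the corresponding 64-bit
    halves of two 128-bit values and xor the two 128-bit products."""
    return clmul(x >> 64, y >> 64) ^ clmul(x & MASK64, y & MASK64)

def accumulate(blocks):
    """blocks: iterable of (M, L, A) tuples, one per input block.
    Returns (R, F), each accumulated by xor across all blocks."""
    R = F = 0
    for M, L, A in blocks:
        R ^= vpmsumd(M, A)   # vpmsumd R, M, A
        F ^= vpmsumd(L, A)   # vpmsumd F, L, A
    return R, F
```

Since xor is the addition of the polynomial ring, the per-block products
can simply be xored together, and any final combination of R and F can be
deferred until all blocks are processed.
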

BTW, I wonder if a similar organization would make sense for ARM Neon.
Now, Neon doesn't have vpmsumd; the widest carryless multiplication
available is vmull.p8, which is an 8-bit to 15-bit multiply, 8 in
parallel.
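
For reference, the behaviour of vmull.p8 can be modelled in a few lines of
Python. This is an illustrative sketch only; the function names clmul8 and
vmull_p8 are made up here, and the model ignores register layout entirely.

```python
def clmul8(a, b):
    """Carryless product of two 8-bit values. The degree of the product
    is at most 7 + 7 = 14, so the result fits in 15 bits."""
    r = 0
    for i in range(8):
        if (b >> i) & 1:
            r ^= a << i
    return r

def vmull_p8(xs, ys):
    """Model of vmull.p8: eight independent 8x8 polynomial products,
    each stored in a 16-bit lane."""
    assert len(xs) == len(ys) == 8
    return [clmul8(x, y) for x, y in zip(xs, ys)]
```

The top bit of each 16-bit result lane is always zero, which is why the
multiply is described as 8-bit to 15-bit.
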

I'm sketching an instruction sequence doing the equivalent of two
vpmsumd using 32 vmull.p8, with good parallelism and not too many
instructions to shuffle around data to the right places. Is that a good
idea? To be compared to what the C code does, a loop of 16 iterations,
each doing some table lookup, shift and xoring.

With this large number of multiply instructions, it might pay off to use
Karatsuba, which could reduce it to 24 multiplies (one level) or 18 (two
levels), at the cost of more xors and data movement instructions, and
lots of complexity.
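
One way to arrive at the counts 32, 24 and 18 is sketched below in Python.
It assumes that two vpmsumd amount to four 64x64 carryless products, that
each is built from 8x8 products, and that one vmull.p8 delivers eight of
those. The names kara and vmull_count are hypothetical, and the sketch
counts multiplies only; it says nothing about the extra xors and shuffles.

```python
def clmul8(a, b):
    """Carryless product of two 8-bit values."""
    r = 0
    for i in range(8):
        if (b >> i) & 1:
            r ^= a << i
    return r

def kara(a, b, bits, levels, counter):
    """Carryless product of two `bits`-wide values. The top `levels`
    splits use Karatsuba (3 sub-products), the rest use schoolbook
    (4 sub-products). counter[0] counts the 8x8 products used."""
    if bits == 8:
        counter[0] += 1
        return clmul8(a, b)
    h = bits // 2
    mask = (1 << h) - 1
    a0, a1 = a & mask, a >> h
    b0, b1 = b & mask, b >> h
    lo = kara(a0, b0, h, levels - 1, counter)
    hi = kara(a1, b1, h, levels - 1, counter)
    if levels > 0:
        # Karatsuba: (a0 + a1)(b0 + b1) + lo + hi gives the middle term.
        mid = kara(a0 ^ a1, b0 ^ b1, h, levels - 1, counter) ^ lo ^ hi
    else:
        mid = (kara(a0, b1, h, levels - 1, counter)
               ^ kara(a1, b0, h, levels - 1, counter))
    return lo ^ (mid << h) ^ (hi << bits)

def vmull_count(levels):
    """vmull.p8 count for four 64x64 products, eight lanes each."""
    counter = [0]
    kara(0x0123456789abcdef, 0xfedcba9876543210, 64, levels, counter)
    return 4 * counter[0] // 8
```

Under these assumptions, vmull_count(0), vmull_count(1) and vmull_count(2)
give 32, 24 and 18 respectively.
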

(There has been ARM Neon code for GCM posted to the list earlier, but if I
remember correctly, that code didn't work in bit-reversed representation,
but used a bunch of explicit reversal operations.)

Regards,
/Niels
-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
