What's the speedup you get from assembly gcm_fill? I see the C implementation uses memcpy and WRITE_UINT32, and is likely significantly slower than the ctr_fill16 in ctr.c. But it could be improved using portable means. If done well, it should be a very small fraction of the cpu time spent for gcm encryption.
Actually, there is a counter involved along with the writing operations. The idea here is that a single block is 16 bytes wide, which fits in a vector register, so each block needs only one store; combined with the processor's ability to execute e.g. 6 addition operations per cycle, that boosts the function's performance.

I measured the execution time of both the C and altivec implementations on POWER8 for 32,768 blocks (512 KB), repeated 10000 times and compiled with -O3:

gcm_fill_c()       took 0.000073 seconds to execute
gcm_fill_altivec() took 0.000019 seconds to execute

As you can see, the function itself isn't time-consuming at all, so optimizing it on its own is maybe not worth it. But gcm_fill is part of AES_CTR, and what other libraries usually do is optimize AES_CTR as a whole, so I optimized it to stay on the same track.
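For reference, a portable improvement along the lines suggested could look roughly like the sketch below: keep the big-endian 32-bit counter in a register across the loop instead of re-reading and re-writing it per block. The function name and exact interface are illustrative, not Nettle's actual code.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical portable gcm_fill sketch: copy the 16-byte counter block
   to each output block and bump the big-endian 32-bit counter in the
   last four bytes.  Keeping the counter in a local variable avoids a
   read-modify-write of the counter block on every iteration. */
static void
gcm_fill_sketch (uint8_t *ctr, size_t blocks, uint8_t *buffer)
{
  /* Read the big-endian counter once, before the loop. */
  uint32_t c = ((uint32_t) ctr[12] << 24) | ((uint32_t) ctr[13] << 16)
    | ((uint32_t) ctr[14] << 8) | (uint32_t) ctr[15];
  size_t i;

  for (i = 0; i < blocks; i++, buffer += 16)
    {
      memcpy (buffer, ctr, 12);       /* fixed IV part */
      buffer[12] = c >> 24;           /* big-endian counter part */
      buffer[13] = (c >> 16) & 0xff;
      buffer[14] = (c >> 8) & 0xff;
      buffer[15] = c & 0xff;
      c++;
    }
  /* Write the advanced counter back for the next call. */
  ctr[12] = c >> 24;
  ctr[13] = (c >> 16) & 0xff;
  ctr[14] = (c >> 8) & 0xff;
  ctr[15] = c & 0xff;
}
```

A vector implementation does essentially the same thing with one 16-byte store and one vector add per block.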
What table layout is used by the assembly gcm_init_key? I would have expected not many tables to be needed when using the special multiply instructions. And Nettle generally doesn't try very hard to optimize key setup, on the theory that applications that need high performance also do a lot of work with each key. E.g., we use C code for AES key setup even though it could be sped up by assembly using special instructions.
So it would be easier for me to start with a patch for gcm_hash only (possibly with supporting C code for key setup).
It's a little bit more complicated than just the special multiply instructions; let me explain.
From reference [1]:
To compute the GHASH digest of 4 consecutive blocks, we use a method of parallelization as described in Reference [3]. The method can be described by the following equations:

Ciphertext inputs: C0, C1, C2, C3. Digest input/output: Digest.

Digest = (((((((Digest ⊕ C0)*H) ⊕ C1)*H) ⊕ C2)*H) ⊕ C3)*H
       = ((Digest ⊕ C0)*H^4) ⊕ (C1*H^3) ⊕ (C2*H^2) ⊕ (C3*H)
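The equivalence of the two forms follows from distributivity of the field multiplication over XOR. To make that concrete, here is a small sketch that checks both forms agree, using a toy GF(2^8) multiply (the AES polynomial) in place of the real GF(2^128) block multiply; all names here are illustrative, not from the actual patch.

```c
#include <stdint.h>

/* Toy GF(2^8) multiply (polynomial x^8+x^4+x^3+x+1), standing in for
   the GF(2^128) block multiply.  The algebra of the 4-block step is
   the same in both fields. */
static uint8_t
gf_mul (uint8_t a, uint8_t b)
{
  uint8_t r = 0;
  while (b)
    {
      if (b & 1)
        r ^= a;
      b >>= 1;
      a = (uint8_t) ((a << 1) ^ ((a & 0x80) ? 0x1b : 0));
    }
  return r;
}

/* Serial form: (((((((Digest^C0)*H)^C1)*H)^C2)*H)^C3)*H */
static uint8_t
ghash4_serial (uint8_t d, const uint8_t c[4], uint8_t h)
{
  int i;
  for (i = 0; i < 4; i++)
    d = gf_mul ((uint8_t) (d ^ c[i]), h);
  return d;
}

/* Parallel form: ((Digest^C0)*H^4) ^ (C1*H^3) ^ (C2*H^2) ^ (C3*H),
   with hp[i] = H^(i+1) precomputed.  The four products are
   independent, which is what lets the real code keep the multiply
   units busy. */
static uint8_t
ghash4_parallel (uint8_t d, const uint8_t c[4], const uint8_t hp[4])
{
  return (uint8_t) (gf_mul ((uint8_t) (d ^ c[0]), hp[3])
                    ^ gf_mul (c[1], hp[2])
                    ^ gf_mul (c[2], hp[1])
                    ^ gf_mul (c[3], hp[0]));
}
```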
As you can see, there are pre-computed constants here, H, H^2, H^3, H^4, which should be stored in a table to be used in upcoming hash operations. Furthermore, to handle the bit-reflection of the multiplication product, we precompute "H << 1 modulo polynomial g(x)", and then to compute H^2, H^3, H^4 we use the identity

reflected(A) * reflected(H << 1 mod g(x)) = reflected(A*H mod g(x))

So we set H = H << 1 modulo polynomial g(x), then H^2 = H * H, and so on. We can use these values as 128-bit constants and avoid shifting the product. Also, in the init function we have to precompute the inputs to the Karatsuba algorithm, so we adapt the previous constants to be compatible with it.

In summary, the table layout varies according to the techniques used in the different implementations, so it's unreasonable to fill one fixed table using C as a standard key setup. For example, there are several x86 GCM implementations in the OpenSSL library, and the table layouts of the SSE and AVX implementations differ from each other.
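Stripped of the bit-reflection and Karatsuba adaptation, the power-table part of key setup is just repeated multiplication by H. A hedged sketch, reusing the toy GF(2^8) multiply as a stand-in for the 128-bit block multiply (names are hypothetical, not from the patch):

```c
#include <stdint.h>

/* Toy GF(2^8) multiply, as a stand-in for the GF(2^128) one. */
static uint8_t
gf_mul (uint8_t a, uint8_t b)
{
  uint8_t r = 0;
  while (b)
    {
      if (b & 1)
        r ^= a;
      b >>= 1;
      a = (uint8_t) ((a << 1) ^ ((a & 0x80) ? 0x1b : 0));
    }
  return r;
}

/* Hypothetical key-setup sketch: precompute the table H, H^2, H^3, H^4
   needed by the 4-way hash.  (The real code first replaces H by
   H<<1 mod g(x) so the reflected products need no final shift, and
   also stores Karatsuba-adapted forms; both are omitted here.) */
static void
gcm_table_sketch (uint8_t table[4], uint8_t h)
{
  int i;
  table[0] = h;
  for (i = 1; i < 4; i++)
    table[i] = gf_mul (table[i - 1], h);
}
```

This is the part that is hard to share as portable C key setup: the entries, their ordering, and their encoding all depend on which multiplication technique the assembly hash loop uses.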
[1] https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/co...