nisse@lysator.liu.se (Niels Möller) writes:
Maamoun TK <maamoun.tk@googlemail.com> writes:
+L4x_loop:
[...]
+    C polynomial multiplication "classical" pre-processing
+    xxmrghd VSR(C23h),VSR(C2),VSR(C3)
+    xxmrgld VSR(C23l),VSR(C2),VSR(C3)
+    xxmrghd VSR(C01h),VSR(C0),VSR(C1)
+    xxmrgld VSR(C01l),VSR(C0),VSR(C1)
+
+    C polynomial multiplication "classical"
+    vpmsumd C3,C3,H1         C M3 = H^1l*C3h⊕H^1h*C3l
+    vpmsumd C2,C2,H2         C M2 = H^2l*C2h⊕H^2h*C2l
+    vpmsumd C1,C1,H3         C M1 = H^3l*C1h⊕H^3h*C1l
+    vpmsumd C0,C0,H4         C M0 = H^4l*C0h⊕H^4h*C0l
+    vpmsumd C23h,C23h,H21h   C H23 = H^2h*C2h⊕H^1h*C3h
+    vpmsumd C23l,C23l,H21l   C L23 = H^2l*C2l⊕H^1l*C3l
+    vpmsumd C01h,C01h,H43h   C H01 = H^4h*C0h⊕H^3h*C1h
+    vpmsumd C01l,C01l,H43l   C L01 = H^4l*C0l⊕H^3l*C1l
So you do 4 blocks with only 10 vpmsumd instructions (the 8 above, and 2 more below for the reduction). That's nice! It would be even nicer if the H terms could be rearranged to eliminate the pre-processing of the C values.
I think I've found a slightly better way to organize it. I'm assuming bit reversal and that we have pre-multiplied the key factor by x, so that we need to compute
A(x) B(x) / x^128 (mod P(x))
where P(x) is the reversed polynomial P(x) = x^128 + x^127 + x^126 + x^121 + 1. Denote the 64-bit pieces as
A(x) = a_1(x) x^64 + a_0(x), B(x) = b_1(x) x^64 + b_0(x)
We do precomputations on B (the key, invariant when processing a message of multiple blocks).
First, compute b_0(x) / x^64 (mod P(x)), which expands it from 64 bits to 128,
c_1(x) x^64 + c_0(x) = b_0(x) / x^64 (mod P(x))
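To make that concrete, here's a throw-away Python model of the precomputation (the names are mine, not from any real code; an integer's bit i is the coefficient of x^i):

  # Reversed polynomial P(x) = x^128 + x^127 + x^126 + x^121 + 1.
  P = (1 << 128) | (1 << 127) | (1 << 126) | (1 << 121) | 1

  def div_x(v):
      # v / x (mod P), for deg v < 128. If the constant term is set,
      # add P (whose constant term is 1) to cancel it; the sum is then
      # exactly divisible by x.
      if v & 1:
          v ^= P
      return v >> 1

  def precompute_c(b0):
      # c_1 x^64 + c_0 = b_0 / x^64 (mod P).
      c = b0
      for _ in range(64):
          c = div_x(c)
      return c >> 64, c & ((1 << 64) - 1)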
Next, form the sum d_1 = b_1 + c_0. We can then write (to simplify notation, everything below is (mod P(x)), + means xor, and the (x) arguments are omitted):
  A B / x^128 = a_1 b_1 + (a_0 b_1 + a_1 b_0) / x^64 + a_0 b_0 / x^128
              = a_1 b_1 + (a_0 b_1 + a_1 b_0) / x^64 + a_0 (c_1 x^64 + c_0) / x^64
              = a_1 b_1 + a_0 c_1 + (a_0 b_1 + a_1 b_0 + a_0 c_0) / x^64
              = a_1 b_1 + a_0 c_1 + (a_0 d_1 + a_1 b_0) / x^64
                `------R--------'   `-------F--------'
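I sanity-checked that identity with the same kind of throw-away Python model (purely illustrative, all names ad hoc):

  import random

  P = (1 << 128) | (1 << 127) | (1 << 126) | (1 << 121) | 1
  M64 = (1 << 64) - 1

  def clmul(a, b):
      # Carry-less product in GF(2)[x].
      r = 0
      while b:
          if b & 1:
              r ^= a
          a <<= 1
          b >>= 1
      return r

  def pmod(v):
      # Reduce v mod P.
      while v.bit_length() > 128:
          v ^= P << (v.bit_length() - 129)
      return v

  def div_x64(v):
      # v / x^64 (mod P), for deg v < 128.
      for _ in range(64):
          if v & 1:
              v ^= P
          v >>= 1
      return v

  for _ in range(1000):
      A, B = random.getrandbits(128), random.getrandbits(128)
      a1, a0 = A >> 64, A & M64
      b1, b0 = B >> 64, B & M64
      c = div_x64(b0)
      c1, c0 = c >> 64, c & M64
      d1 = b1 ^ c0
      R = clmul(a1, b1) ^ clmul(a0, c1)
      F = clmul(a0, d1) ^ clmul(a1, b0)
      # A B / x^128 == R + F / x^64  (mod P)
      assert div_x64(div_x64(pmod(clmul(A, B)))) == pmod(R ^ div_x64(F))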
So if we have the input in register A (loaded from memory with no processing besides ensuring proper *byte* order), and precompute two values, M representing b_1(x) x^64 + c_1(x), and L representing b_0(x) x^64 + d_1(x), then we get the two halves above with two vpmsumd,
  vpmsumd R, M, A
  vpmsumd F, L, A
When doing more than one block at a time, I think it's easiest to accumulate the R and F values separately.
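Modelled in Python the same way (bit i of an integer is the coefficient of x^i, so the high doubleword is the top 64 bits; register naming and the byte-order details of real hardware are glossed over), the per-block work and a four-block accumulation would look like:

  M64 = (1 << 64) - 1

  def clmul(a, b):
      # Carry-less 64x64 -> 127-bit product in GF(2)[x].
      r = 0
      while b:
          if b & 1:
              r ^= a
          a <<= 1
          b >>= 1
      return r

  def vpmsumd(x, y):
      # Model of vpmsumd: xor of the two doubleword carry-less products.
      return clmul(x >> 64, y >> 64) ^ clmul(x & M64, y & M64)

  def ghash_4blocks(blocks, key_pairs):
      # key_pairs[i] is the (M, L) pair precomputed from H^(4-i);
      # blocks[0] is the oldest block, with the running digest already
      # xored in. Note: no pre-processing of the loaded blocks.
      R = F = 0
      for A, (M, L) in zip(blocks, key_pairs):
          R ^= vpmsumd(M, A)    # accumulate the a_1 b_1 + a_0 c_1 half
          F ^= vpmsumd(L, A)    # accumulate the a_0 d_1 + a_1 b_0 half
      return R, F

That's the same eight vpmsumd for four blocks as in the quoted loop, but with no xxmrghd/xxmrgld pre-processing of the C values.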
After accumulation, we have the pieces
R = r_1 x^64 + r_0, F = f_1 x^64 + f_0
To do the division by x^64 (mod P) and reduce to a 128-bit result, we need to cancel f_0. We do that by multiplying f_0 by P and adding the product in:
(f_1 x^64 + f_0 + f_0 P) / x^64 = f_1 + f_0 (x^64 + x^63 + x^62 + x^57)
If we prepare a register G representing the constant polynomial x^63 + x^62 + x^57 (i.e., with the high doubleword all zero), we can do that as
  vpmsumd T, F, G
and then to get the final result, we need to xor together the three values
  r_1 x^64 + r_0
  f_0 x^64 + f_1   (F swapped)
  t_1 x^64 + t_0
It may be worth noting that both r_1 and t_1 are only 63 bits, so it's only the contribution of f_0 in the high half position that increases the size of the result from 127 bits to 128.
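In the same integer model, the complete reduction, together with a mechanical check that it agrees with dividing F by x^64 directly, looks like this (illustrative only):

  import random

  P = (1 << 128) | (1 << 127) | (1 << 126) | (1 << 121) | 1
  M64 = (1 << 64) - 1
  G = (1 << 63) | (1 << 62) | (1 << 57)   # x^63 + x^62 + x^57

  def clmul(a, b):
      r = 0
      while b:
          if b & 1:
              r ^= a
          a <<= 1
          b >>= 1
      return r

  def final_reduce(R, F):
      # vpmsumd T, F, G: the high-doubleword product vanishes since
      # G's high half is zero, leaving t_1 x^64 + t_0 = f_0 G.
      T = clmul(F & M64, G)
      F_swapped = ((F & M64) << 64) | (F >> 64)   # f_0 x^64 + f_1
      return R ^ F_swapped ^ T

  def div_x64(v):
      # v / x^64 (mod P), the defining computation.
      for _ in range(64):
          if v & 1:
              v ^= P
          v >>= 1
      return v

  for _ in range(1000):
      R, F = random.getrandbits(127), random.getrandbits(128)
      assert final_reduce(R, F) == R ^ div_x64(F)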
There are some possible variations of the final reduction, e.g., one could use G' = x^63 + x^62 + x^61 + x^56 (i.e., one more term), but then left shift the result of f_0 G'. With some different and really clever handling of the extra x factor (here assumed premultiplied into b), maybe one could use a single vpmsumd to multiply f_0 G' and add in f_1 at the right place before shifting.
So to sum up, with this method, there's no preprocessing of the input, two vpmsumd per block to produce the values to be accumulated (same as in your patch), and a single vpmsumd (one less than in your patch) for the final reduction.
Furthermore, there's actually no need to do the final reduction until the end of the message. If we can keep the accumulated R and F values separately in the gcm_ctx struct (an ABI change, though), the final reduction can be postponed until gcm_digest is called.
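As a rough sketch of that ABI change (Python model again, reusing vpmsumd and final_reduce from the sketches above; this context is hypothetical, not the real gcm_ctx, and for brevity it covers only a single run of at most four blocks, eliding the per-stride feedback needed for longer messages):

  class UnreducedGcmCtx:
      # Hypothetical context storing the unreduced R and F accumulators
      # instead of a reduced digest.
      def __init__(self, key_pairs):
          # key_pairs = [(M, L) precomputed for H^4, H^3, H^2, H^1]
          self.key_pairs = key_pairs
          self.r = self.f = 0

      def update(self, blocks):
          # Two vpmsumd per block, no reduction here.
          for A, (M, L) in zip(blocks, self.key_pairs[-len(blocks):]):
              self.r ^= vpmsumd(M, A)
              self.f ^= vpmsumd(L, A)

      def digest(self):
          # The single reduction, postponed until now.
          return final_reduce(self.r, self.f)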
Regards,
/Niels