nisse@lysator.liu.se (Niels Möller) writes:
Maamoun TK <maamoun.tk@googlemail.com> writes:
+L4x_loop:
[...]
+    C polynomial multiplication "classical" pre-processing
+    xxmrghd VSR(C23h),VSR(C2),VSR(C3)
+    xxmrgld VSR(C23l),VSR(C2),VSR(C3)
+    xxmrghd VSR(C01h),VSR(C0),VSR(C1)
+    xxmrgld VSR(C01l),VSR(C0),VSR(C1)
+
+    C polynomial multiplication "classical"
+    vpmsumd C3,C3,H1         C M3 = H^1l*C3h⊕H^1h*C3l
+    vpmsumd C2,C2,H2         C M2 = H^2l*C2h⊕H^2h*C2l
+    vpmsumd C1,C1,H3         C M1 = H^3l*C1h⊕H^3h*C1l
+    vpmsumd C0,C0,H4         C M0 = H^4l*C0h⊕H^4h*C0l
+    vpmsumd C23h,C23h,H21h   C H23 = H^2h*C2h⊕H^1h*C3h
+    vpmsumd C23l,C23l,H21l   C L23 = H^2l*C2l⊕H^1l*C3l
+    vpmsumd C01h,C01h,H43h   C H01 = H^4h*C0h⊕H^3h*C1h
+    vpmsumd C01l,C01l,H43l   C L01 = H^4l*C0l⊕H^3l*C1l
So you do 4 blocks with only 10 vpmsumd instructions (the 8 above, and 2 more below for the reduction). That's nice! It would be even nicer if the H terms could be rearranged to eliminate the pre-processing of the C values.
I think I've found a slightly better way to organize it. I'm assuming bit reversal and that we have pre-multiplied the key factor by x, so that we need to compute
A(x) B(x) / x^128 (mod P(x))
where P(x) is the reversed polynomial P(x) = x^128 + x^127 + x^126 + x^121 + 1. Denote the 64-bit pieces as
A(x) = a_1(x) x^64 + a_0(x), B(x) = b_1(x) x^64 + b_0(x)
We do precomputations on B (the key, invariant when processing a message of multiple blocks).
First, compute b_0(x) / x^64 (mod P(x)), which expands it from 64 bits to 128,
c_1(x) x^64 + c_0(x) = b_0(x) / x^64 (mod P(x))
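To make that concrete, here's a throw-away Python model of the precomputation (the names are mine, not from any real code; an integer's bit i is the coefficient of x^i):

  # Reversed polynomial P(x) = x^128 + x^127 + x^126 + x^121 + 1.
  P = (1 << 128) | (1 << 127) | (1 << 126) | (1 << 121) | 1

  def div_x(v):
      # v / x (mod P), for deg v < 128. If the constant term is set,
      # add P (whose constant term is 1) to cancel it; the sum is then
      # exactly divisible by x.
      if v & 1:
          v ^= P
      return v >> 1

  def precompute_c(b0):
      # c_1 x^64 + c_0 = b_0 / x^64 (mod P).
      c = b0
      for _ in range(64):
          c = div_x(c)
      return c >> 64, c & ((1 << 64) - 1)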
Next, form the sum d_1 = b_1 + c_0. We can then write (to simplify notation, everything below is (mod P(x)), + means xor, and the (x) arguments are omitted):
  A B / x^128 = a_1 b_1 + (a_0 b_1 + a_1 b_0) / x^64 + a_0 b_0 / x^128
              = a_1 b_1 + (a_0 b_1 + a_1 b_0) / x^64 + a_0 (c_1 x^64 + c_0) / x^64
              = a_1 b_1 + a_0 c_1 + (a_0 b_1 + a_1 b_0 + a_0 c_0) / x^64
              = a_1 b_1 + a_0 c_1 + (a_0 d_1 + a_1 b_0) / x^64
                `------R--------'   `-------F--------'
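I sanity-checked that identity with the same kind of throw-away Python model (purely illustrative, all names ad hoc):

  import random

  P = (1 << 128) | (1 << 127) | (1 << 126) | (1 << 121) | 1
  M64 = (1 << 64) - 1

  def clmul(a, b):
      # Carry-less product in GF(2)[x].
      r = 0
      while b:
          if b & 1:
              r ^= a
          a <<= 1
          b >>= 1
      return r

  def pmod(v):
      # Reduce v mod P.
      while v.bit_length() > 128:
          v ^= P << (v.bit_length() - 129)
      return v

  def div_x64(v):
      # v / x^64 (mod P), for deg v < 128.
      for _ in range(64):
          if v & 1:
              v ^= P
          v >>= 1
      return v

  for _ in range(1000):
      A, B = random.getrandbits(128), random.getrandbits(128)
      a1, a0 = A >> 64, A & M64
      b1, b0 = B >> 64, B & M64
      c = div_x64(b0)
      c1, c0 = c >> 64, c & M64
      d1 = b1 ^ c0
      R = clmul(a1, b1) ^ clmul(a0, c1)
      F = clmul(a0, d1) ^ clmul(a1, b0)
      # A B / x^128 == R + F / x^64  (mod P)
      assert div_x64(div_x64(pmod(clmul(A, B)))) == pmod(R ^ div_x64(F))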
So if we have the input in register A (loaded from memory with no processing besides ensuring proper *byte* order), and precompute two values, M representing b_1(x) x^64 + c_1(x), and L representing b_0(x) x^64 + d_1(x), then we get the two halves above with two vpmsumd,
  vpmsumd R, M, A
  vpmsumd F, L, A
When doing more than one block at a time, I think it's easiest to accumulate the R and F values separately.
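Modelled in Python the same way (bit i of an integer is the coefficient of x^i, so the high doubleword is the top 64 bits; register naming and the byte-order details of real hardware are glossed over), the per-block work and a four-block accumulation would look like:

  M64 = (1 << 64) - 1

  def clmul(a, b):
      # Carry-less 64x64 -> 127-bit product in GF(2)[x].
      r = 0
      while b:
          if b & 1:
              r ^= a
          a <<= 1
          b >>= 1
      return r

  def vpmsumd(x, y):
      # Model of vpmsumd: xor of the two doubleword carry-less products.
      return clmul(x >> 64, y >> 64) ^ clmul(x & M64, y & M64)

  def ghash_4blocks(blocks, key_pairs):
      # key_pairs[i] is the (M, L) pair precomputed from H^(4-i);
      # blocks[0] is the oldest block, with the running digest already
      # xored in. Note: no pre-processing of the loaded blocks.
      R = F = 0
      for A, (M, L) in zip(blocks, key_pairs):
          R ^= vpmsumd(M, A)    # accumulate the a_1 b_1 + a_0 c_1 half
          F ^= vpmsumd(L, A)    # accumulate the a_0 d_1 + a_1 b_0 half
      return R, F

That's the same eight vpmsumd for four blocks as in the quoted loop, but with no xxmrghd/xxmrgld pre-processing of the C values.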
After accumulation, we have the pieces
R = r_1 x^64 + r_0, F = f_1 x^64 + f_0
To do the division by x^64 (mod P) and reduce to a 128-bit result, we need to cancel f_0. We do that by multiplying f_0 by P and adding the product in:
(f_1 x^64 + f_0 + f_0 P) / x^64 = f_1 + f_0 (x^64 + x^63 + x^62 + x^57)
If we prepare a register G representing the constant polynomial x^63 + x^62 + x^57 (i.e., with the high doubleword all zero), we can do that as
  vpmsumd T, F, G
and then to get the final result, we need to xor together the three values
  r_1 x^64 + r_0
  f_0 x^64 + f_1   (F swapped)
  t_1 x^64 + t_0
It may be worth noting that both r_1 and t_1 are only 63 bits, so it's only the contribution of f_0 in the high half position that increases the size of the result from 127 bits to 128.
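In the same integer model, the complete reduction, together with a mechanical check that it agrees with dividing F by x^64 directly, looks like this (illustrative only):

  import random

  P = (1 << 128) | (1 << 127) | (1 << 126) | (1 << 121) | 1
  M64 = (1 << 64) - 1
  G = (1 << 63) | (1 << 62) | (1 << 57)   # x^63 + x^62 + x^57

  def clmul(a, b):
      r = 0
      while b:
          if b & 1:
              r ^= a
          a <<= 1
          b >>= 1
      return r

  def final_reduce(R, F):
      # vpmsumd T, F, G: the high-doubleword product vanishes since
      # G's high half is zero, leaving t_1 x^64 + t_0 = f_0 G.
      T = clmul(F & M64, G)
      F_swapped = ((F & M64) << 64) | (F >> 64)   # f_0 x^64 + f_1
      return R ^ F_swapped ^ T

  def div_x64(v):
      # v / x^64 (mod P), the defining computation.
      for _ in range(64):
          if v & 1:
              v ^= P
          v >>= 1
      return v

  for _ in range(1000):
      R, F = random.getrandbits(127), random.getrandbits(128)
      assert final_reduce(R, F) == R ^ div_x64(F)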
There are some possible variations of the final reduction, e.g., one could use G' = x^63 + x^62 + x^61 + x^56 (i.e., one more term), but then left shift the result of f_0 G'. With some different and really clever handling of the extra x factor (here assumed premultiplied into b), maybe one could use a single vpmsumd to multiply f_0 G' and add in f_1 at the right place before shifting.
So to sum up, with this method, there's no preprocessing of the input, two vpmsumd per block to produce the values to be accumulated (same as in your patch), and a single vpmsumd (one less than in your patch) for the final reduction.
Furthermore, there's actually no need to do the final reduction until the end of the message. If we can keep the accumulated R and F values separately in the gcm_ctx struct (an ABI change, though), the final reduction can be postponed until gcm_digest is called.
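As a rough sketch of that ABI change (Python model again, reusing vpmsumd and final_reduce from the sketches above; this context is hypothetical, not the real gcm_ctx, and for brevity it covers only a single run of at most four blocks, eliding the per-stride feedback needed for longer messages):

  class UnreducedGcmCtx:
      # Hypothetical context storing the unreduced R and F accumulators
      # instead of a reduced digest.
      def __init__(self, key_pairs):
          # key_pairs = [(M, L) precomputed for H^4, H^3, H^2, H^1]
          self.key_pairs = key_pairs
          self.r = self.f = 0

      def update(self, blocks):
          # Two vpmsumd per block, no reduction here.
          for A, (M, L) in zip(blocks, self.key_pairs[-len(blocks):]):
              self.r ^= vpmsumd(M, A)
              self.f ^= vpmsumd(L, A)

      def digest(self):
          # The single reduction, postponed until now.
          return final_reduce(self.r, self.f)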
Regards,
/Niels