nisse@lysator.liu.se (Niels Möller) writes:
Maamoun TK maamoun.tk@googlemail.com writes:
I would like to explain more about the 'vpmsumd' instruction. On x86, the 'pclmulqdq' instruction is used for carry-less multiplication; to use 'pclmulqdq', an immediate value is passed as the third operand to specify which doublewords are multiplied. 'vpmsumd', however, does the following operation: (high-order doubleword of the second parameter * high-order doubleword of the third parameter) XOR (low-order doubleword of the second parameter * low-order doubleword of the third parameter)
Interesting! Do you use inputs where one doubleword is zero, making one or the other of the xored values be zero, or is there some clever way to take advantage of the built-in wraparound? I guess one can also do some interesting things with other selected parts of the inputs set to zero, for example, the middle word of one of the operands, or all odd-numbered bits, or...
Ok, let me see if I can get the math right first (and ignore the issues of bit order and reversal). Thanks a lot for the explanation of this curious instruction; I hope I have understood it correctly, because it seems really useful.
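To make sure I have the semantics down, here is a little Python toy model of vpmsumd as described above (clmul64 and vpmsumd are just my names, for illustration):

  def clmul64(a, b):
      # Carry-less (GF(2)[x]) product of two 64-bit operands; up to 127 bits.
      r = 0
      for i in range(64):
          if (b >> i) & 1:
              r ^= a << i
      return r

  def vpmsumd(x, y):
      # x, y are 128-bit values: multiply the high doublewords and the low
      # doublewords carry-lessly, and xor the two 127-bit products.
      mask = 2**64 - 1
      return clmul64(x >> 64, y >> 64) ^ clmul64(x & mask, y & mask)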
We have the polynomial p(x) = x^128 + x^7 + x^2 + x + 1 over Z_2, which implies that
x^128 = x^7 + x^2 + x + 1 (mod p(x))
so let's call this simpler polynomial p'(x) = x^7 + x^2 + x + 1.
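A quick sanity check of that identity, representing polynomials over GF(2) as Python integers (bit i of the integer is the coefficient of x^i):

  P = (1 << 128) | (1 << 7) | (1 << 2) | (1 << 1) | 1    # p(x)

  def poly_mod(a, m):
      # Remainder of a(x) modulo m(x) in GF(2)[x].
      while a.bit_length() >= m.bit_length():
          a ^= m << (a.bit_length() - m.bit_length())
      return a

  assert poly_mod(1 << 128, P) == (1 << 7) | (1 << 2) | (1 << 1) | 1   # p'(x)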
As input we have two polynomials, one depending on the key (i.e., invariant when processing a large message), represented as 128 bits each. Splitting them into 64-bit pieces we can reason about,
A = a_1(x) x^64 + a_0(x), B = b_1(x) x^64 + b_0(x),
with (unreduced) product C(x) = A(x) B(x), split in 64-bit pieces as
C(x) = c_3(x) x^192 + c_2(x) x^128 + c_1(x) x^64 + c_0(x)
(where c_3 actually is only 63 bits). To reduce, we use x^128 = p'(x) (mod p(x)). We'd then need to multiply the high half with p'(x), say
D = (c_3(x) x^64 + c_2(x)) p'(x) = d_2(x) x^128 + d_1(x) x^64 + d_0(x)
where we get the high few bits in d_2(x). So we'd need to compute d_2 p'(x) again. And finally, we get the result C(x) mod p by summing/xoring
  c_1(x) x^64 + c_0(x)
  d_1(x) x^64 + d_0(x)
  d_2(x) p'(x)
  --------------------------
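In the same toy model (clmul64 from the first snippet, poly_mod and P from the second), this straightforward reduction could be sketched as follows; reduce_256 is just my name for it:

  import random

  MASK64 = 2**64 - 1
  PP = (1 << 7) | (1 << 2) | (1 << 1) | 1     # p'(x)

  def reduce_256(c):
      # Reduce a full (up to 255-bit) carry-less product modulo p(x):
      # fold the high 128 bits down with p'(x), then fold the few
      # remaining bits d_2 once more.
      c0, c1 = c & MASK64, (c >> 64) & MASK64
      c2, c3 = (c >> 128) & MASK64, c >> 192
      d = (clmul64(c3, PP) << 64) ^ clmul64(c2, PP)    # (c_3 x^64 + c_2) p'(x)
      d0, d1, d2 = d & MASK64, (d >> 64) & MASK64, d >> 128
      return ((c1 << 64) | c0) ^ ((d1 << 64) | d0) ^ clmul64(d2, PP)

  # Compare with a direct reduction of a random schoolbook product.
  a, b = random.getrandbits(128), random.getrandbits(128)
  c = ((clmul64(a >> 64, b >> 64) << 128)
       ^ ((clmul64(a >> 64, b & MASK64) ^ clmul64(a & MASK64, b >> 64)) << 64)
       ^ clmul64(a & MASK64, b & MASK64))
  assert reduce_256(c) == poly_mod(c, P)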
Now, let's look closer at the initial multiplication, A(x) B(x), which can be expanded as
a_1(x) b_1(x) x^128 + (a_1(x) b_0(x) + a_0(x) b_1(x)) x^64 + a_0(x) b_0(x)
I see two tricks: First, we can precompute a'_2(x) x^64 + a'_1(x) = a_1(x) x^128 (mod p(x)), and replace the first term with
(a'_2(x) x^64 + a'_1(x)) b_1(x)
That should eliminate the annoying high bits d_2 in the reduction, at the cost of one more 64x64 multiply. I would expect that's the same number of operations, but a shallower dependency path. We get a 192-bit partially reduced result, instead of the full 256-bit result. Let's denote this product as
h_2(x) x^128 + h_1(x) x^64 + h_0(x) = (a'_2(x) x^64 + a'_1(x)) b_1(x)
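For concreteness, with the random a above standing in for the key-dependent operand A, the precomputation is (aprime1, aprime2 are just my names for a'_1, a'_2):

  a1 = a >> 64
  aprime = poly_mod(a1 << 128, P)    # a'(x) = a_1(x) x^128 mod p(x), fits in 128 bits
  aprime2, aprime1 = aprime >> 64, aprime & MASK64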
Second, if we also prepare a register with the swapped A, a_0(x) x^64 + a_1(x), then the vpmsumd instruction, if I understand you correctly, can compute the middle term (a_1(x) b_0(x) + a_0(x) b_1(x)) in one operation. So we get down to 3 64x64 multiplies for free, without the reconstruction needed for standard Karatsuba multiplication. So if we compute the middle term like this,
m_1(x) x^64 + m_0(x) = a_1(x) b_0(x) + a_0(x) b_1(x)
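In the toy model, that middle term really does come out of a single vpmsumd (continuing with a, b and the helpers from above):

  a_swapped = ((a & MASK64) << 64) | (a >> 64)    # a_0(x) x^64 + a_1(x)
  m = vpmsumd(a_swapped, b)                       # a_1 b_0 + a_0 b_1 in one operation
  assert m == clmul64(a >> 64, b & MASK64) ^ clmul64(a & MASK64, b >> 64)
  m1, m0 = m >> 64, m & MASK64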
The part we need to fold is m_1(x) + h_2(x), and then we get the fully reduced result by summing
  a_0(x) b_0(x)               (127 bits, one vpmsumd with one input half zero)
  m_0(x) x^64                 (64 bits)
  h_1(x) x^64 + h_0(x)        (128 bits)
  (m_1(x) + h_2(x)) p'(x)     (70 or so bits)
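Putting it together in the toy model (continuing with a, b, aprime1, aprime2, m1, m0 from above), the whole scheme and a check against the direct reduction look like this sketch:

  b1, b0 = b >> 64, b & MASK64

  # (a'_2 x^64 + a'_1) b_1: two vpmsumd, each with one input half zero.
  h = (clmul64(aprime2, b1) << 64) ^ clmul64(aprime1, b1)
  h2, h1, h0 = h >> 128, (h >> 64) & MASK64, h & MASK64

  lo = clmul64(a & MASK64, b0)     # a_0 b_0: one vpmsumd with one input half zero
  fold = clmul64(m1 ^ h2, PP)      # (m_1 + h_2) p'(x): the one explicit p' multiply

  result = lo ^ (m0 << 64) ^ ((h1 << 64) | h0) ^ fold

  # Same answer as reducing the full 255-bit product directly.
  full = ((clmul64(a >> 64, b1) << 128)
          ^ ((clmul64(a >> 64, b0) ^ clmul64(a & MASK64, b1)) << 64)
          ^ lo)
  assert result == poly_mod(full, P)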
If I count it correctly, we get the fully reduced result with 5 vpmsumd multiplications: one for a_0 b_0, one for the middle terms, two for the multiplication with a', and one for the only explicit multiplication with p'(x). And if we do multiple blocks in parallel, they can share this final multiply.
There are surely further tricks. E.g., it seems the a_0(x) b_0(x) multiply can share a vpmsumd instruction with the a'_1(x) b_1(x) multiply, if we just prepare a register with a'_1(x) x^64 + a_0(x); then the replacement of a_1 with the twice as large a' values comes essentially for free. One could also attempt to let operations corresponding to separate input blocks share vpmsumd instructions.
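That sharing also checks out in the toy model: both products belong at the same position in the sum, so one vpmsumd of the prepared register computes their xor (values as above):

  x = (aprime1 << 64) | (a & MASK64)     # a'_1(x) x^64 + a_0(x)
  assert vpmsumd(x, b) == clmul64(aprime1, b >> 64) ^ clmul64(a & MASK64, b & MASK64)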
Regards, /Niels