Maamoun TK maamoun.tk@googlemail.com writes:
+L4x_loop:
[...]
- C polynomial multiplication "classical" pre-processing
- xxmrghd VSR(C23h),VSR(C2),VSR(C3)
- xxmrgld VSR(C23l),VSR(C2),VSR(C3)
- xxmrghd VSR(C01h),VSR(C0),VSR(C1)
- xxmrgld VSR(C01l),VSR(C0),VSR(C1)
- C polynomial multiplication "classical"
- vpmsumd C3,C3,H1 C M3 = H^1l*C3h⊕H^1h*C3l
- vpmsumd C2,C2,H2 C M2 = H^2l*C2h⊕H^2h*C2l
- vpmsumd C1,C1,H3 C M1 = H^3l*C1h⊕H^3h*C1l
- vpmsumd C0,C0,H4 C M0 = H^4l*C0h⊕H^4h*C0l
- vpmsumd C23h,C23h,H21h C H23 = H^2h*C2h⊕H^1h*C3h
- vpmsumd C23l,C23l,H21l C L23 = H^2l*C2l⊕H^1l*C3l
- vpmsumd C01h,C01h,H43h C H01 = H^4h*C0h⊕H^4h*C1h
- vpmsumd C01l,C01l,H43l C L01 = H^4l*C0l⊕H^4l*C1l
So you do 4 blocks with only 10 vpmsumd instructions (the 8 above, and 2 more below for the reduction). That's nice! It would be even nicer if the H terms could be rearranged to eliminate the pre-processing of the C values.
And it is all with polynomials bits reversed compared to the spec? I find the spec, https://nvlpubs.nist.gov/nistpubs/Legacy/SP/nistspecialpublication800-38d.pd..., not crystal clear, but as I understand it, the bit that is least significant when interpreted as a plain integer corresponds to the highest power of x, while instructions like vpmsumd use the opposite (and more natural, imo) convention.
This is how I understand bit reversal. We want
c(x) = a(x) b(x) mod p(x) = a(x) b(x) + q(x) p(x)
where a, b and c all are 128 bits, p is 129 bits, and q is 127 bits (chosen so the high 127 bits of the 255 bit product cancel). Reverse as
c'(x) = x^127 c(1/x), similarly for a', b',
q'(x) = x^126 q(1/x), p'(x) = x^128 p(1/x)
Then we get
c'(x) = [a'(x) b'(x) + q'(x) p'(x)] / x^127
i.e., q' now cancels the *low* 127 bits of the product. Which can also be written as
c'(x) = a'(x) b'(x) / x^127 (mod p'(x))
So in this way, bit reversal is a bit similar to montgomery representation for modular arithmetic on integers. And if 128 and corresponding bit boundary is more convenient than 127, one can either shift the product one bit left, or premultiply one of the factors with x mod p'(x).
Regards, /Niels