On Tue, 2021-11-30 at 20:17 +0100, Niels Möller wrote:
Amitay Isaacs amitay@ozlabs.org writes:
+ .file "ecc-secp192r1-modp.asm"
Thanks, I'm looking at this file first (being the simplest, even though the security level of this curve is a bit low for current usage, so performance not of so great importance).
Yes. The main focus was on curves p256 and higher. For completeness sake I added the assembly for p192 and p224 curves.
I'm quite new to powerpc, so I'm refering to the instruction reference, and trying to learn as we go along. It seems addc is addition with carry output (but no carry input), adde is addition with carry input and output, and addze is addition of zero with carry input and output.
+define(`RP', `r4') +define(`XP', `r5')
+define(`T0', `r6') +define(`T1', `r7') +define(`T2', `r8') +define(`T3', `r9') +define(`C1', `r10') +define(`C2', `r11')
As I understand it, we could also use register r3 (unused input argument), but we don't need to, since we have enough free scratch registers.
Generally I avoided using r3 as it is the first input argument, but also serves as the return value of the function. So unless we are running short of registers r3 is left untouched. (Even though there are 32 registers, only some of them can be used without saving them first.) For example, r3 is used in p224 modp implementation.
+ C void ecc_secp192r1_modp (const struct ecc_modulo *m, mp_limb_t *rp) + .text +define(`FUNC_ALIGN', `5') +PROLOGUE(_nettle_ecc_secp192r1_modp) + ld T0, 0(XP) + ld T1, 8(XP) + ld T2, 16(XP)
+ li C1, 0 + li C2, 0
+ ld T3, 24(XP) + addc T0, T3, T0 + adde T1, T3, T1 + addze T2, T2 + addze C1, C1
+ ld T3, 32(XP) + addc T1, T3, T1 + adde T2, T3, T2 + addze C1, C1
+ ld T3, 40(XP) + addc T0, T3, T0 + adde T1, T3, T1 + adde T2, T3, T2 + addze C1, C1
To analyze what we are doing, I'm using the Nettle and GMP convention that B = 2^64 (bignum base), then p = B^3 - B - 1, or B^3 = B + 1 (mod p). Denote the six input words as
<a_5,a_4,a_3,a_2,a_1,a_0>
representing the number
B^5 a_5 + B^4 a_4 + B^3 a_3 + B^2 a_2 + B a_1 + a_0
The accumulation above, as I understand it, computes
<c_1,t_2,t_1,t_0> = <a_2,a_1,a_0> + a_3 (B+1) + a_4 (B^2 + B) + a_5 (B^2 + B + 1>
or more graphically,
a_2 a_2 a_1 a_3 a_3 a_4 a_4 + a_5 a_5 a_5 --------------- c_1 t_2 t_1 t_0
This number is < 3 B^3, which means that c_1 is 0, 1 or 2 (each of the addze instructions can increment it).
This looks nice, and I think it is pretty efficient too. It looks a bit different from what the x86_64 code is doing; maybe the latter could be improved.
+ addc T0, C1, T0 + adde T1, C1, T1 + addze T2, T2 + addze C2, C2
Above, c_1 is folded in at the right places,
<c_2,t_2,t_1,t_0> <-- <t_2, t_1, t_0> + c_1 (B + 1)
This number is < B^3 + 3 (B+1). This implies that in the (quite unlikely) case we get carry out, i.e., c_2 = 1, then the value of the low three words is < 3 (B+1). That means that there can be no new carry out when folding c_2.
+ li C1, 0 + addc T0, C2, T0 + adde T1, C2, T1 + addze T2, T2 + addze C1, C1
+ addc T0, C1, T0 + adde T1, C1, T1 + addze T2, T2
So I think this final folding could be reduced to just
addc T0, C2, T0 adde T1, C2, T1 addze T2, T2
There's no carry out, from this, because either C2 was zero, or T2 was small, <= 3. Does that make sense?
This was the first code I wrote using the exact calculation you have outlined above. However, I was not sure about the sizes of the carry (C1 and C2). I did notice the x86 code short-cutting the C2 folding, but the reasoning was not apparent.
Thank you for explaining the bounds calculation. That does help me understand why C2 folding can be simplified as you have suggested.
+ std T0, 0(RP) + std T1, 8(RP) + std T2, 16(RP)
+ blr +EPILOGUE(_nettle_ecc_secp192r1_modp)
Regards, /Niels
Amitay.