Hi,
This series of patches adds powerpc64 assembly for the modp/redc functions for the elliptic curves P192, P224, P256, P384, P521, X25519 and X448. It results in 15-30% performance improvements, as measured on a POWER9 system using hogweed-benchmark.
master
------

 name   size   sign/ms  verify/ms
 ecdsa   192   13.5465     3.8313
 ecdsa   224    8.4079     2.5289
 ecdsa   256    8.5772     2.4691
 ecdsa   384    3.4361     0.9765
 ecdsa   521    2.2918     0.6785
 eddsa   255   15.0672     4.1138
 eddsa   448    4.7115     1.3067
 curve   255   15.5206     6.4745
 curve   448    4.7906     2.0458
this series
-----------

 name   size   sign/ms  verify/ms
 ecdsa   192   17.5450     5.6564
 ecdsa   224   10.8847     3.7482
 ecdsa   256   11.0208     3.5215
 ecdsa   384    5.1133     1.6133
 ecdsa   521    2.8815     0.9348
 eddsa   255   17.4936     4.9027
 eddsa   448    6.0401     1.7075
 curve   255   18.1034     7.6884
 curve   448    6.1756     2.7095
Amitay Isaacs (3):
  ecc: Add powerpc64 assembly for ecc_192_modp
  ecc: Add powerpc64 assembly for ecc_224_modp
  ecc: Add powerpc64 assembly for ecc_256_redc
Martin Schwenke (4):
  ecc: Add powerpc64 assembly for ecc_384_modp
  ecc: Add powerpc64 assembly for ecc_521_modp
  ecc: Add powerpc64 assembly for ecc_25519_modp
  ecc: Add powerpc64 assembly for ecc_448_modp
 powerpc64/ecc-curve25519-modp.asm | 103 ++++++++++++++
 powerpc64/ecc-curve448-modp.asm   | 174 +++++++++++++++++++++++
 powerpc64/ecc-secp192r1-modp.asm  |  93 ++++++++++++
 powerpc64/ecc-secp224r1-modp.asm  | 123 ++++++++++++++++
 powerpc64/ecc-secp256r1-redc.asm  | 144 +++++++++++++++++++
 powerpc64/ecc-secp384r1-modp.asm  | 227 ++++++++++++++++++++++++++
 powerpc64/ecc-secp521r1-modp.asm  | 166 ++++++++++++++++++++
 7 files changed, 1030 insertions(+)
 create mode 100644 powerpc64/ecc-curve25519-modp.asm
 create mode 100644 powerpc64/ecc-curve448-modp.asm
 create mode 100644 powerpc64/ecc-secp192r1-modp.asm
 create mode 100644 powerpc64/ecc-secp224r1-modp.asm
 create mode 100644 powerpc64/ecc-secp256r1-redc.asm
 create mode 100644 powerpc64/ecc-secp384r1-modp.asm
 create mode 100644 powerpc64/ecc-secp521r1-modp.asm
Signed-off-by: Amitay Isaacs <amitay@ozlabs.org>
---
 powerpc64/ecc-secp192r1-modp.asm | 93 ++++++++++++++++++++++++++++++++
 1 file changed, 93 insertions(+)
 create mode 100644 powerpc64/ecc-secp192r1-modp.asm
diff --git a/powerpc64/ecc-secp192r1-modp.asm b/powerpc64/ecc-secp192r1-modp.asm new file mode 100644 index 00000000..97c71a83 --- /dev/null +++ b/powerpc64/ecc-secp192r1-modp.asm @@ -0,0 +1,93 @@ +C powerpc64/ecc-secp192r1-modp.asm + +ifelse(` + Copyright (C) 2021 Amitay Isaacs, IBM Corporation + + This file is part of GNU Nettle. + + GNU Nettle is free software: you can redistribute it and/or + modify it under the terms of either: + + * the GNU Lesser General Public License as published by the Free + Software Foundation; either version 3 of the License, or (at your + option) any later version. + + or + + * the GNU General Public License as published by the Free + Software Foundation; either version 2 of the License, or (at your + option) any later version. + + or both in parallel, as here. + + GNU Nettle is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + General Public License for more details. + + You should have received copies of the GNU General Public License and + the GNU Lesser General Public License along with this program. If + not, see http://www.gnu.org/licenses/. +') + + .file "ecc-secp192r1-modp.asm" + +define(`RP', `r4') +define(`XP', `r5') + +define(`T0', `r6') +define(`T1', `r7') +define(`T2', `r8') +define(`T3', `r9') +define(`C1', `r10') +define(`C2', `r11') + + C void ecc_secp192r1_modp (const struct ecc_modulo *m, mp_limb_t *rp) + .text +define(`FUNC_ALIGN', `5') +PROLOGUE(_nettle_ecc_secp192r1_modp) + ld T0, 0(XP) + ld T1, 8(XP) + ld T2, 16(XP) + + li C1, 0 + li C2, 0 + + ld T3, 24(XP) + addc T0, T3, T0 + adde T1, T3, T1 + addze T2, T2 + addze C1, C1 + + ld T3, 32(XP) + addc T1, T3, T1 + adde T2, T3, T2 + addze C1, C1 + + ld T3, 40(XP) + addc T0, T3, T0 + adde T1, T3, T1 + adde T2, T3, T2 + addze C1, C1 + + addc T0, C1, T0 + adde T1, C1, T1 + addze T2, T2 + addze C2, C2 + + li C1, 0 + addc T0, C2, T0 + adde T1, C2, T1 + addze T2, T2 + addze C1, C1 + + addc T0, C1, T0 + adde T1, C1, T1 + addze T2, T2 + + std T0, 0(RP) + std T1, 8(RP) + std T2, 16(RP) + + blr +EPILOGUE(_nettle_ecc_secp192r1_modp)
Amitay Isaacs amitay@ozlabs.org writes:
+    .file "ecc-secp192r1-modp.asm"
Thanks, I'm looking at this file first (being the simplest, even though the security level of this curve is a bit low for current usage, so performance not of so great importance).
I'm quite new to powerpc, so I'm referring to the instruction reference, and trying to learn as we go along. It seems addc is addition with carry output (but no carry input), adde is addition with carry input and output, and addze is addition of zero with carry input and output.
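For readers following along, here is a small Python model of those three instructions on 64-bit limbs. It is only a sketch of the carry semantics described above (not authoritative ISA documentation), but it matches the way the carry chain is used below.

MASK = (1 << 64) - 1

def addc(a, b):            # add; carry out, no carry in
    s = a + b
    return s & MASK, s >> 64

def adde(a, b, ca):        # add with carry in and carry out
    s = a + b + ca
    return s & MASK, s >> 64

def addze(a, ca):          # add the carry to a single register
    s = a + ca
    return s & MASK, s >> 64

# Propagate a carry through a two-limb value:
lo, ca = addc(MASK, 1)     # lo = 0, ca = 1
hi, ca = adde(5, 7, ca)    # hi = 13, ca = 0
print(hex(lo), hex(hi), ca)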
+define(`RP', `r4') +define(`XP', `r5')
+define(`T0', `r6') +define(`T1', `r7') +define(`T2', `r8') +define(`T3', `r9') +define(`C1', `r10') +define(`C2', `r11')
As I understand it, we could also use register r3 (unused input argument), but we don't need to, since we have enough free scratch registers.
+    C void ecc_secp192r1_modp (const struct ecc_modulo *m, mp_limb_t *rp)
+    .text
+define(`FUNC_ALIGN', `5')
+PROLOGUE(_nettle_ecc_secp192r1_modp)
+    ld T0, 0(XP)
+    ld T1, 8(XP)
+    ld T2, 16(XP)
+
+    li C1, 0
+    li C2, 0
+
+    ld T3, 24(XP)
+    addc T0, T3, T0
+    adde T1, T3, T1
+    addze T2, T2
+    addze C1, C1
+
+    ld T3, 32(XP)
+    addc T1, T3, T1
+    adde T2, T3, T2
+    addze C1, C1
+
+    ld T3, 40(XP)
+    addc T0, T3, T0
+    adde T1, T3, T1
+    adde T2, T3, T2
+    addze C1, C1
To analyze what we are doing, I'm using the Nettle and GMP convention that B = 2^64 (bignum base), then p = B^3 - B - 1, or B^3 = B + 1 (mod p). Denote the six input words as
<a_5,a_4,a_3,a_2,a_1,a_0>
representing the number
B^5 a_5 + B^4 a_4 + B^3 a_3 + B^2 a_2 + B a_1 + a_0
The accumulation above, as I understand it, computes
<c_1,t_2,t_1,t_0> = <a_2,a_1,a_0> + a_3 (B+1) + a_4 (B^2 + B) + a_5 (B^2 + B + 1)
or more graphically,
        a_2 a_1 a_0
            a_3 a_3
        a_4 a_4
      + a_5 a_5 a_5
      -------------
  c_1   t_2 t_1 t_0
This number is < 3 B^3, which means that c_1 is 0, 1 or 2 (each of the addze instructions can increment it).
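As a sanity check of this accumulation and the bound, a small Python sketch mirroring the formulas above (not part of the patch; limb names follow the discussion):

from random import getrandbits

B = 1 << 64
p = B**3 - B - 1                      # secp192r1 prime

for _ in range(1000):
    a = [getrandbits(64) for _ in range(6)]            # a_0 .. a_5
    x = sum(a_i * B**i for i, a_i in enumerate(a))
    # <c_1,t_2,t_1,t_0> = <a_2,a_1,a_0> + a_3 (B+1) + a_4 (B^2+B) + a_5 (B^2+B+1)
    acc = (a[0] + a[1]*B + a[2]*B**2
           + a[3]*(B + 1) + a[4]*(B**2 + B) + a[5]*(B**2 + B + 1))
    assert acc % p == x % p           # same residue mod p
    assert acc < 3 * B**3             # so c_1 = acc >> 192 is 0, 1 or 2
    assert acc >> 192 <= 2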
This looks nice, and I think it is pretty efficient too. It looks a bit different from what the x86_64 code is doing; maybe the latter could be improved.
+    addc T0, C1, T0
+    adde T1, C1, T1
+    addze T2, T2
+    addze C2, C2
Above, c_1 is folded in at the right places,
<c_2,t_2,t_1,t_0> <-- <t_2, t_1, t_0> + c_1 (B + 1)
This number is < B^3 + 3 (B+1). This implies that in the (quite unlikely) case we get carry out, i.e., c_2 = 1, then the value of the low three words is < 3 (B+1). That means that there can be no new carry out when folding c_2.
+    li C1, 0
+    addc T0, C2, T0
+    adde T1, C2, T1
+    addze T2, T2
+    addze C1, C1
+
+    addc T0, C1, T0
+    adde T1, C1, T1
+    addze T2, T2
So I think this final folding could be reduced to just
  addc T0, C2, T0
  adde T1, C2, T1
  addze T2, T2
There's no carry out from this, because either C2 was zero, or T2 was small, <= 3. Does that make sense?
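A randomized Python sketch of the whole reduction with that shortened final folding (it encodes the bound argument above as assertions; the unlikely carry cases are asserted rather than expected to be hit by random inputs):

from random import getrandbits

B = 1 << 64
p = B**3 - B - 1

def fold(t, c):
    return t + c * (B + 1)            # add carry c at limb positions 0 and 1

for _ in range(10000):
    a = [getrandbits(64) for _ in range(6)]
    x = sum(a_i * B**i for i, a_i in enumerate(a))

    acc = (a[0] + a[1]*B + a[2]*B**2
           + a[3]*(B + 1) + a[4]*(B**2 + B) + a[5]*(B**2 + B + 1))
    t, c1 = acc % B**3, acc >> 192    # c1 in {0, 1, 2}

    acc = fold(t, c1)                 # fold c1
    t, c2 = acc % B**3, acc >> 192    # c2 in {0, 1}

    acc = fold(t, c2)                 # the shortened final folding
    t, c3 = acc % B**3, acc >> 192

    assert c3 == 0                    # no carry out of the last fold
    assert t % p == x % p             # correct (possibly non-canonical) residue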
+    std T0, 0(RP)
+    std T1, 8(RP)
+    std T2, 16(RP)
+
+    blr
+EPILOGUE(_nettle_ecc_secp192r1_modp)
Regards, /Niels
On Tue, 2021-11-30 at 20:17 +0100, Niels Möller wrote:
Amitay Isaacs amitay@ozlabs.org writes:
+ .file "ecc-secp192r1-modp.asm"
Thanks, I'm looking at this file first (being the simplest, even though the security level of this curve is a bit low for current usage, so performance not of so great importance).
Yes. The main focus was on curves p256 and higher. For completeness sake I added the assembly for p192 and p224 curves.
I'm quite new to powerpc, so I'm refering to the instruction reference, and trying to learn as we go along. It seems addc is addition with carry output (but no carry input), adde is addition with carry input and output, and addze is addition of zero with carry input and output.
+define(`RP', `r4') +define(`XP', `r5')
+define(`T0', `r6') +define(`T1', `r7') +define(`T2', `r8') +define(`T3', `r9') +define(`C1', `r10') +define(`C2', `r11')
As I understand it, we could also use register r3 (unused input argument), but we don't need to, since we have enough free scratch registers.
Generally I avoided using r3 as it is the first input argument, but also serves as the return value of the function. So unless we are running short of registers r3 is left untouched. (Even though there are 32 registers, only some of them can be used without saving them first.) For example, r3 is used in p224 modp implementation.
+    C void ecc_secp192r1_modp (const struct ecc_modulo *m, mp_limb_t *rp)
+    .text
+define(`FUNC_ALIGN', `5')
+PROLOGUE(_nettle_ecc_secp192r1_modp)
+    ld T0, 0(XP)
+    ld T1, 8(XP)
+    ld T2, 16(XP)
+
+    li C1, 0
+    li C2, 0
+
+    ld T3, 24(XP)
+    addc T0, T3, T0
+    adde T1, T3, T1
+    addze T2, T2
+    addze C1, C1
+
+    ld T3, 32(XP)
+    addc T1, T3, T1
+    adde T2, T3, T2
+    addze C1, C1
+
+    ld T3, 40(XP)
+    addc T0, T3, T0
+    adde T1, T3, T1
+    adde T2, T3, T2
+    addze C1, C1
To analyze what we are doing, I'm using the Nettle and GMP convention that B = 2^64 (bignum base), then p = B^3 - B - 1, or B^3 = B + 1 (mod p). Denote the six input words as
<a_5,a_4,a_3,a_2,a_1,a_0>
representing the number
B^5 a_5 + B^4 a_4 + B^3 a_3 + B^2 a_2 + B a_1 + a_0
The accumulation above, as I understand it, computes
<c_1,t_2,t_1,t_0> = <a_2,a_1,a_0> + a_3 (B+1) + a_4 (B^2 + B) + a_5 (B^2 + B + 1)
or more graphically,
        a_2 a_1 a_0
            a_3 a_3
        a_4 a_4
      + a_5 a_5 a_5
      -------------
  c_1   t_2 t_1 t_0
This number is < 3 B^3, which means that c_1 is 0, 1 or 2 (each of the addze instructions can increment it).
This looks nice, and I think it is pretty efficient too. It looks a bit different from what the x86_64 code is doing; maybe the latter could be improved.
+    addc T0, C1, T0
+    adde T1, C1, T1
+    addze T2, T2
+    addze C2, C2
Above, c_1 is folded in at the right places,
<c_2,t_2,t_1,t_0> <-- <t_2, t_1, t_0> + c_1 (B + 1)
This number is < B^3 + 3 (B+1). This implies that in the (quite unlikely) case we get carry out, i.e., c_2 = 1, then the value of the low three words is < 3 (B+1). That means that there can be no new carry out when folding c_2.
+    li C1, 0
+    addc T0, C2, T0
+    adde T1, C2, T1
+    addze T2, T2
+    addze C1, C1
+
+    addc T0, C1, T0
+    adde T1, C1, T1
+    addze T2, T2
So I think this final folding could be reduced to just
  addc T0, C2, T0
  adde T1, C2, T1
  addze T2, T2
There's no carry out, from this, because either C2 was zero, or T2 was small, <= 3. Does that make sense?
This was the first code I wrote using the exact calculation you have outlined above. However, I was not sure about the sizes of the carry (C1 and C2). I did notice the x86 code short-cutting the C2 folding, but the reasoning was not apparent.
Thank you for explaining the bounds calculation. That does help me understand why C2 folding can be simplified as you have suggested.
+    std T0, 0(RP)
+    std T1, 8(RP)
+    std T2, 16(RP)
+
+    blr
+EPILOGUE(_nettle_ecc_secp192r1_modp)
Regards, /Niels
Amitay.
Here's the updated code for P192 curve after simplifying C2 folding.
Amitay.
Signed-off-by: Amitay Isaacs <amitay@ozlabs.org>
---
 powerpc64/ecc-secp224r1-modp.asm | 123 +++++++++++++++++++++++++++++++
 1 file changed, 123 insertions(+)
 create mode 100644 powerpc64/ecc-secp224r1-modp.asm
diff --git a/powerpc64/ecc-secp224r1-modp.asm b/powerpc64/ecc-secp224r1-modp.asm new file mode 100644 index 00000000..e993376f --- /dev/null +++ b/powerpc64/ecc-secp224r1-modp.asm @@ -0,0 +1,123 @@ +C powerpc64/ecc-secp224r1-modp.asm + +ifelse(` + Copyright (C) 2021 Amitay Isaacs, IBM Corporation + + Based on x86_64/ecc-secp224r1-modp.asm + + This file is part of GNU Nettle. + + GNU Nettle is free software: you can redistribute it and/or + modify it under the terms of either: + + * the GNU Lesser General Public License as published by the Free + Software Foundation; either version 3 of the License, or (at your + option) any later version. + + or + + * the GNU General Public License as published by the Free + Software Foundation; either version 2 of the License, or (at your + option) any later version. + + or both in parallel, as here. + + GNU Nettle is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + General Public License for more details. + + You should have received copies of the GNU General Public License and + the GNU Lesser General Public License along with this program. If + not, see http://www.gnu.org/licenses/. +') + + .file "ecc-secp224r1-modp.asm" + +define(`SP', `r1') + +define(`RP', `r4') +define(`XP', `r5') + +define(`T0', `r6') +define(`T1', `r7') +define(`H0', `r8') +define(`H1', `r9') +define(`H2', `r10') +define(`F0', `r11') +define(`F1', `r12') +define(`F2', `r14') +define(`T2', `r3') + + C void ecc_secp224r1_modp (const struct ecc_modulo *m, mp_limb_t *rp) + .text +define(`FUNC_ALIGN', `5') +PROLOGUE(_nettle_ecc_secp224r1_modp) + std F2, -8(SP) + + ld H0, 48(XP) + ld H1, 56(XP) + C set (F2, F1, F0) <-- (H1, H0) << 32 + sldi F0, H0, 32 + srdi F1, H0, 32 + sldi T0, H1, 32 + srdi F2, H1, 32 + or F1, T0, F1 + + li H2, 0 + ld T0, 16(XP) + ld T1, 24(XP) + subfc T0, F0, T0 + subfe T1, F1, T1 + subfe H0, F2, H0 + addme H1, H1 + + ld T2, 32(XP) + addc H0, T2, H0 + ld T2, 40(XP) + adde H1, T2, H1 + addze H2, H2 + + C Set (F2, F1, F0) <-- (H2, H1, H0) << 32 + sldi F0, H0, 32 + srdi F1, H0, 32 + addc H0, T0, H0 + sldi T0, H1, 32 + srdi F2, H1, 32 + adde H1, T1, H1 + sldi T1, H2, 32 + addze H2, H2 + or F1, T0, F1 + or F2, T1, F2 + + ld T0, 0(XP) + ld T1, 8(XP) + subfc T0, F0, T0 + subfe T1, F1, T1 + subfe H0, F2, H0 + addme H1, H1 + addme H2, H2 + + srdi F0, H1, 32 + sldi F1, H2, 32 + or F0, F1, F0 + clrrdi F1, H1, 32 + mr F2, H2 + clrldi H1, H1, 32 + + subfc T0, F0, T0 + addme F1, F1 + addme F2, F2 + addc T1, F1, T1 + adde H0, F2, H0 + addze H1, H1 + + std T0, 0(RP) + std T1, 8(RP) + std H0, 16(RP) + std H1, 24(RP) + + ld F2, -8(SP) + + blr +EPILOGUE(_nettle_ecc_secp224r1_modp)
Updated version using actual register names for storing and restoring from stack.
Amitay.
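For the secp224r1 patch above, the underlying folding identity can be checked with a few lines of Python. This only checks the arithmetic identity the code is built on, not the register-level 32-bit shift and borrow choreography, and the 448-bit input width is just an illustrative assumption for a double-length operand:

from random import getrandbits

p = 2**224 - 2**96 + 1                # secp224r1 prime

# 2^224 = 2^96 - 1 (mod p), so a double-length value folds as
#   x = hi*2^224 + lo = hi*(2^96 - 1) + lo   (mod p)
for _ in range(1000):
    x = getrandbits(448)
    hi, lo = x >> 224, x % 2**224
    assert (hi * (2**96 - 1) + lo) % p == x % p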
Signed-off-by: Amitay Isaacs <amitay@ozlabs.org>
Signed-off-by: Martin Schwenke <martin@meltin.net>
---
 powerpc64/ecc-secp256r1-redc.asm | 144 +++++++++++++++++++++++++++++++
 1 file changed, 144 insertions(+)
 create mode 100644 powerpc64/ecc-secp256r1-redc.asm
diff --git a/powerpc64/ecc-secp256r1-redc.asm b/powerpc64/ecc-secp256r1-redc.asm new file mode 100644 index 00000000..59447567 --- /dev/null +++ b/powerpc64/ecc-secp256r1-redc.asm @@ -0,0 +1,144 @@ +C powerpc64/ecc-secp256r1-redc.asm + +ifelse(` + Copyright (C) 2021 Amitay Isaacs & Martin Schwenke, IBM Corporation + + Based on x86_64/ecc-secp256r1-redc.asm + + This file is part of GNU Nettle. + + GNU Nettle is free software: you can redistribute it and/or + modify it under the terms of either: + + * the GNU Lesser General Public License as published by the Free + Software Foundation; either version 3 of the License, or (at your + option) any later version. + + or + + * the GNU General Public License as published by the Free + Software Foundation; either version 2 of the License, or (at your + option) any later version. + + or both in parallel, as here. + + GNU Nettle is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + General Public License for more details. + + You should have received copies of the GNU General Public License and + the GNU Lesser General Public License along with this program. If + not, see http://www.gnu.org/licenses/. +') + +C Register usage: + +define(`SP', `r1') + +define(`RP', `r4') +define(`XP', `r5') + +define(`F0', `r3') +define(`F1', `r6') +define(`F2', `r7') +define(`F3', `r8') + +define(`U0', `r9') +define(`U1', `r10') +define(`U2', `r11') +define(`U3', `r12') +define(`U4', `r14') +define(`U5', `r15') +define(`U6', `r16') +define(`U7', `r17') + + .file "ecc-secp256r1-redc.asm" + +C FOLD(x), sets (F3,F2,F1,F0) <-- [(x << 224) - (x << 192) - (x << 96)] >> 64 +define(`FOLD', ` + sldi F2, $1, 32 + srdi F3, $1, 32 + li F0, 0 + li F1, 0 + subfc F0, F2, F0 + subfe F1, F3, F1 + subfe F2, $1, F2 + addme F3, F3 +') + + C void ecc_secp256r1_redc (const struct ecc_modulo *p, mp_limb_t *rp, mp_limb_t *xp) + .text +define(`FUNC_ALIGN', `5') +PROLOGUE(_nettle_ecc_secp256r1_redc) + + std U4,-32(SP) + std U5,-24(SP) + std U6,-16(SP) + std U7,-8(SP) + + ld U0, 0(XP) + ld U1, 8(XP) + ld U2, 16(XP) + ld U3, 24(XP) + ld U4, 32(XP) + ld U5, 40(XP) + ld U6, 48(XP) + ld U7, 56(XP) + + FOLD(U0) + subfc U1, F0, U1 + subfe U2, F1, U2 + subfe U3, F2, U3 + subfe U0, F3, U0 + + FOLD(U1) + subfc U2, F0, U2 + subfe U3, F1, U3 + subfe U4, F2, U4 + subfe U1, F3, U1 + + FOLD(U2) + subfc U3, F0, U3 + subfe U4, F1, U4 + subfe U5, F2, U5 + subfe U2, F3, U2 + + FOLD(U3) + subfc U4, F0, U4 + subfe U5, F1, U5 + subfe U6, F2, U6 + subfe U3, F3, U3 + + addc U0, U4, U0 + adde U1, U5, U1 + adde U2, U6, U2 + adde U3, U7, U3 + + C If carry, we need to add in + C 2^256 - p = <0xfffffffe, 0xff..ff, 0xffffffff00000000, 1> + li F0, 0 + addze F0, F0 + neg F2, F0 + sldi F1, F2, 32 + srdi F3, F2, 32 + li U7, -2 + and F3, F3, U7 + + addc U0, F0, U0 + adde U1, F1, U1 + adde U2, F2, U2 + adde U3, F3, U3 + + std U0, 0(RP) + std U1, 8(RP) + std U2, 16(RP) + std U3, 24(RP) + + ld U4,-32(SP) + ld U5,-24(SP) + ld U6,-16(SP) + ld U7,-8(SP) + + blr +EPILOGUE(_nettle_ecc_secp256r1_redc)
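The REDC operation implemented above can be modelled in a few lines of Python. This sketch checks only the arithmetic (each step adds a multiple of p that cancels the low limb, and 2^256 - p matches the constant named in the FOLD comment), not the exact register and borrow sequence of the assembly:

from random import getrandbits

B = 1 << 64
p = 2**256 - 2**224 + 2**192 + 2**96 - 1     # secp256r1 prime, p = -1 (mod B)

# The quantity the FOLD macro is built around:
assert 2**256 - p == 2**224 - 2**192 - 2**96 + 1

for _ in range(1000):
    x = getrandbits(512)
    u = x
    for _ in range(4):                       # one step per low limb
        u0 = u % B
        u = (u + u0 * p) // B                # low limb cancels since p = -1 (mod B)
    assert u < B**4 + p                      # "< 2^256 + p", fixed up by the final conditional add
    assert (u * B**4) % p == x % p           # u = x * B^-4 (mod p), i.e. REDC's result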
Amitay Isaacs amitay@ozlabs.org writes:
--- /dev/null
+++ b/powerpc64/ecc-secp256r1-redc.asm
@@ -0,0 +1,144 @@
+C powerpc64/ecc-secp256r1-redc.asm
+ifelse(`
+    Copyright (C) 2021 Amitay Isaacs & Martin Schwenke, IBM Corporation
+
+    Based on x86_64/ecc-secp256r1-redc.asm
Looks good, and it seems the method follows the x86_64 version closely. I just checked in a correction and a clarification to the comments of the x86_64 version.
A few comments below.
+C Register usage:
+define(`SP', `r1')
+define(`RP', `r4') +define(`XP', `r5')
+define(`F0', `r3') +define(`F1', `r6') +define(`F2', `r7') +define(`F3', `r8')
+define(`U0', `r9') +define(`U1', `r10') +define(`U2', `r11') +define(`U3', `r12') +define(`U4', `r14') +define(`U5', `r15') +define(`U6', `r16') +define(`U7', `r17')
One could save one register by letting U7 and XP overlap, since XP isn't used after loading U7.
+    .file "ecc-secp256r1-redc.asm"
+C FOLD(x), sets (F3,F2,F1,F0) <-- [(x << 224) - (x << 192) - (x << 96)] >> 64
+define(`FOLD', `
+    sldi F2, $1, 32
+    srdi F3, $1, 32
+    li F0, 0
+    li F1, 0
+    subfc F0, F2, F0
+    subfe F1, F3, F1
I think the
  li F0, 0
  li F1, 0
  subfc F0, F2, F0
  subfe F1, F3, F1
could be replaced with
  subfic F0, F2, 0  C "negate with borrow"
  subfze F1, F3
If that is measurably faster, I can't say.
Another option: Since powerpc, like arm, seems to use the proper two's complement convention that "borrow" is not carry, maybe we don't need to negate to F0 and F1 at all, and instead change the later subtraction, replacing
  subfc U1, F0, U1
  subfe U2, F1, U2
  subfe U3, F2, U3
  subfe U0, F3, U0
with
  addc U1, F0, U1
  adde U2, F1, U2
  subfe U3, F2, U3
  subfe U0, F3, U0
I haven't thought that through, but it does make some sense to me. I think the arm code propagates carry through a mix of add and sub instructions in some places. Maybe F2 needs to be incremented somewhere for this to work, but probably still cheaper. If this works, FOLD would turn into something like
  sldi F0, $1, 32
  srdi F1, $1, 32
  subfc F2, $1, F0
  addme F3, F1
(If you want to investigate this later on, that's fine too, I could merge the code with the current folding logic).
+    C If carry, we need to add in
+    C 2^256 - p = <0xfffffffe, 0xff..ff, 0xffffffff00000000, 1>
+    li F0, 0
+    addze F0, F0
+    neg F2, F0
+    sldi F1, F2, 32
+    srdi F3, F2, 32
+    li U7, -2
+    and F3, F3, U7
I think the three instructions to set F3 could be replaced with
  srdi F3, F2, 31
  sldi F3, F3, 1
Or maybe the and operation is faster than shift?
Regards, /Niels
nisse@lysator.liu.se (Niels Möller) writes:
If this works, FOLD would turn into something like
  sldi F0, $1, 32
  srdi F1, $1, 32
  subfc F2, $1, F0
  addme F3, F1
I'm looking at a different approach (experimenting on ARM64, which is quite similar to powerpc, but I don't yet have working code). To understand what the redc code is doing we need to keep in mind that what one folding step does is to compute
<U4,U3,U2,U1,U0> + U0*p
which cancels the low limb, since p = -1 (mod 2^64). So since the low limb always cancel, what we need is
<U4,U3,U2,U1> + U0*((p+1)/2^64)
The x86_64 code does this by splitting U0*p into 2^{256} U0 - (2^{256} - p) * U0, subtracting in the folding step, and adding in the high part later. But one doesn't have to do it that way. One could instead use a FOLD macro that computes
(2^{192} - 2^{160} + 2^{128} + 2^{32}) U0
I also wonder of there's some way to use carry out from one fold step and apply it at the right place while preparing the F0,F1,F2,F3 for the next step.
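A small Python check of that reformulation (a sketch only; it verifies the constant and that one such folding step preserves the residue while cancelling the low limb):

from random import getrandbits

B = 1 << 64
p = 2**256 - 2**224 + 2**192 + 2**96 - 1

# The multiplier the alternative FOLD would use:
assert (p + 1) // B == 2**192 - 2**160 + 2**128 + 2**32

for _ in range(1000):
    limbs = [getrandbits(64) for _ in range(5)]   # <U4,U3,U2,U1,U0>
    u = sum(l * B**i for i, l in enumerate(limbs))
    u0, v = limbs[0], u >> 64                     # v = <U4,U3,U2,U1>
    assert (u + u0 * p) % B == 0                  # low limb cancels, since p = -1 (mod B)
    folded = v + u0 * ((p + 1) // B)
    assert folded == (u + u0 * p) // B            # one folding step, shifted down one limb
    assert (folded * B) % p == u % p              # residue preserved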
Regards, /Niels
nisse@lysator.liu.se (Niels Möller) writes:
I'm looking at a different approach (experimenting on ARM64, which is quite similar to powerpc, but I don't yet have working code). To understand what the redc code is doing we need to keep in mind that what one folding step does is to compute
<U4,U3,U2,U1,U0> + U0*p
which cancels the low limb, since p = -1 (mod 2^64). So since the low limb always cancel, what we need is
<U4,U3,U2,U1> + U0*((p+1)/2^64)
The x86_64 code does this by splitting U0*p into 2^{256} U0 - (2^{256} - p) * U0, subtracting in the folding step, and adding in the high part later. But one doesn't have to do it that way. One could instead use a FOLD macro that computes
(2^{192} - 2^{160} + 2^{128} + 2^{32}) U0
I also wonder of there's some way to use carry out from one fold step and apply it at the right place while preparing the F0,F1,F2,F3 for the next step.
I've got this working now, attaching the version with early carry folding. Also checked in on the branch arm64-ecc. The preceding commit (5ee0839bb28c092044fce09534651b78640518c4) collects carries and adds them in as a separate pass over the data.
I've tested it only with qemu-aarch64; help with benchmarking on real arm64 hardware appreciated (just add the file in the arm64/ directory and run ./config.status --recheck && ./config.status to have the build pick it up).
I think the approach should apply to other 64-bit archs (should probably work also on x86_64, where it's sometimes tricky to avoid x86_64 instructions clobbering the carry flag when it should be preserved, but probably not so difficult in this case).
C arm64/ecc-secp256r1-redc.asm
ifelse(` Copyright (C) 2013, 2021 Niels Möller
This file is part of GNU Nettle.
GNU Nettle is free software: you can redistribute it and/or modify it under the terms of either:
* the GNU Lesser General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version.
or
* the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
or both in parallel, as here.
GNU Nettle is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received copies of the GNU General Public License and the GNU Lesser General Public License along with this program. If not, see http://www.gnu.org/licenses/. ')
.file "ecc-secp256r1-redc.asm"
define(`RP', `x1') define(`XP', `x2')
define(`U0', `x0') C Overlaps unused modulo input define(`U1', `x3') define(`U2', `x4') define(`U3', `x5') define(`U4', `x6') define(`U5', `x7') define(`U6', `x8') define(`U7', `x9') define(`F0', `x10') define(`F1', `x11') define(`F2', `x12') define(`F3', `x13') define(`ZERO', `x14')
C FOLD(x), sets (F3, F2,F1,F0 ) <-- (x << 192) - (x << 160) + (x << 128) + (x << 32) define(`FOLD', ` lsl F0, $1, #32 lsr F1, $1, #32 subs F2, $1, F0 sbc F3, $1, F1 ')
C FOLDC(x), sets (F3, F2,F1,F0) <-- ((x+c) << 192) - (x << 160) + (x << 128) + (x << 32) define(`FOLDC', ` lsl F0, $1, #32 lsr F1, $1, #32 adc F3, $1, ZERO C May overflow, but final result will not. subs F2, $1, F0 sbc F3, F3, F1 ')
PROLOGUE(_nettle_ecc_secp256r1_redc) ldr U0, [XP] ldr U1, [XP, #8] ldr U2, [XP, #16] ldr U3, [XP, #24] ldr U4, [XP, #32] ldr U5, [XP, #40] ldr U6, [XP, #48] ldr U7, [XP, #56] mov ZERO, #0
FOLD(U0) adds U1, U1, F0 adcs U2, U2, F1 adcs U3, U3, F2 adcs U4, U4, F3
FOLDC(U1) adds U2, U2, F0 adcs U3, U3, F1 adcs U4, U4, F2 adcs U5, U5, F3
FOLDC(U2) adds U3, U3, F0 adcs U4, U4, F1 adcs U5, U5, F2 adcs U6, U6, F3
FOLDC(U3) adds U4, U4, F0 adcs U5, U5, F1 adcs U6, U6, F2 adcs U7, U7, F3
C Sum, including carry, is < 2^{256} + p. C If carry, we need to add in 2^{256} mod p = 2^{256} - p C = <0xfffffffe, 0xff..ff, 0xffffffff00000000, 1> C and this addition can not overflow. adc F0, ZERO, ZERO neg F2, F0 lsl F1, F2, #32 lsr F3, F2, #32 and F3, F3, #-2
adds U0, F0, U4 adcs U1, F1, U5 adcs U2, F2, U6 adc U3, F3, U7
str U0, [RP] str U1, [RP, #8] str U2, [RP, #16] str U3, [RP, #24]
ret EPILOGUE(_nettle_ecc_secp256r1_redc)
Regards, /Niels
nisse@lysator.liu.se (Niels Möller) writes:
I think the approach should apply to other 64-bit archs (should probably work also on x86_64, where it's sometimes tricky to avoid x86_64 instructions clobbering the carry flag when it should be preserved, but probably not so difficult in this case).
x86_64 version below. I also trimmed register usage, so it no longer needs to save and restore any registers. On my machine, this gives a speedup of 17% for ecc_secp256r1_redc in isolation, a 3% speedup for ecdsa sign and a 7% speedup for ecdsa verify.
Regards, /Niels
C x86_64/ecc-secp256r1-redc.asm
ifelse(` Copyright (C) 2013, 2021 Niels Möller
This file is part of GNU Nettle.
GNU Nettle is free software: you can redistribute it and/or modify it under the terms of either:
* the GNU Lesser General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version.
or
* the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
or both in parallel, as here.
GNU Nettle is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received copies of the GNU General Public License and the GNU Lesser General Public License along with this program. If not, see http://www.gnu.org/licenses/. ')
.file "ecc-secp256r1-redc.asm"
define(`RP', `%rsi') define(`XP', `%rdx')
define(`U0', `%rdi') C Overlaps unused modulo input define(`U1', `%rcx') define(`U2', `%rax') define(`U3', `%r8') define(`F0', `%r9') define(`F1', `%r10') define(`F2', `%r11') define(`F3', `%rdx') C Overlap XP, used only in final carry folding
C FOLD(x), sets (x,F2,F1,F0 ) <-- (x << 192) - (x << 160) + (x << 128) + (x << 32) define(`FOLD', ` mov $1, F0 mov $1, F1 mov $1, F2 shl `$'32, F0 shr `$'32, F1 sub F0, F2 sbb F1, $1 ') C FOLDC(x), sets (x,F2,F1,F0) <-- ((x+c) << 192) - (x << 160) + (x << 128) + (x << 32) define(`FOLDC', ` mov $1, F0 mov $1, F1 mov $1, F2 adc `$'0, $1 shl `$'32, F0 shr `$'32, F1 sub F0, F2 sbb F1, $1 ') PROLOGUE(_nettle_ecc_secp256r1_redc) W64_ENTRY(3, 0)
mov (XP), U0 FOLD(U0) mov 8(XP), U1 mov 16(XP), U2 mov 24(XP), U3 add F0, U1 adc F1, U2 adc F2, U3 adc 32(XP), U0
FOLDC(U1) add F0, U2 adc F1, U3 adc F2, U0 adc 40(XP), U1
FOLDC(U2) add F0, U3 adc F1, U0 adc F2, U1 adc 48(XP), U2
FOLDC(U3) add F0, U0 adc F1, U1 adc F2, U2 adc 56(XP), U3
C Sum, including carry, is < 2^{256} + p. C If carry, we need to add in 2^{256} mod p = 2^{256} - p C = <0xfffffffe, 0xff..ff, 0xffffffff00000000, 1> C and this addition can not overflow. sbb F2, F2 mov F2, F0 mov F2, F1 mov XREG(F2), XREG(F3) neg F0 shl $32, F1 and $-2, XREG(F3)
add F0, U0 mov U0, (RP) adc F1, U1 mov U1, 8(RP) adc F2, U2 mov U2, 16(RP) adc F3, U3
mov U3, 24(RP)
W64_EXIT(3, 0) ret EPILOGUE(_nettle_ecc_secp256r1_redc)
Hi Niels,
On Mon, 2021-12-06 at 22:29 +0100, Niels Möller wrote:
nisse@lysator.liu.se (Niels Möller) writes:
I think the approach should apply to other 64-bit archs (should probably work also on x86_64, where it's sometimes tricky to avoid x86_64 instructions clobbering the carry flag when it should be preserved, but probably not so difficult in this case).
x86_64 version below. I could also trimmed register usage, so it no longer needs to save and restore any registers. On my machine, this gives a speedup of 17% for ecc_secp256r1_redc in isolation, 3% speedup for ecdsa sign and 7% speedup of ecdsa verify.
On POWER9, the new code gives ~20% speedup for ecc_secp256r1_redc in isolation, and ~1% speedup for ecdsa sign and verify over the earlier assembly version.
Amitay.
Amitay Isaacs amitay@ozlabs.org writes:
On POWER9, the new code gives ~20% speedup for ecc_secp256r1_redc in isolation, and ~1% speedup for ecdsa sign and verify over the earlier assembly version.
Thanks! Merged to master-updates for ci testing.
I think it should be possible to reduce the number of needed registers, and completely avoid using callee-save registers (load the values now in U4-U7 one at a time, a bit closer to the place where they are needed), and replace F3 with $1 in the FOLD and FOLDC macros.
Regards, /Niels
nisse@lysator.liu.se (Niels Möller) writes:
Thanks! Merged to master-updates for ci testing.
And now merged to the master branch.
I think it should be possible to reduce number of needed registers, and completely avoid using callee-save registers (load the values now in U4-U7 one at a time a bit closer to the place where they are needed in), and replace F3 with $1 in the FOLD and FOLDC macros.
Attaching a variant to do this. Passes tests with qemu, but I haven't benchmarked it on any real hardware.
C powerpc64/ecc-secp256r1-redc.asm
ifelse(` Copyright (C) 2021 Amitay Isaacs & Martin Schwenke, IBM Corporation
Based on x86_64/ecc-secp256r1-redc.asm
This file is part of GNU Nettle.
GNU Nettle is free software: you can redistribute it and/or modify it under the terms of either:
* the GNU Lesser General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version.
or
* the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
or both in parallel, as here.
GNU Nettle is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received copies of the GNU General Public License and the GNU Lesser General Public License along with this program. If not, see http://www.gnu.org/licenses/. ')
C Register usage:
define(`RP', `r4') define(`XP', `r5')
define(`F0', `r3') define(`F1', `r6') define(`F2', `r7') define(`T', `r8')
define(`U0', `r9') define(`U1', `r10') define(`U2', `r11') define(`U3', `r12')
.file "ecc-secp256r1-redc.asm"
C FOLD(x), sets (x,F2,F1,F0) <-- [(x << 192) - (x << 160) + (x << 128) + (x <<32)] define(`FOLD', ` sldi F0, $1, 32 srdi F1, $1, 32 subfc F2, F0, $1 subfe $1, F1, $1 ')
C FOLDC(x), sets (x,F2,F1,F0) <-- [((x+c) << 192) - (x << 160) + (x << 128) + (x <<32)] define(`FOLDC', ` sldi F0, $1, 32 srdi F1, $1, 32 addze T, $1 subfc F2, F0, $1 subfe $1, F1, T ')
C void ecc_secp256r1_redc (const struct ecc_modulo *p, mp_limb_t *rp, mp_limb_t *xp) .text define(`FUNC_ALIGN', `5') PROLOGUE(_nettle_ecc_secp256r1_redc)
ld U0, 0(XP) ld U1, 8(XP) ld U2, 16(XP) ld U3, 24(XP)
FOLD(U0) ld T, 32(XP) addc U1, F0, U1 adde U2, F1, U2 adde U3, F2, U3 adde U0, U0, T
FOLDC(U1) ld T, 40(XP) addc U2, F0, U2 adde U3, F1, U3 adde U0, F2, U0 adde U1, U1, T
FOLDC(U2) ld T, 48(XP) addc U3, F0, U3 adde U0, F1, U0 adde U1, F2, U1 adde U2, U2, T
FOLDC(U3) ld T, 56(XP) addc U0, F0, U0 adde U1, F1, U1 adde U2, F2, U2 adde U3, U3, T
C If carry, we need to add in C 2^256 - p = <0xfffffffe, 0xff..ff, 0xffffffff00000000, 1> li F0, 0 addze F0, F0 neg F2, F0 sldi F1, F2, 32 srdi T, F2, 32 li XP, -2 and T, T, XP
addc U0, F0, U0 adde U1, F1, U1 adde U2, F2, U2 adde U3, T, U3
std U0, 0(RP) std U1, 8(RP) std U2, 16(RP) std U3, 24(RP)
blr EPILOGUE(_nettle_ecc_secp256r1_redc)
Regards, /Niels
nisse@lysator.liu.se (Niels Möller) writes:
nisse@lysator.liu.se (Niels Möller) writes:
I think it should be possible to reduce number of needed registers, and completely avoid using callee-save registers (load the values now in U4-U7 one at a time a bit closer to the place where they are needed in), and replace F3 with $1 in the FOLD and FOLDC macros.
Attaching a variant to do this. Passes tests with qemu, but I haven't benchmarked it on any real hardware.
Would you like to test and benchmark this on relevant real hardware, before I merged this version?
Code still below, and committed to the branch ppc-secp256-tweaks.
Regards, /Niels
C powerpc64/ecc-secp256r1-redc.asm
ifelse(` Copyright (C) 2021 Amitay Isaacs & Martin Schwenke, IBM Corporation
Based on x86_64/ecc-secp256r1-redc.asm
This file is part of GNU Nettle.
GNU Nettle is free software: you can redistribute it and/or modify it under the terms of either:
* the GNU Lesser General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version.
or
* the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
or both in parallel, as here.
GNU Nettle is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received copies of the GNU General Public License and the GNU Lesser General Public License along with this program. If not, see http://www.gnu.org/licenses/. ')
C Register usage:
define(`RP', `r4') define(`XP', `r5')
define(`F0', `r3') define(`F1', `r6') define(`F2', `r7') define(`T', `r8')
define(`U0', `r9') define(`U1', `r10') define(`U2', `r11') define(`U3', `r12')
.file "ecc-secp256r1-redc.asm"
C FOLD(x), sets (x,F2,F1,F0) <-- [(x << 192) - (x << 160) + (x << 128) + (x <<32)] define(`FOLD', ` sldi F0, $1, 32 srdi F1, $1, 32 subfc F2, F0, $1 subfe $1, F1, $1 ')
C FOLDC(x), sets (x,F2,F1,F0) <-- [((x+c) << 192) - (x << 160) + (x << 128) + (x <<32)] define(`FOLDC', ` sldi F0, $1, 32 srdi F1, $1, 32 addze T, $1 subfc F2, F0, $1 subfe $1, F1, T ')
C void ecc_secp256r1_redc (const struct ecc_modulo *p, mp_limb_t *rp, mp_limb_t *xp) .text define(`FUNC_ALIGN', `5') PROLOGUE(_nettle_ecc_secp256r1_redc)
ld U0, 0(XP) ld U1, 8(XP) ld U2, 16(XP) ld U3, 24(XP)
FOLD(U0) ld T, 32(XP) addc U1, F0, U1 adde U2, F1, U2 adde U3, F2, U3 adde U0, U0, T
FOLDC(U1) ld T, 40(XP) addc U2, F0, U2 adde U3, F1, U3 adde U0, F2, U0 adde U1, U1, T
FOLDC(U2) ld T, 48(XP) addc U3, F0, U3 adde U0, F1, U0 adde U1, F2, U1 adde U2, U2, T
FOLDC(U3) ld T, 56(XP) addc U0, F0, U0 adde U1, F1, U1 adde U2, F2, U2 adde U3, U3, T
C If carry, we need to add in C 2^256 - p = <0xfffffffe, 0xff..ff, 0xffffffff00000000, 1> li F0, 0 addze F0, F0 neg F2, F0 sldi F1, F2, 32 srdi T, F2, 32 li XP, -2 and T, T, XP
addc U0, F0, U0 adde U1, F1, U1 adde U2, F2, U2 adde U3, T, U3
std U0, 0(RP) std U1, 8(RP) std U2, 16(RP) std U3, 24(RP)
blr EPILOGUE(_nettle_ecc_secp256r1_redc)
Hi Niels,
On Tue, 2022-01-04 at 20:54 +0100, Niels Möller wrote:
nisse@lysator.liu.se (Niels Möller) writes:
nisse@lysator.liu.se (Niels Möller) writes:
I think it should be possible to reduce number of needed registers, and completely avoid using callee-save registers (load the values now in U4-U7 one at a time a bit closer to the place where they are needed in), and replace F3 with $1 in the FOLD and FOLDC macros.
Attaching a variant to do this. Passes tests with qemu, but I haven't benchmarked it on any real hardware.
Would you like to test and benchmark this on relevant real hardware, before I merged this version?
Code still below, and committed to the branch ppc-secp256-tweaks.
Compared to the current version in master branch, this version definitely improves the performance of the reduction code.
On POWER9, the reduction code shows 7% speed up when tested separately.
The improvement in P256 sign/verify is marginal. Here are the numbers from hogweed-benchmark on POWER9.
name   size  sign/ms  verify/ms
ecdsa   256  11.1013     3.5713   (master)
ecdsa   256  11.1527     3.6011   (this patch)
Amitay.
Amitay Isaacs amitay@ozlabs.org writes:
Compared to the current version in master branch, this version definitely improves the performance of the reduction code.
On POWER9, the reduction code shows 7% speed up when tested separately.
The improvement in P256 sign/verify is marginal. Here are the numbers from hogweed-benchmark on POWER9.
name   size  sign/ms  verify/ms
ecdsa   256  11.1013     3.5713   (master)
ecdsa   256  11.1527     3.6011   (this patch)
Thanks for testing. Committed to the master branch now.
Regards, /Niels
On Wed, 2021-12-01 at 22:58 +0100, Niels Möller wrote:
Amitay Isaacs amitay@ozlabs.org writes:
--- /dev/null
+++ b/powerpc64/ecc-secp256r1-redc.asm
@@ -0,0 +1,144 @@
+C powerpc64/ecc-secp256r1-redc.asm
+ifelse(`
+    Copyright (C) 2021 Amitay Isaacs & Martin Schwenke, IBM Corporation
+
+    Based on x86_64/ecc-secp256r1-redc.asm
Looks good, and it seems method follows the x86_64 version closely. I just checked in a correction and a clarification to the comments to the x86_64 version.
A few comments below.
+C Register usage:
+define(`SP', `r1')
+define(`RP', `r4') +define(`XP', `r5')
+define(`F0', `r3') +define(`F1', `r6') +define(`F2', `r7') +define(`F3', `r8')
+define(`U0', `r9') +define(`U1', `r10') +define(`U2', `r11') +define(`U3', `r12') +define(`U4', `r14') +define(`U5', `r15') +define(`U6', `r16') +define(`U7', `r17')
One could save one register by letting U7 and XP overlap, since XP isn't used after loading U7.
+ .file "ecc-secp256r1-redc.asm"
+C FOLD(x), sets (F3,F2,F1,F0) <-- [(x << 224) - (x << 192) - (x << 96)] >> 64
+define(`FOLD', `
+    sldi F2, $1, 32
+    srdi F3, $1, 32
+    li F0, 0
+    li F1, 0
+    subfc F0, F2, F0
+    subfe F1, F3, F1
I think the
  li F0, 0
  li F1, 0
  subfc F0, F2, F0
  subfe F1, F3, F1
could be replaced with
  subfic F0, F2, 0  C "negate with borrow"
  subfze F1, F3
If that is measurably faster, I can't say.
You are quick to find the exactly fitting instruction. Yes, it definitely does the same job with two fewer instructions and gives about a 1% speedup for the reduction code alone.
Another option: Since powerpc, like arm, seems to use the proper two's complement convention that "borrow" is not carry, maybe we don't need to negate to F0 and F1 at all, and instead change the later subtraction, replacing
  subfc U1, F0, U1
  subfe U2, F1, U2
  subfe U3, F2, U3
  subfe U0, F3, U0
with
  addc U1, F0, U1
  adde U2, F1, U2
  subfe U3, F2, U3
  subfe U0, F3, U0
I haven't thought that through, but it does make some sense to me. I think the arm code propagates carry through a mix of add and sub instructions in a some places. Maybe F2 needs to be incremented somewhere for this to work, but probably still cheaper. If this works, FOLD would turn into something like
  sldi F0, $1, 32
  srdi F1, $1, 32
  subfc F2, $1, F0
  addme F3, F1
(If you want to investigate this later on, that's fine too, I could merge the code with the current folding logic).
+    C If carry, we need to add in
+    C 2^256 - p = <0xfffffffe, 0xff..ff, 0xffffffff00000000, 1>
+    li F0, 0
+    addze F0, F0
+    neg F2, F0
+    sldi F1, F2, 32
+    srdi F3, F2, 32
+    li U7, -2
+    and F3, F3, U7
I think the three instructions to set F3 could be replaced with
  srdi F3, F2, 31
  sldi F3, F3, 1
Or maybe the and operation is faster than shift?
Regards, /Niels
I will continue to investigate the suggestions you have made.
Amitay.
From: Martin Schwenke martin@meltin.net
Signed-off-by: Martin Schwenke <martin@meltin.net>
Signed-off-by: Amitay Isaacs <amitay@ozlabs.org>
Signed-off-by: Alastair D'Silva <alastair@d-silva.org>
---
 powerpc64/ecc-secp384r1-modp.asm | 227 +++++++++++++++++++++++++++
 1 file changed, 227 insertions(+)
 create mode 100644 powerpc64/ecc-secp384r1-modp.asm
diff --git a/powerpc64/ecc-secp384r1-modp.asm b/powerpc64/ecc-secp384r1-modp.asm new file mode 100644 index 00000000..67791f09 --- /dev/null +++ b/powerpc64/ecc-secp384r1-modp.asm @@ -0,0 +1,227 @@ +C powerpc64/ecc-secp384r1-modp.asm + +ifelse(` + Copyright (C) 2021 Martin Schwenke, Amitay Isaacs & Alastair D´Silva, IBM Corporation + + Based on x86_64/ecc-secp256r1-redc.asm + + This file is part of GNU Nettle. + + GNU Nettle is free software: you can redistribute it and/or + modify it under the terms of either: + + * the GNU Lesser General Public License as published by the Free + Software Foundation; either version 3 of the License, or (at your + option) any later version. + + or + + * the GNU General Public License as published by the Free + Software Foundation; either version 2 of the License, or (at your + option) any later version. + + or both in parallel, as here. + + GNU Nettle is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + General Public License for more details. + + You should have received copies of the GNU General Public License and + the GNU Lesser General Public License along with this program. If + not, see http://www.gnu.org/licenses/. +') + + .file "ecc-secp384r1-modp.asm" + +C Register usage: + +define(`SP', `r1') + +define(`RP', `r4') +define(`XP', `r5') + +define(`D5', `r6') +define(`T0', `r7') +define(`T1', `r8') +define(`T2', `r9') +define(`T3', `r10') +define(`T4', `r11') +define(`T5', `r12') +define(`H0', `r14') +define(`H1', `r15') +define(`H2', `r16') +define(`H3', `r17') +define(`H4', `r18') +define(`H5', `r19') +define(`C2', `r3') +define(`C0', H5) C Overlap +define(`TMP', XP) C Overlap + + + C void ecc_secp384r1_modp (const struct ecc_modulo *m, mp_limb_t *rp, mp_limb_t *xp) + .text +define(`FUNC_ALIGN', `5') +PROLOGUE(_nettle_ecc_secp384r1_modp) + + std H0, -48(SP) + std H1, -40(SP) + std H2, -32(SP) + std H3, -24(SP) + std H4, -16(SP) + std H5, -8(SP) + + C First get top 2 limbs, which need folding twice. + C B^10 = B^6 + B^4 + 2^32 (B-1)B^4. + C We handle the terms as follow: + C + C B^6: Folded immediatly. + C + C B^4: Delayed, added in in the next folding. + C + C 2^32(B-1) B^4: Low half limb delayed until the next + C folding. Top 1.5 limbs subtracted and shifter now, resulting + C in 2.5 limbs. The low limb saved in D5, high 1.5 limbs added + C in. 
+ + ld H4, 80(XP) + ld H5, 88(XP) + C Shift right 32 bits, into H1, H0 + srdi H1, H5, 32 + sldi D5, H5, 32 + srdi H0, H4, 32 + or H0, H0, D5 + + C H1 H0 + C - H1 H0 + C -------- + C H1 H0 D5 + subfic D5, H0, 0 + subfe H0, H1, H0 + addme H1, H1 + + li C2, 0 + addc H0, H4, H0 + adde H1, H5, H1 + addze C2, C2 + + C Add in to high part + ld T1, 48(XP) + ld T2, 56(XP) + addc H0, T1, H0 + adde H1, T2, H1 + addze C2, C2 C Do C2 later + + C +1 term + ld T0, 0(XP) + ld T1, 8(XP) + ld T2, 16(XP) + ld T3, 24(XP) + ld T4, 32(XP) + ld T5, 40(XP) + ld H2, 64(XP) + ld H3, 72(XP) + addc T0, H0, T0 + adde T1, H1, T1 + adde T2, H2, T2 + adde T3, H3, T3 + adde T4, H4, T4 + adde T5, H5, T5 + li C0, 0 + addze C0, C0 + + C +B^2 term + addc T2, H0, T2 + adde T3, H1, T3 + adde T4, H2, T4 + adde T5, H3, T5 + addze C0, C0 + + C Shift left, including low half of H4 + sldi H4, H4, 32 + srdi TMP, H3, 32 + or H4, TMP, H4 + + sldi H3, H3, 32 + srdi TMP, H2, 32 + or H3, TMP, H3 + + sldi H2, H2, 32 + srdi TMP, H1, 32 + or H2, TMP, H2 + + sldi H1, H1, 32 + srdi TMP, H0, 32 + or H1, TMP, H1 + + sldi H0, H0, 32 + + C H4 H3 H2 H1 H0 0 + C - H4 H3 H2 H1 H0 + C --------------- + C H4 H3 H2 H1 H0 TMP + + subfic TMP, H0, 0 + subfe H0, H1, H0 + subfe H1, H2, H1 + subfe H2, H3, H2 + subfe H3, H4, H3 + addme H4, H4 + + addc T0, TMP, T0 + adde T1, H0, T1 + adde T2, H1, T2 + adde T3, H2, T3 + adde T4, H3, T4 + adde T5, H4, T5 + addze C0, C0 + + C Remains to add in C2 and C0 + C Set H1, H0 = (2^96 - 2^32 + 1) C0 + sldi H1, C0, 32 + subfc H0, H1, C0 + addme H1, H1 + + C Set H3, H2 = (2^96 - 2^32 + 1) C2 + sldi H3, C2, 32 + subfc H2, H3, C2 + addme H3, H3 + addc H2, C0, H2 + + li C0, 0 + addc T0, H0, T0 + adde T1, H1, T1 + adde T2, H2, T2 + adde T3, H3, T3 + adde T4, C2, T4 + adde T5, D5, T5 C Value delayed from initial folding + addze C0, C0 + + C Final unlikely carry + sldi H1, C0, 32 + subfc H0, H1, C0 + addme H1, H1 + + addc T0, H0, T0 + adde T1, H1, T1 + adde T2, C0, T2 + addze T3, T3 + addze T4, T4 + addze T5, T5 + + std T0, 0(RP) + std T1, 8(RP) + std T2, 16(RP) + std T3, 24(RP) + std T4, 32(RP) + std T5, 40(RP) + + ld H0, -48(SP) + ld H1, -40(SP) + ld H2, -32(SP) + ld H3, -24(SP) + ld H4, -16(SP) + ld H5, -8(SP) + + blr +EPILOGUE(_nettle_ecc_secp384r1_modp)
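The folding identities referred to in the comments of the patch above can be checked directly; a short Python sketch (not part of the patch), with B = 2^64 as in the rest of the thread:

B = 1 << 64
p = 2**384 - 2**128 - 2**96 + 2**32 - 1          # secp384r1 prime

# Single-limb folding: B^6 = B^2 + 2^32 (B-1) + 1   (mod p)
assert (B**6 - (B**2 + 2**32 * (B - 1) + 1)) % p == 0

# Top-two-limb folding from the comment: B^10 = B^6 + B^4 + 2^32 (B-1) B^4   (mod p)
assert (B**10 - (B**6 + B**4 + 2**32 * (B - 1) * B**4)) % p == 0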
Amitay Isaacs amitay@ozlabs.org writes:
diff --git a/powerpc64/ecc-secp384r1-modp.asm b/powerpc64/ecc-secp384r1-modp.asm new file mode 100644 index 00000000..67791f09 --- /dev/null +++ b/powerpc64/ecc-secp384r1-modp.asm @@ -0,0 +1,227 @@ +C powerpc64/ecc-secp384r1-modp.asm
This looks nice (and it seems the folding scheme is the same as for the x86_64 version). Just one minor thing,
+define(`FUNC_ALIGN', `5') +PROLOGUE(_nettle_ecc_secp384r1_modp)
+    std H0, -48(SP)
+    std H1, -40(SP)
+    std H2, -32(SP)
+    std H3, -24(SP)
+    std H4, -16(SP)
+    std H5, -8(SP)
I find it clearer to use register names rather than the m4 defines for save and restore of callee-save registers.
Regards, /Niels
On Tue, 2022-01-04 at 21:28 +0100, Niels Möller wrote:
+define(`FUNC_ALIGN', `5') +PROLOGUE(_nettle_ecc_secp384r1_modp)
+    std H0, -48(SP)
+    std H1, -40(SP)
+    std H2, -32(SP)
+    std H3, -24(SP)
+    std H4, -16(SP)
+    std H5, -8(SP)
I find it clearer to use register names rather than the m4 defines for save and restore of callee-save registers.
Here's the modified code which uses the actual registers when saving and restoring from stack.
Amitay.
From: Martin Schwenke martin@meltin.net
Signed-off-by: Martin Schwenke <martin@meltin.net>
Signed-off-by: Alastair D'Silva <alastair@d-silva.org>
---
 powerpc64/ecc-secp521r1-modp.asm | 166 +++++++++++++++++++++++++++
 1 file changed, 166 insertions(+)
 create mode 100644 powerpc64/ecc-secp521r1-modp.asm
diff --git a/powerpc64/ecc-secp521r1-modp.asm b/powerpc64/ecc-secp521r1-modp.asm new file mode 100644 index 00000000..a9d197c9 --- /dev/null +++ b/powerpc64/ecc-secp521r1-modp.asm @@ -0,0 +1,166 @@ +C powerpc64/ecc-secp521r1-modp.asm + +ifelse(` + Copyright (C) 2021 Martin Schwenke & Alastair D´Silva, IBM Corporation + + Based on x86_64/ecc-secp521r1-modp.asm + + This file is part of GNU Nettle. + + GNU Nettle is free software: you can redistribute it and/or + modify it under the terms of either: + + * the GNU Lesser General Public License as published by the Free + Software Foundation; either version 3 of the License, or (at your + option) any later version. + + or + + * the GNU General Public License as published by the Free + Software Foundation; either version 2 of the License, or (at your + option) any later version. + + or both in parallel, as here. + + GNU Nettle is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + General Public License for more details. + + You should have received copies of the GNU General Public License and + the GNU Lesser General Public License along with this program. If + not, see http://www.gnu.org/licenses/. +') + + .file "ecc-secp521r1-modp.asm" + +define(`SP', `r1') + +define(`RP', `r4') +define(`XP', `r5') + +define(`U0', `r6') +define(`U1', `r7') +define(`U2', `r8') +define(`U3', `r9') +define(`U4', `r10') +define(`U5', `r11') +define(`U6', `r12') +define(`U7', `r14') +define(`U8', `r15') +define(`U9', `r16') + +define(`T0', `r3') +define(`T1', `r17') + + + C void ecc_secp521r1_modp (const struct ecc_modulo *p, mp_limb_t *rp, mp_limb_t *xp) + .text +define(`FUNC_ALIGN', `5') +PROLOGUE(_nettle_ecc_secp521r1_modp) + + std U7, -32(SP) + std U8, -24(SP) + std U9, -16(SP) + std T1, -8(SP) + + C Read top 17 limbs, shift left 55 bits + ld U1, 72(XP) + sldi U0, U1, 55 + srdi U1, U1, 9 + + ld T0, 80(XP) + srdi U2, T0, 9 + sldi T0, T0, 55 + or U1, T0, U1 + + ld T0, 88(XP) + srdi U3, T0, 9 + sldi T0, T0, 55 + or U2, T0, U2 + + ld T0, 96(XP) + srdi U4, T0, 9 + sldi T0, T0, 55 + or U3, T0, U3 + + ld T0, 104(XP) + srdi U5, T0, 9 + sldi T0, T0, 55 + or U4, T0, U4 + + ld T0, 112(XP) + srdi U6, T0, 9 + sldi T0, T0, 55 + or U5, T0, U5 + + ld T0, 120(XP) + srdi U7, T0, 9 + sldi T0, T0, 55 + or U6, T0, U6 + + ld T0, 128(XP) + srdi U8, T0, 9 + sldi T0, T0, 55 + or U7, T0, U7 + + ld T0, 136(XP) + srdi U9, T0, 9 + sldi T0, T0, 55 + or U8, T0, U8 + + ld T0, 0(XP) + ld T1, 8(XP) + addc U0, T0, U0 + adde U1, T1, U1 + ld T0, 16(XP) + ld T1, 24(XP) + adde U2, T0, U2 + adde U3, T1, U3 + ld T0, 32(XP) + ld T1, 40(XP) + adde U4, T0, U4 + adde U5, T1, U5 + ld T0, 48(XP) + ld T1, 56(XP) + adde U6, T0, U6 + adde U7, T1, U7 + ld T0, 64(XP) + adde U8, T0, U8 + addze U9, U9 + + C Top limbs are <U9, U8>. Keep low 9 bits of 8, and fold the + C top bits (at most 65 bits). + srdi T0, U8, 9 + andi. U8, U8, 0x1ff + srdi T1, U9, 9 + sldi U9, U9, 55 + or T0, U9, T0 + + addc U0, T0, U0 + adde U1, T1, U1 + addze U2, U2 + addze U3, U3 + addze U4, U4 + addze U5, U5 + addze U6, U6 + addze U7, U7 + addze U8, U8 + + std U0, 0(RP) + std U1, 8(RP) + std U2, 16(RP) + std U3, 24(RP) + std U4, 32(RP) + std U5, 40(RP) + std U6, 48(RP) + std U7, 56(RP) + std U8, 64(RP) + + ld U7, -32(SP) + ld U8, -24(SP) + ld U9, -16(SP) + ld T1, -8(SP) + + blr +EPILOGUE(_nettle_ecc_secp521r1_modp)
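For reference, the reduction above relies on 2^521 = 1 (mod p); a Python sketch of just that identity (the double-length input width here is only illustrative, and the limb-level 55/9-bit shift handling is not modelled):

from random import getrandbits

p = 2**521 - 1                        # secp521r1 prime

for _ in range(1000):
    x = getrandbits(2 * 521)
    hi, lo = x >> 521, x % 2**521
    assert (hi + lo) % p == x % p     # fold the bits above bit 521 back in
    # hi + lo < 2^522, so one more small fold (the srdi/andi./add sequence
    # at the end of the function) leaves a 9-limb result.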
From: Martin Schwenke martin@meltin.net
Signed-off-by: Martin Schwenke <martin@meltin.net>
Signed-off-by: Alastair D'Silva <alastair@d-silva.org>
---
 powerpc64/ecc-curve25519-modp.asm | 103 ++++++++++++++++++++++++++++++
 1 file changed, 103 insertions(+)
 create mode 100644 powerpc64/ecc-curve25519-modp.asm
diff --git a/powerpc64/ecc-curve25519-modp.asm b/powerpc64/ecc-curve25519-modp.asm new file mode 100644 index 00000000..4f9f8f73 --- /dev/null +++ b/powerpc64/ecc-curve25519-modp.asm @@ -0,0 +1,103 @@ +C powerpc64/ecc-25519-modp.asm + +ifelse(` + Copyright (C) 2021 Martin Schwenke & Alastair D´Silva, IBM Corporation + + Based on x86_64/ecc-25519-modp.asm + + This file is part of GNU Nettle. + + GNU Nettle is free software: you can redistribute it and/or + modify it under the terms of either: + + * the GNU Lesser General Public License as published by the Free + Software Foundation; either version 3 of the License, or (at your + option) any later version. + + or + + * the GNU General Public License as published by the Free + Software Foundation; either version 2 of the License, or (at your + option) any later version. + + or both in parallel, as here. + + GNU Nettle is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + General Public License for more details. + + You should have received copies of the GNU General Public License and + the GNU Lesser General Public License along with this program. If + not, see http://www.gnu.org/licenses/. +') + + .file "ecc-25519-modp.asm" + +define(`SP', `r1') + +define(`RP', `r4') +define(`XP', `r5') + +define(`U0', `r6') C Overlaps unused modulo input +define(`U1', `r7') +define(`U2', `r8') +define(`U3', `r9') +define(`T0', `r10') +define(`T1', `r11') +define(`M', `r12') + +define(`UN', r3) + + C void ecc_curve25519_modp (const struct ecc_modulo *p, mp_limb_t *rp, mp_limb_t *xp) + .text +define(`FUNC_ALIGN', `5') +PROLOGUE(_nettle_ecc_curve25519_modp) + + C First fold the limbs affecting bit 255 + ld UN, 56(XP) + li M, 38 + mulhdu T1, M, UN + mulld UN, M, UN + ld U3, 24(XP) + li T0, 0 + addc U3, UN, U3 + adde T0, T1, T0 + + ld UN, 40(XP) + mulhdu U2, M, UN + mulld UN, M, UN + + addc U3, U3, U3 + adde T0, T0, T0 + srdi U3, U3, 1 C Undo shift, clear high bit + + C Fold the high limb again, together with RP[5] + li T1, 19 + mulld T0, T1, T0 + ld U0, 0(XP) + ld U1, 8(XP) + ld T1, 16(XP) + addc U0, T0, U0 + adde U1, UN, U1 + ld T0, 32(XP) + adde U2, U2, T1 + addze U3, U3 + + mulhdu T1, M, T0 + mulld T0, M, T0 + addc U0, T0, U0 + adde U1, T1, U1 + std U0, 0(RP) + std U1, 8(RP) + + ld T0, 48(XP) + mulhdu T1, M, T0 + mulld UN, M, T0 + adde U2, UN, U2 + adde U3, T1, U3 + std U2, 16(RP) + std U3, 24(RP) + + blr +EPILOGUE(_nettle_ecc_curve25519_modp)
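The constants in this patch come from 2^255 = 19 and 2^256 = 38 (mod p). A quick Python check of the basic folding (the register-level handling of the limbs affecting bit 255 in the code above is not modelled here):

from random import getrandbits

p = 2**255 - 19

assert (2**255 - 19) % p == 0
assert (2**256 - 38) % p == 0         # hence the multiplier M = 38

for _ in range(1000):
    x = getrandbits(512)              # double-length input
    hi, lo = x >> 256, x % 2**256
    assert (lo + 38 * hi) % p == x % p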
From: Martin Schwenke martin@meltin.net
Signed-off-by: Martin Schwenke <martin@meltin.net>
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
---
 powerpc64/ecc-curve448-modp.asm | 174 ++++++++++++++++++++++++++++
 1 file changed, 174 insertions(+)
 create mode 100644 powerpc64/ecc-curve448-modp.asm
diff --git a/powerpc64/ecc-curve448-modp.asm b/powerpc64/ecc-curve448-modp.asm new file mode 100644 index 00000000..411c8df7 --- /dev/null +++ b/powerpc64/ecc-curve448-modp.asm @@ -0,0 +1,174 @@ +C powerpc/ecc-curve448-modp.asm + +ifelse(` + Copyright (C) 2021 Martin Schwenke & Amitay Isaacs, IBM Corporation + + Based on x86_64/ecc-curve448-modp.asm + + This file is part of GNU Nettle. + + GNU Nettle is free software: you can redistribute it and/or + modify it under the terms of either: + + * the GNU Lesser General Public License as published by the Free + Software Foundation; either version 3 of the License, or (at your + option) any later version. + + or + + * the GNU General Public License as published by the Free + Software Foundation; either version 2 of the License, or (at your + option) any later version. + + or both in parallel, as here. + + GNU Nettle is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + General Public License for more details. + + You should have received copies of the GNU General Public License and + the GNU Lesser General Public License along with this program. If + not, see http://www.gnu.org/licenses/. +') + + .file "ecc-curve448-modp.asm" + +define(`SP', `r1') + +define(`RP', `r4') +define(`XP', `r5') + +define(`X0', `r3') +define(`X1', `r9') +define(`X2', `r10') +define(`X3', `r11') +define(`X4', `r12') +define(`X5', `r14') +define(`X6', `r15') +define(`X7', `r16') +define(`T0', `r6') +define(`T1', `r7') +define(`T2', `r8') +define(`TT', `r17') + +define(`LO', `TT') C Overlap + + C void ecc_curve448_modp (const struct ecc_modulo *p, mp_limb_t *rp, mp_limb_t *xp) + .text +define(`FUNC_ALIGN', `5') +PROLOGUE(_nettle_ecc_curve448_modp) + + std X5, -32(SP) + std X6, -24(SP) + std X7, -16(SP) + std TT, -8(SP) + + C First load the values to be shifted by 32. + ld T0, 88(XP) C use for X0, X1 + ld T1, 96(XP) C use for X2 + ld T2, 104(XP) C use for X3 + ld X4, 56(XP) + ld X5, 64(XP) + ld X6, 72(XP) + ld X7, 80(XP) + + C Multiply by 2^32 + sldi X0, T0, 32 + srdi LO, T0, 32 + sldi X1, T1, 32 + or X1, X1, LO + srdi LO, T1, 32 + sldi X2, T2, 32 + or X2, X2, LO + srdi LO, T2, 32 + sldi X3, X4, 32 + or X3, X3, LO + srdi LO, X4, 32 + sldi X4, X5, 32 + or X4, X4, LO + srdi LO, X5, 32 + sldi X5, X6, 32 + or X5, X5, LO + srdi LO, X6, 32 + sldi X6, X7, 32 + or X6, X6, LO + + srdi X7, X7, 32 + + C Multiply by 2 + addc T0, T0, T0 + adde T1, T1, T1 + adde T2, T2, T2 + addze X7, X7 + + C Main additions + ld TT, 56(XP) + addc X0, TT, X0 + ld TT, 64(XP) + adde X1, TT, X1 + ld TT, 72(XP) + adde X2, TT, X2 + ld TT, 80(XP) + adde X3, TT, X3 + adde X4, T0, X4 + adde X5, T1, X5 + adde X6, T2, X6 + addze X7, X7 + + ld T0, 0(XP) + addc X0, T0, X0 + ld T1, 8(XP) + adde X1, T1, X1 + ld T2, 16(XP) + adde X2, T2, X2 + ld TT, 24(XP) + adde X3, TT, X3 + ld T0, 32(XP) + adde X4, T0, X4 + ld T1, 40(XP) + adde X5, T1, X5 + ld T2, 48(XP) + adde X6, T2, X6 + addze X7, X7 + + C X7 wraparound + sldi T0, X7, 32 + srdi T1, X7, 32 + li T2, 0 + addc X0, X7, X0 + addze X1, X1 + addze X2, X2 + adde X3, T0, X3 + adde X4, T1, X4 + addze X5, X5 + addze X6, X6 + addze T2, T2 + + C Final carry wraparound. Carry T2 > 0 only if + C X6 is zero, so carry is absorbed. 
+ sldi T0, T2, 32 + + addc X0, T2, X0 + addze X1, X1 + addze X2, X2 + adde X3, T0, X3 + addze X4, X4 + addze X5, X5 + addze X6, X6 + + std X0, 0(RP) + std X1, 8(RP) + std X2, 16(RP) + std X3, 24(RP) + std X4, 32(RP) + std X5, 40(RP) + std X6, 48(RP) + + ld X5, -32(SP) + ld X6, -24(SP) + ld X7, -16(SP) + ld TT, -8(SP) + + blr +EPILOGUE(_nettle_ecc_curve448_modp)
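Similarly for curve448, the wraparound uses 2^448 = 2^224 + 1 (mod p), and since 2^224 = 2^32 * B^3 with B = 2^64, the fold lands in the middle of a limb, which is where all the 32-bit shifts above come from. A quick Python check of the identity (only the arithmetic, not the register-level steps):

from random import getrandbits

B = 1 << 64
p = 2**448 - 2**224 - 1               # curve448 prime

assert 2**224 == 2**32 * B**3

for _ in range(1000):
    x = getrandbits(896)              # double-length input
    hi, lo = x >> 448, x % 2**448
    assert (lo + hi * (2**224 + 1)) % p == x % p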
Amitay Isaacs amitay@ozlabs.org writes:
This series of patches add the powerpc64 assembly for modp/redc functions for elliptic curves P192, P224, P256, P384, P521, X25519 and X448. It results in 15-30% performance improvements as measured on POWER9 system using hogweed-benchmark.
Nice. For testing these functions, I recommend running
while NETTLE_TEST_SEED=0 ./testsuite/ecc-mod-test ; do : ; done
and
while NETTLE_TEST_SEED=0 ./testsuite/ecc-redc-test ; do : ; done
for a few hours.
Regards, /Niels
On Sun, 2021-11-28 at 11:17 +0100, Niels Möller wrote:
Amitay Isaacs amitay@ozlabs.org writes:
This series of patches add the powerpc64 assembly for modp/redc functions for elliptic curves P192, P224, P256, P384, P521, X25519 and X448. It results in 15-30% performance improvements as measured on POWER9 system using hogweed-benchmark.
Nice. For testing these functions, I recommend running
while NETTLE_TEST_SEED=0 ./testsuite/ecc-mod-test ; do : ; done
and
while NETTLE_TEST_SEED=0 ./testsuite/ecc-redc-test ; do : ; done
for a few hours.
Regards, /Niels
I have both tests running for more than 24 hours without failure.
Is there anything else I need to do?
Thanks.
Amitay.