Hi,
This series of patches adds powerpc64 assembly for the modp/redc functions for the elliptic curves P192, P224, P256, P384, P521, X25519 and X448. It results in 15-30% performance improvements, as measured on a POWER9 system using hogweed-benchmark.
master
------

 name   size   sign/ms  verify/ms
 ecdsa   192   13.5465     3.8313
 ecdsa   224    8.4079     2.5289
 ecdsa   256    8.5772     2.4691
 ecdsa   384    3.4361     0.9765
 ecdsa   521    2.2918     0.6785
 eddsa   255   15.0672     4.1138
 eddsa   448    4.7115     1.3067
 curve   255   15.5206     6.4745
 curve   448    4.7906     2.0458
this series
-----------

 name   size   sign/ms  verify/ms
 ecdsa   192   17.5450     5.6564
 ecdsa   224   10.8847     3.7482
 ecdsa   256   11.0208     3.5215
 ecdsa   384    5.1133     1.6133
 ecdsa   521    2.8815     0.9348
 eddsa   255   17.4936     4.9027
 eddsa   448    6.0401     1.7075
 curve   255   18.1034     7.6884
 curve   448    6.1756     2.7095
Amitay Isaacs (3):
  ecc: Add powerpc64 assembly for ecc_192_modp
  ecc: Add powerpc64 assembly for ecc_224_modp
  ecc: Add powerpc64 assembly for ecc_256_redc
Martin Schwenke (4):
  ecc: Add powerpc64 assembly for ecc_384_modp
  ecc: Add powerpc64 assembly for ecc_521_modp
  ecc: Add powerpc64 assembly for ecc_25519_modp
  ecc: Add powerpc64 assembly for ecc_448_modp
 powerpc64/ecc-curve25519-modp.asm | 103 ++++++++++++++
 powerpc64/ecc-curve448-modp.asm   | 174 +++++++++++++++++++++++
 powerpc64/ecc-secp192r1-modp.asm  |  93 ++++++++++++
 powerpc64/ecc-secp224r1-modp.asm  | 123 ++++++++++++++++
 powerpc64/ecc-secp256r1-redc.asm  | 144 +++++++++++++++++++
 powerpc64/ecc-secp384r1-modp.asm  | 227 ++++++++++++++++++++++++++
 powerpc64/ecc-secp521r1-modp.asm  | 166 ++++++++++++++++++++
 7 files changed, 1030 insertions(+)
 create mode 100644 powerpc64/ecc-curve25519-modp.asm
 create mode 100644 powerpc64/ecc-curve448-modp.asm
 create mode 100644 powerpc64/ecc-secp192r1-modp.asm
 create mode 100644 powerpc64/ecc-secp224r1-modp.asm
 create mode 100644 powerpc64/ecc-secp256r1-redc.asm
 create mode 100644 powerpc64/ecc-secp384r1-modp.asm
 create mode 100644 powerpc64/ecc-secp521r1-modp.asm
Signed-off-by: Amitay Isaacs <amitay@ozlabs.org>
---
 powerpc64/ecc-secp192r1-modp.asm | 93 ++++++++++++++++++++++++++++++++
 1 file changed, 93 insertions(+)
 create mode 100644 powerpc64/ecc-secp192r1-modp.asm
diff --git a/powerpc64/ecc-secp192r1-modp.asm b/powerpc64/ecc-secp192r1-modp.asm new file mode 100644 index 00000000..97c71a83 --- /dev/null +++ b/powerpc64/ecc-secp192r1-modp.asm @@ -0,0 +1,93 @@ +C powerpc64/ecc-secp192r1-modp.asm + +ifelse(` + Copyright (C) 2021 Amitay Isaacs, IBM Corporation + + This file is part of GNU Nettle. + + GNU Nettle is free software: you can redistribute it and/or + modify it under the terms of either: + + * the GNU Lesser General Public License as published by the Free + Software Foundation; either version 3 of the License, or (at your + option) any later version. + + or + + * the GNU General Public License as published by the Free + Software Foundation; either version 2 of the License, or (at your + option) any later version. + + or both in parallel, as here. + + GNU Nettle is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + General Public License for more details. + + You should have received copies of the GNU General Public License and + the GNU Lesser General Public License along with this program. If + not, see http://www.gnu.org/licenses/. +') + + .file "ecc-secp192r1-modp.asm" + +define(`RP', `r4') +define(`XP', `r5') + +define(`T0', `r6') +define(`T1', `r7') +define(`T2', `r8') +define(`T3', `r9') +define(`C1', `r10') +define(`C2', `r11') + + C void ecc_secp192r1_modp (const struct ecc_modulo *m, mp_limb_t *rp) + .text +define(`FUNC_ALIGN', `5') +PROLOGUE(_nettle_ecc_secp192r1_modp) + ld T0, 0(XP) + ld T1, 8(XP) + ld T2, 16(XP) + + li C1, 0 + li C2, 0 + + ld T3, 24(XP) + addc T0, T3, T0 + adde T1, T3, T1 + addze T2, T2 + addze C1, C1 + + ld T3, 32(XP) + addc T1, T3, T1 + adde T2, T3, T2 + addze C1, C1 + + ld T3, 40(XP) + addc T0, T3, T0 + adde T1, T3, T1 + adde T2, T3, T2 + addze C1, C1 + + addc T0, C1, T0 + adde T1, C1, T1 + addze T2, T2 + addze C2, C2 + + li C1, 0 + addc T0, C2, T0 + adde T1, C2, T1 + addze T2, T2 + addze C1, C1 + + addc T0, C1, T0 + adde T1, C1, T1 + addze T2, T2 + + std T0, 0(RP) + std T1, 8(RP) + std T2, 16(RP) + + blr +EPILOGUE(_nettle_ecc_secp192r1_modp)
Amitay Isaacs amitay@ozlabs.org writes:
+    .file "ecc-secp192r1-modp.asm"
Thanks, I'm looking at this file first (being the simplest, even though the security level of this curve is a bit low for current usage, so performance not of so great importance).
I'm quite new to powerpc, so I'm referring to the instruction reference, and trying to learn as we go along. It seems addc is addition with carry output (but no carry input), adde is addition with carry input and output, and addze is addition of zero with carry input and output.
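For readers following along, here is a small Python model of those three instructions on 64-bit limbs. It is only a sketch of the carry semantics described above (not authoritative ISA documentation), but it matches the way the carry chain is used below.

MASK = (1 << 64) - 1

def addc(a, b):            # add; carry out, no carry in
    s = a + b
    return s & MASK, s >> 64

def adde(a, b, ca):        # add with carry in and carry out
    s = a + b + ca
    return s & MASK, s >> 64

def addze(a, ca):          # add the carry to a single register
    s = a + ca
    return s & MASK, s >> 64

# Propagate a carry through a two-limb value:
lo, ca = addc(MASK, 1)     # lo = 0, ca = 1
hi, ca = adde(5, 7, ca)    # hi = 13, ca = 0
print(hex(lo), hex(hi), ca)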
+define(`RP', `r4') +define(`XP', `r5')
+define(`T0', `r6') +define(`T1', `r7') +define(`T2', `r8') +define(`T3', `r9') +define(`C1', `r10') +define(`C2', `r11')
As I understand it, we could also use register r3 (unused input argument), but we don't need to, since we have enough free scratch registers.
+    C void ecc_secp192r1_modp (const struct ecc_modulo *m, mp_limb_t *rp)
+    .text
+define(`FUNC_ALIGN', `5')
+PROLOGUE(_nettle_ecc_secp192r1_modp)
+    ld T0, 0(XP)
+    ld T1, 8(XP)
+    ld T2, 16(XP)
+
+    li C1, 0
+    li C2, 0
+
+    ld T3, 24(XP)
+    addc T0, T3, T0
+    adde T1, T3, T1
+    addze T2, T2
+    addze C1, C1
+
+    ld T3, 32(XP)
+    addc T1, T3, T1
+    adde T2, T3, T2
+    addze C1, C1
+
+    ld T3, 40(XP)
+    addc T0, T3, T0
+    adde T1, T3, T1
+    adde T2, T3, T2
+    addze C1, C1
To analyze what we are doing, I'm using the Nettle and GMP convention that B = 2^64 (bignum base), then p = B^3 - B - 1, or B^3 = B + 1 (mod p). Denote the six input words as
<a_5,a_4,a_3,a_2,a_1,a_0>
representing the number
B^5 a_5 + B^4 a_4 + B^3 a_3 + B^2 a_2 + B a_1 + a_0
The accumulation above, as I understand it, computes
<c_1,t_2,t_1,t_0> = <a_2,a_1,a_0> + a_3 (B+1) + a_4 (B^2 + B) + a_5 (B^2 + B + 1)
or more graphically,
        a_2 a_1 a_0
            a_3 a_3
        a_4 a_4
      + a_5 a_5 a_5
      -------------
  c_1   t_2 t_1 t_0
This number is < 3 B^3, which means that c_1 is 0, 1 or 2 (each of the addze instructions can increment it).
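As a sanity check of this accumulation and the bound, a small Python sketch mirroring the formulas above (not part of the patch; limb names follow the discussion):

from random import getrandbits

B = 1 << 64
p = B**3 - B - 1                      # secp192r1 prime

for _ in range(1000):
    a = [getrandbits(64) for _ in range(6)]            # a_0 .. a_5
    x = sum(a_i * B**i for i, a_i in enumerate(a))
    # <c_1,t_2,t_1,t_0> = <a_2,a_1,a_0> + a_3 (B+1) + a_4 (B^2+B) + a_5 (B^2+B+1)
    acc = (a[0] + a[1]*B + a[2]*B**2
           + a[3]*(B + 1) + a[4]*(B**2 + B) + a[5]*(B**2 + B + 1))
    assert acc % p == x % p           # same residue mod p
    assert acc < 3 * B**3             # so c_1 = acc >> 192 is 0, 1 or 2
    assert acc >> 192 <= 2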
This looks nice, and I think it is pretty efficient too. It looks a bit different from what the x86_64 code is doing; maybe the latter could be improved.
+    addc T0, C1, T0
+    adde T1, C1, T1
+    addze T2, T2
+    addze C2, C2
Above, c_1 is folded in at the right places,
<c_2,t_2,t_1,t_0> <-- <t_2, t_1, t_0> + c_1 (B + 1)
This number is < B^3 + 3 (B+1). This implies that in the (quite unlikely) case we get carry out, i.e., c_2 = 1, then the value of the low three words is < 3 (B+1). That means that there can be no new carry out when folding c_2.
+    li C1, 0
+    addc T0, C2, T0
+    adde T1, C2, T1
+    addze T2, T2
+    addze C1, C1
+
+    addc T0, C1, T0
+    adde T1, C1, T1
+    addze T2, T2
So I think this final folding could be reduced to just
  addc T0, C2, T0
  adde T1, C2, T1
  addze T2, T2
There's no carry out from this, because either C2 was zero, or T2 was small, <= 3. Does that make sense?
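A randomized Python sketch of the whole reduction with that shortened final folding (it encodes the bound argument above as assertions; the unlikely carry cases are asserted rather than expected to be hit by random inputs):

from random import getrandbits

B = 1 << 64
p = B**3 - B - 1

def fold(t, c):
    return t + c * (B + 1)            # add carry c at limb positions 0 and 1

for _ in range(10000):
    a = [getrandbits(64) for _ in range(6)]
    x = sum(a_i * B**i for i, a_i in enumerate(a))

    acc = (a[0] + a[1]*B + a[2]*B**2
           + a[3]*(B + 1) + a[4]*(B**2 + B) + a[5]*(B**2 + B + 1))
    t, c1 = acc % B**3, acc >> 192    # c1 in {0, 1, 2}

    acc = fold(t, c1)                 # fold c1
    t, c2 = acc % B**3, acc >> 192    # c2 in {0, 1}

    acc = fold(t, c2)                 # the shortened final folding
    t, c3 = acc % B**3, acc >> 192

    assert c3 == 0                    # no carry out of the last fold
    assert t % p == x % p             # correct (possibly non-canonical) residue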
+    std T0, 0(RP)
+    std T1, 8(RP)
+    std T2, 16(RP)
+
+    blr
+EPILOGUE(_nettle_ecc_secp192r1_modp)
Regards, /Niels
On Tue, 2021-11-30 at 20:17 +0100, Niels Möller wrote:
Amitay Isaacs amitay@ozlabs.org writes:
+ .file "ecc-secp192r1-modp.asm"
Thanks, I'm looking at this file first (being the simplest, even though the security level of this curve is a bit low for current usage, so performance not of so great importance).
Yes. The main focus was on curves p256 and higher. For completeness sake I added the assembly for p192 and p224 curves.
I'm quite new to powerpc, so I'm refering to the instruction reference, and trying to learn as we go along. It seems addc is addition with carry output (but no carry input), adde is addition with carry input and output, and addze is addition of zero with carry input and output.
+define(`RP', `r4') +define(`XP', `r5')
+define(`T0', `r6') +define(`T1', `r7') +define(`T2', `r8') +define(`T3', `r9') +define(`C1', `r10') +define(`C2', `r11')
As I understand it, we could also use register r3 (unused input argument), but we don't need to, since we have enough free scratch registers.
Generally I avoided using r3 as it is the first input argument, but also serves as the return value of the function. So unless we are running short of registers r3 is left untouched. (Even though there are 32 registers, only some of them can be used without saving them first.) For example, r3 is used in p224 modp implementation.
+    C void ecc_secp192r1_modp (const struct ecc_modulo *m, mp_limb_t *rp)
+    .text
+define(`FUNC_ALIGN', `5')
+PROLOGUE(_nettle_ecc_secp192r1_modp)
+    ld T0, 0(XP)
+    ld T1, 8(XP)
+    ld T2, 16(XP)
+
+    li C1, 0
+    li C2, 0
+
+    ld T3, 24(XP)
+    addc T0, T3, T0
+    adde T1, T3, T1
+    addze T2, T2
+    addze C1, C1
+
+    ld T3, 32(XP)
+    addc T1, T3, T1
+    adde T2, T3, T2
+    addze C1, C1
+
+    ld T3, 40(XP)
+    addc T0, T3, T0
+    adde T1, T3, T1
+    adde T2, T3, T2
+    addze C1, C1
To analyze what we are doing, I'm using the Nettle and GMP convention that B = 2^64 (bignum base), then p = B^3 - B - 1, or B^3 = B + 1 (mod p). Denote the six input words as
<a_5,a_4,a_3,a_2,a_1,a_0>
representing the number
B^5 a_5 + B^4 a_4 + B^3 a_3 + B^2 a_2 + B a_1 + a_0
The accumulation above, as I understand it, computes
<c_1,t_2,t_1,t_0> = <a_2,a_1,a_0> + a_3 (B+1) + a_4 (B^2 + B) + a_5 (B^2 + B + 1)
or more graphically,
        a_2 a_1 a_0
            a_3 a_3
        a_4 a_4
      + a_5 a_5 a_5
      -------------
  c_1   t_2 t_1 t_0
This number is < 3 B^3, which means that c_1 is 0, 1 or 2 (each of the addze instructions can increment it).
This looks nice, and I think it is pretty efficient too. It looks a bit different from what the x86_64 code is doing; maybe the latter could be improved.
+    addc T0, C1, T0
+    adde T1, C1, T1
+    addze T2, T2
+    addze C2, C2
Above, c_1 is folded in at the right places,
<c_2,t_2,t_1,t_0> <-- <t_2, t_1, t_0> + c_1 (B + 1)
This number is < B^3 + 3 (B+1). This implies that in the (quite unlikely) case we get carry out, i.e., c_2 = 1, then the value of the low three words is < 3 (B+1). That means that there can be no new carry out when folding c_2.
+    li C1, 0
+    addc T0, C2, T0
+    adde T1, C2, T1
+    addze T2, T2
+    addze C1, C1
+
+    addc T0, C1, T0
+    adde T1, C1, T1
+    addze T2, T2
So I think this final folding could be reduced to just
  addc T0, C2, T0
  adde T1, C2, T1
  addze T2, T2
There's no carry out, from this, because either C2 was zero, or T2 was small, <= 3. Does that make sense?
This was the first code I wrote using the exact calculation you have outlined above. However, I was not sure about the sizes of the carry (C1 and C2). I did notice the x86 code short-cutting the C2 folding, but the reasoning was not apparent.
Thank you for explaining the bounds calculation. That does help me understand why C2 folding can be simplified as you have suggested.
+    std T0, 0(RP)
+    std T1, 8(RP)
+    std T2, 16(RP)
+
+    blr
+EPILOGUE(_nettle_ecc_secp192r1_modp)
Regards, /Niels
Amitay.
Here's the updated code for P192 curve after simplifying C2 folding.
Amitay.
Signed-off-by: Amitay Isaacs <amitay@ozlabs.org>
---
 powerpc64/ecc-secp224r1-modp.asm | 123 +++++++++++++++++++++++++++++++
 1 file changed, 123 insertions(+)
 create mode 100644 powerpc64/ecc-secp224r1-modp.asm
diff --git a/powerpc64/ecc-secp224r1-modp.asm b/powerpc64/ecc-secp224r1-modp.asm new file mode 100644 index 00000000..e993376f --- /dev/null +++ b/powerpc64/ecc-secp224r1-modp.asm @@ -0,0 +1,123 @@ +C powerpc64/ecc-secp224r1-modp.asm + +ifelse(` + Copyright (C) 2021 Amitay Isaacs, IBM Corporation + + Based on x86_64/ecc-secp224r1-modp.asm + + This file is part of GNU Nettle. + + GNU Nettle is free software: you can redistribute it and/or + modify it under the terms of either: + + * the GNU Lesser General Public License as published by the Free + Software Foundation; either version 3 of the License, or (at your + option) any later version. + + or + + * the GNU General Public License as published by the Free + Software Foundation; either version 2 of the License, or (at your + option) any later version. + + or both in parallel, as here. + + GNU Nettle is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + General Public License for more details. + + You should have received copies of the GNU General Public License and + the GNU Lesser General Public License along with this program. If + not, see http://www.gnu.org/licenses/. +') + + .file "ecc-secp224r1-modp.asm" + +define(`SP', `r1') + +define(`RP', `r4') +define(`XP', `r5') + +define(`T0', `r6') +define(`T1', `r7') +define(`H0', `r8') +define(`H1', `r9') +define(`H2', `r10') +define(`F0', `r11') +define(`F1', `r12') +define(`F2', `r14') +define(`T2', `r3') + + C void ecc_secp224r1_modp (const struct ecc_modulo *m, mp_limb_t *rp) + .text +define(`FUNC_ALIGN', `5') +PROLOGUE(_nettle_ecc_secp224r1_modp) + std F2, -8(SP) + + ld H0, 48(XP) + ld H1, 56(XP) + C set (F2, F1, F0) <-- (H1, H0) << 32 + sldi F0, H0, 32 + srdi F1, H0, 32 + sldi T0, H1, 32 + srdi F2, H1, 32 + or F1, T0, F1 + + li H2, 0 + ld T0, 16(XP) + ld T1, 24(XP) + subfc T0, F0, T0 + subfe T1, F1, T1 + subfe H0, F2, H0 + addme H1, H1 + + ld T2, 32(XP) + addc H0, T2, H0 + ld T2, 40(XP) + adde H1, T2, H1 + addze H2, H2 + + C Set (F2, F1, F0) <-- (H2, H1, H0) << 32 + sldi F0, H0, 32 + srdi F1, H0, 32 + addc H0, T0, H0 + sldi T0, H1, 32 + srdi F2, H1, 32 + adde H1, T1, H1 + sldi T1, H2, 32 + addze H2, H2 + or F1, T0, F1 + or F2, T1, F2 + + ld T0, 0(XP) + ld T1, 8(XP) + subfc T0, F0, T0 + subfe T1, F1, T1 + subfe H0, F2, H0 + addme H1, H1 + addme H2, H2 + + srdi F0, H1, 32 + sldi F1, H2, 32 + or F0, F1, F0 + clrrdi F1, H1, 32 + mr F2, H2 + clrldi H1, H1, 32 + + subfc T0, F0, T0 + addme F1, F1 + addme F2, F2 + addc T1, F1, T1 + adde H0, F2, H0 + addze H1, H1 + + std T0, 0(RP) + std T1, 8(RP) + std H0, 16(RP) + std H1, 24(RP) + + ld F2, -8(SP) + + blr +EPILOGUE(_nettle_ecc_secp224r1_modp)
Updated version using actual register names for storing and restoring from stack.
Amitay.
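For the secp224r1 patch above, the underlying folding identity can be checked with a few lines of Python. This only checks the arithmetic identity the code is built on, not the register-level 32-bit shift and borrow choreography, and the 448-bit input width is just an illustrative assumption for a double-length operand:

from random import getrandbits

p = 2**224 - 2**96 + 1                # secp224r1 prime

# 2^224 = 2^96 - 1 (mod p), so a double-length value folds as
#   x = hi*2^224 + lo = hi*(2^96 - 1) + lo   (mod p)
for _ in range(1000):
    x = getrandbits(448)
    hi, lo = x >> 224, x % 2**224
    assert (hi * (2**96 - 1) + lo) % p == x % p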
Signed-off-by: Amitay Isaacs <amitay@ozlabs.org>
Signed-off-by: Martin Schwenke <martin@meltin.net>
---
 powerpc64/ecc-secp256r1-redc.asm | 144 +++++++++++++++++++++++++++++++
 1 file changed, 144 insertions(+)
 create mode 100644 powerpc64/ecc-secp256r1-redc.asm
diff --git a/powerpc64/ecc-secp256r1-redc.asm b/powerpc64/ecc-secp256r1-redc.asm new file mode 100644 index 00000000..59447567 --- /dev/null +++ b/powerpc64/ecc-secp256r1-redc.asm @@ -0,0 +1,144 @@ +C powerpc64/ecc-secp256r1-redc.asm + +ifelse(` + Copyright (C) 2021 Amitay Isaacs & Martin Schwenke, IBM Corporation + + Based on x86_64/ecc-secp256r1-redc.asm + + This file is part of GNU Nettle. + + GNU Nettle is free software: you can redistribute it and/or + modify it under the terms of either: + + * the GNU Lesser General Public License as published by the Free + Software Foundation; either version 3 of the License, or (at your + option) any later version. + + or + + * the GNU General Public License as published by the Free + Software Foundation; either version 2 of the License, or (at your + option) any later version. + + or both in parallel, as here. + + GNU Nettle is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + General Public License for more details. + + You should have received copies of the GNU General Public License and + the GNU Lesser General Public License along with this program. If + not, see http://www.gnu.org/licenses/. +') + +C Register usage: + +define(`SP', `r1') + +define(`RP', `r4') +define(`XP', `r5') + +define(`F0', `r3') +define(`F1', `r6') +define(`F2', `r7') +define(`F3', `r8') + +define(`U0', `r9') +define(`U1', `r10') +define(`U2', `r11') +define(`U3', `r12') +define(`U4', `r14') +define(`U5', `r15') +define(`U6', `r16') +define(`U7', `r17') + + .file "ecc-secp256r1-redc.asm" + +C FOLD(x), sets (F3,F2,F1,F0) <-- [(x << 224) - (x << 192) - (x << 96)] >> 64 +define(`FOLD', ` + sldi F2, $1, 32 + srdi F3, $1, 32 + li F0, 0 + li F1, 0 + subfc F0, F2, F0 + subfe F1, F3, F1 + subfe F2, $1, F2 + addme F3, F3 +') + + C void ecc_secp256r1_redc (const struct ecc_modulo *p, mp_limb_t *rp, mp_limb_t *xp) + .text +define(`FUNC_ALIGN', `5') +PROLOGUE(_nettle_ecc_secp256r1_redc) + + std U4,-32(SP) + std U5,-24(SP) + std U6,-16(SP) + std U7,-8(SP) + + ld U0, 0(XP) + ld U1, 8(XP) + ld U2, 16(XP) + ld U3, 24(XP) + ld U4, 32(XP) + ld U5, 40(XP) + ld U6, 48(XP) + ld U7, 56(XP) + + FOLD(U0) + subfc U1, F0, U1 + subfe U2, F1, U2 + subfe U3, F2, U3 + subfe U0, F3, U0 + + FOLD(U1) + subfc U2, F0, U2 + subfe U3, F1, U3 + subfe U4, F2, U4 + subfe U1, F3, U1 + + FOLD(U2) + subfc U3, F0, U3 + subfe U4, F1, U4 + subfe U5, F2, U5 + subfe U2, F3, U2 + + FOLD(U3) + subfc U4, F0, U4 + subfe U5, F1, U5 + subfe U6, F2, U6 + subfe U3, F3, U3 + + addc U0, U4, U0 + adde U1, U5, U1 + adde U2, U6, U2 + adde U3, U7, U3 + + C If carry, we need to add in + C 2^256 - p = <0xfffffffe, 0xff..ff, 0xffffffff00000000, 1> + li F0, 0 + addze F0, F0 + neg F2, F0 + sldi F1, F2, 32 + srdi F3, F2, 32 + li U7, -2 + and F3, F3, U7 + + addc U0, F0, U0 + adde U1, F1, U1 + adde U2, F2, U2 + adde U3, F3, U3 + + std U0, 0(RP) + std U1, 8(RP) + std U2, 16(RP) + std U3, 24(RP) + + ld U4,-32(SP) + ld U5,-24(SP) + ld U6,-16(SP) + ld U7,-8(SP) + + blr +EPILOGUE(_nettle_ecc_secp256r1_redc)
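The REDC operation implemented above can be modelled in a few lines of Python. This sketch checks only the arithmetic (each step adds a multiple of p that cancels the low limb, and 2^256 - p matches the constant named in the FOLD comment), not the exact register and borrow sequence of the assembly:

from random import getrandbits

B = 1 << 64
p = 2**256 - 2**224 + 2**192 + 2**96 - 1     # secp256r1 prime, p = -1 (mod B)

# The quantity the FOLD macro is built around:
assert 2**256 - p == 2**224 - 2**192 - 2**96 + 1

for _ in range(1000):
    x = getrandbits(512)
    u = x
    for _ in range(4):                       # one step per low limb
        u0 = u % B
        u = (u + u0 * p) // B                # low limb cancels since p = -1 (mod B)
    assert u < B**4 + p                      # "< 2^256 + p", fixed up by the final conditional add
    assert (u * B**4) % p == x % p           # u = x * B^-4 (mod p), i.e. REDC's result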
Amitay Isaacs amitay@ozlabs.org writes:
--- /dev/null
+++ b/powerpc64/ecc-secp256r1-redc.asm
@@ -0,0 +1,144 @@
+C powerpc64/ecc-secp256r1-redc.asm
+ifelse(`
+    Copyright (C) 2021 Amitay Isaacs & Martin Schwenke, IBM Corporation
+
+    Based on x86_64/ecc-secp256r1-redc.asm
Looks good, and it seems the method follows the x86_64 version closely. I just checked in a correction and a clarification to the comments of the x86_64 version.
A few comments below.
+C Register usage:
+define(`SP', `r1')
+define(`RP', `r4') +define(`XP', `r5')
+define(`F0', `r3') +define(`F1', `r6') +define(`F2', `r7') +define(`F3', `r8')
+define(`U0', `r9') +define(`U1', `r10') +define(`U2', `r11') +define(`U3', `r12') +define(`U4', `r14') +define(`U5', `r15') +define(`U6', `r16') +define(`U7', `r17')
One could save one register by letting U7 and XP overlap, since XP isn't used after loading U7.
+    .file "ecc-secp256r1-redc.asm"
+C FOLD(x), sets (F3,F2,F1,F0) <-- [(x << 224) - (x << 192) - (x << 96)] >> 64
+define(`FOLD', `
+    sldi F2, $1, 32
+    srdi F3, $1, 32
+    li F0, 0
+    li F1, 0
+    subfc F0, F2, F0
+    subfe F1, F3, F1
I think the
  li F0, 0
  li F1, 0
  subfc F0, F2, F0
  subfe F1, F3, F1
could be replaced with
  subfic F0, F2, 0  C "negate with borrow"
  subfze F1, F3
If that is measurably faster, I can't say.
Another option: Since powerpc, like arm, seems to use the proper two's complement convention that "borrow" is not carry, maybe we don't need to negate to F0 and F1 at all, and instead change the later subtraction, replacing
  subfc U1, F0, U1
  subfe U2, F1, U2
  subfe U3, F2, U3
  subfe U0, F3, U0
with
  addc U1, F0, U1
  adde U2, F1, U2
  subfe U3, F2, U3
  subfe U0, F3, U0
I haven't thought that through, but it does make some sense to me. I think the arm code propagates carry through a mix of add and sub instructions in some places. Maybe F2 needs to be incremented somewhere for this to work, but probably still cheaper. If this works, FOLD would turn into something like
  sldi F0, $1, 32
  srdi F1, $1, 32
  subfc F2, $1, F0
  addme F3, F1
(If you want to investigate this later on, that's fine too, I could merge the code with the current folding logic).
+    C If carry, we need to add in
+    C 2^256 - p = <0xfffffffe, 0xff..ff, 0xffffffff00000000, 1>
+    li F0, 0
+    addze F0, F0
+    neg F2, F0
+    sldi F1, F2, 32
+    srdi F3, F2, 32
+    li U7, -2
+    and F3, F3, U7
I think the three instructions to set F3 could be replaced with
  srdi F3, F2, 31
  sldi F3, F3, 1
Or maybe the and operation is faster than shift?
Regards, /Niels
nisse@lysator.liu.se (Niels Möller) writes:
If this works, FOLD would turn into something like
  sldi F0, $1, 32
  srdi F1, $1, 32
  subfc F2, $1, F0
  addme F3, F1
I'm looking at a different approach (experimenting on ARM64, which is quite similar to powerpc, but I don't yet have working code). To understand what the redc code is doing we need to keep in mind that what one folding step does is to compute
<U4,U3,U2,U1,U0> + U0*p
which cancels the low limb, since p = -1 (mod 2^64). So since the low limb always cancel, what we need is
<U4,U3,U2,U1> + U0*((p+1)/2^64)
The x86_64 code does this by splitting U0*p into 2^{256} U0 - (2^{256} - p) * U0, subtracting in the folding step, and adding in the high part later. But one doesn't have to do it that way. One could instead use a FOLD macro that computes
(2^{192} - 2^{160} + 2^{128} + 2^{32}) U0
I also wonder of there's some way to use carry out from one fold step and apply it at the right place while preparing the F0,F1,F2,F3 for the next step.
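A small Python check of that reformulation (a sketch only; it verifies the constant and that one such folding step preserves the residue while cancelling the low limb):

from random import getrandbits

B = 1 << 64
p = 2**256 - 2**224 + 2**192 + 2**96 - 1

# The multiplier the alternative FOLD would use:
assert (p + 1) // B == 2**192 - 2**160 + 2**128 + 2**32

for _ in range(1000):
    limbs = [getrandbits(64) for _ in range(5)]   # <U4,U3,U2,U1,U0>
    u = sum(l * B**i for i, l in enumerate(limbs))
    u0, v = limbs[0], u >> 64                     # v = <U4,U3,U2,U1>
    assert (u + u0 * p) % B == 0                  # low limb cancels, since p = -1 (mod B)
    folded = v + u0 * ((p + 1) // B)
    assert folded == (u + u0 * p) // B            # one folding step, shifted down one limb
    assert (folded * B) % p == u % p              # residue preserved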
Regards, /Niels
nisse@lysator.liu.se (Niels Möller) writes:
I'm looking at a different approach (experimenting on ARM64, which is quite similar to powerpc, but I don't yet have working code). To understand what the redc code is doing we need to keep in mind that what one folding step does is to compute
<U4,U3,U2,U1,U0> + U0*p
which cancels the low limb, since p = -1 (mod 2^64). So since the low limb always cancel, what we need is
<U4,U3,U2,U1> + U0*((p+1)/2^64)
The x86_64 code does this by splitting U0*p into 2^{256} U0 - (2^{256} - p) * U0, subtracting in the folding step, and adding in the high part later. But one doesn't have to do it that way. One could instead use a FOLD macro that computes
(2^{192} - 2^{160} + 2^{128} + 2^{32}) U0
I also wonder of there's some way to use carry out from one fold step and apply it at the right place while preparing the F0,F1,F2,F3 for the next step.
I've got this working now, attaching the version with early carry folding. Also checked in on the branch arm64-ecc. The preceding commit (5ee0839bb28c092044fce09534651b78640518c4) collects carries and adds them in as a separate pass over the data.
I've tested it only with qemu-aarch64; help with benchmarking on real arm64 hardware appreciated (just add the file in the arm64/ directory and run ./config.status --recheck && ./config.status to have the build pick it up).
I think the approach should apply to other 64-bit archs (should probably work also on x86_64, where it's sometimes tricky to avoid x86_64 instructions clobbering the carry flag when it should be preserved, but probably not so difficult in this case).
C arm64/ecc-secp256r1-redc.asm
ifelse(` Copyright (C) 2013, 2021 Niels Möller
This file is part of GNU Nettle.
GNU Nettle is free software: you can redistribute it and/or modify it under the terms of either:
* the GNU Lesser General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version.
or
* the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
or both in parallel, as here.
GNU Nettle is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received copies of the GNU General Public License and the GNU Lesser General Public License along with this program. If not, see http://www.gnu.org/licenses/. ')
.file "ecc-secp256r1-redc.asm"
define(`RP', `x1') define(`XP', `x2')
define(`U0', `x0') C Overlaps unused modulo input define(`U1', `x3') define(`U2', `x4') define(`U3', `x5') define(`U4', `x6') define(`U5', `x7') define(`U6', `x8') define(`U7', `x9') define(`F0', `x10') define(`F1', `x11') define(`F2', `x12') define(`F3', `x13') define(`ZERO', `x14')
C FOLD(x), sets (F3, F2,F1,F0 ) <-- (x << 192) - (x << 160) + (x << 128) + (x << 32) define(`FOLD', ` lsl F0, $1, #32 lsr F1, $1, #32 subs F2, $1, F0 sbc F3, $1, F1 ')
C FOLDC(x), sets (F3, F2,F1,F0) <-- ((x+c) << 192) - (x << 160) + (x << 128) + (x << 32) define(`FOLDC', ` lsl F0, $1, #32 lsr F1, $1, #32 adc F3, $1, ZERO C May overflow, but final result will not. subs F2, $1, F0 sbc F3, F3, F1 ')
PROLOGUE(_nettle_ecc_secp256r1_redc) ldr U0, [XP] ldr U1, [XP, #8] ldr U2, [XP, #16] ldr U3, [XP, #24] ldr U4, [XP, #32] ldr U5, [XP, #40] ldr U6, [XP, #48] ldr U7, [XP, #56] mov ZERO, #0
FOLD(U0) adds U1, U1, F0 adcs U2, U2, F1 adcs U3, U3, F2 adcs U4, U4, F3
FOLDC(U1) adds U2, U2, F0 adcs U3, U3, F1 adcs U4, U4, F2 adcs U5, U5, F3
FOLDC(U2) adds U3, U3, F0 adcs U4, U4, F1 adcs U5, U5, F2 adcs U6, U6, F3
FOLDC(U3) adds U4, U4, F0 adcs U5, U5, F1 adcs U6, U6, F2 adcs U7, U7, F3
C Sum, including carry, is < 2^{256} + p. C If carry, we need to add in 2^{256} mod p = 2^{256} - p C = <0xfffffffe, 0xff..ff, 0xffffffff00000000, 1> C and this addition can not overflow. adc F0, ZERO, ZERO neg F2, F0 lsl F1, F2, #32 lsr F3, F2, #32 and F3, F3, #-2
adds U0, F0, U4 adcs U1, F1, U5 adcs U2, F2, U6 adc U3, F3, U7
str U0, [RP] str U1, [RP, #8] str U2, [RP, #16] str U3, [RP, #24]
ret EPILOGUE(_nettle_ecc_secp256r1_redc)
Regards, /Niels
nisse@lysator.liu.se (Niels Möller) writes:
I think the approach should apply to other 64-bit archs (should probably work also on x86_64, where it's sometimes tricky to avoid x86_64 instructions clobbering the carry flag when it should be preserved, but probably not so difficult in this case).
x86_64 version below. I also trimmed register usage, so it no longer needs to save and restore any registers. On my machine, this gives a speedup of 17% for ecc_secp256r1_redc in isolation, a 3% speedup for ecdsa sign and a 7% speedup for ecdsa verify.
Regards, /Niels
C x86_64/ecc-secp256r1-redc.asm
ifelse(` Copyright (C) 2013, 2021 Niels Möller
This file is part of GNU Nettle.
GNU Nettle is free software: you can redistribute it and/or modify it under the terms of either:
* the GNU Lesser General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version.
or
* the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
or both in parallel, as here.
GNU Nettle is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received copies of the GNU General Public License and the GNU Lesser General Public License along with this program. If not, see http://www.gnu.org/licenses/. ')
.file "ecc-secp256r1-redc.asm"
define(`RP', `%rsi') define(`XP', `%rdx')
define(`U0', `%rdi') C Overlaps unused modulo input define(`U1', `%rcx') define(`U2', `%rax') define(`U3', `%r8') define(`F0', `%r9') define(`F1', `%r10') define(`F2', `%r11') define(`F3', `%rdx') C Overlap XP, used only in final carry folding
C FOLD(x), sets (x,F2,F1,F0 ) <-- (x << 192) - (x << 160) + (x << 128) + (x << 32) define(`FOLD', ` mov $1, F0 mov $1, F1 mov $1, F2 shl `$'32, F0 shr `$'32, F1 sub F0, F2 sbb F1, $1 ') C FOLDC(x), sets (x,F2,F1,F0) <-- ((x+c) << 192) - (x << 160) + (x << 128) + (x << 32) define(`FOLDC', ` mov $1, F0 mov $1, F1 mov $1, F2 adc `$'0, $1 shl `$'32, F0 shr `$'32, F1 sub F0, F2 sbb F1, $1 ') PROLOGUE(_nettle_ecc_secp256r1_redc) W64_ENTRY(3, 0)
mov (XP), U0 FOLD(U0) mov 8(XP), U1 mov 16(XP), U2 mov 24(XP), U3 add F0, U1 adc F1, U2 adc F2, U3 adc 32(XP), U0
FOLDC(U1) add F0, U2 adc F1, U3 adc F2, U0 adc 40(XP), U1
FOLDC(U2) add F0, U3 adc F1, U0 adc F2, U1 adc 48(XP), U2
FOLDC(U3) add F0, U0 adc F1, U1 adc F2, U2 adc 56(XP), U3
C Sum, including carry, is < 2^{256} + p. C If carry, we need to add in 2^{256} mod p = 2^{256} - p C = <0xfffffffe, 0xff..ff, 0xffffffff00000000, 1> C and this addition can not overflow. sbb F2, F2 mov F2, F0 mov F2, F1 mov XREG(F2), XREG(F3) neg F0 shl $32, F1 and $-2, XREG(F3)
add F0, U0 mov U0, (RP) adc F1, U1 mov U1, 8(RP) adc F2, U2 mov U2, 16(RP) adc F3, U3
mov U3, 24(RP)
W64_EXIT(3, 0) ret EPILOGUE(_nettle_ecc_secp256r1_redc)
Hi Niels,
On Mon, 2021-12-06 at 22:29 +0100, Niels Möller wrote:
nisse@lysator.liu.se (Niels Möller) writes:
I think the approach should apply to other 64-bit archs (should probably work also on x86_64, where it's sometimes tricky to avoid x86_64 instructions clobbering the carry flag when it should be preserved, but probably not so difficult in this case).
x86_64 version below. I could also trimmed register usage, so it no longer needs to save and restore any registers. On my machine, this gives a speedup of 17% for ecc_secp256r1_redc in isolation, 3% speedup for ecdsa sign and 7% speedup of ecdsa verify.
On POWER9, the new code gives ~20% speedup for ecc_secp256r1_redc in isolation, and ~1% speedup for ecdsa sign and verify over the earlier assembly version.
Amitay.
Amitay Isaacs amitay@ozlabs.org writes:
On POWER9, the new code gives ~20% speedup for ecc_secp256r1_redc in isolation, and ~1% speedup for ecdsa sign and verify over the earlier assembly version.
Thanks! Merged to master-updates for ci testing.
I think it should be possible to reduce the number of needed registers, and completely avoid using callee-save registers (load the values now in U4-U7 one at a time, a bit closer to the place where they are needed), and replace F3 with $1 in the FOLD and FOLDC macros.
Regards, /Niels
nisse@lysator.liu.se (Niels Möller) writes:
Thanks! Merged to master-updates for ci testing.
And now merged to the master branch.
I think it should be possible to reduce number of needed registers, and completely avoid using callee-save registers (load the values now in U4-U7 one at a time a bit closer to the place where they are needed in), and replace F3 with $1 in the FOLD and FOLDC macros.
Attaching a variant to do this. Passes tests with qemu, but I haven't benchmarked it on any real hardware.
C powerpc64/ecc-secp256r1-redc.asm
ifelse(` Copyright (C) 2021 Amitay Isaacs & Martin Schwenke, IBM Corporation
Based on x86_64/ecc-secp256r1-redc.asm
This file is part of GNU Nettle.
GNU Nettle is free software: you can redistribute it and/or modify it under the terms of either:
* the GNU Lesser General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version.
or
* the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
or both in parallel, as here.
GNU Nettle is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received copies of the GNU General Public License and the GNU Lesser General Public License along with this program. If not, see http://www.gnu.org/licenses/. ')
C Register usage:
define(`RP', `r4') define(`XP', `r5')
define(`F0', `r3') define(`F1', `r6') define(`F2', `r7') define(`T', `r8')
define(`U0', `r9') define(`U1', `r10') define(`U2', `r11') define(`U3', `r12')
.file "ecc-secp256r1-redc.asm"
C FOLD(x), sets (x,F2,F1,F0) <-- [(x << 192) - (x << 160) + (x << 128) + (x <<32)] define(`FOLD', ` sldi F0, $1, 32 srdi F1, $1, 32 subfc F2, F0, $1 subfe $1, F1, $1 ')
C FOLDC(x), sets (x,F2,F1,F0) <-- [((x+c) << 192) - (x << 160) + (x << 128) + (x <<32)] define(`FOLDC', ` sldi F0, $1, 32 srdi F1, $1, 32 addze T, $1 subfc F2, F0, $1 subfe $1, F1, T ')
C void ecc_secp256r1_redc (const struct ecc_modulo *p, mp_limb_t *rp, mp_limb_t *xp) .text define(`FUNC_ALIGN', `5') PROLOGUE(_nettle_ecc_secp256r1_redc)
ld U0, 0(XP) ld U1, 8(XP) ld U2, 16(XP) ld U3, 24(XP)
FOLD(U0) ld T, 32(XP) addc U1, F0, U1 adde U2, F1, U2 adde U3, F2, U3 adde U0, U0, T
FOLDC(U1) ld T, 40(XP) addc U2, F0, U2 adde U3, F1, U3 adde U0, F2, U0 adde U1, U1, T
FOLDC(U2) ld T, 48(XP) addc U3, F0, U3 adde U0, F1, U0 adde U1, F2, U1 adde U2, U2, T
FOLDC(U3) ld T, 56(XP) addc U0, F0, U0 adde U1, F1, U1 adde U2, F2, U2 adde U3, U3, T
C If carry, we need to add in C 2^256 - p = <0xfffffffe, 0xff..ff, 0xffffffff00000000, 1> li F0, 0 addze F0, F0 neg F2, F0 sldi F1, F2, 32 srdi T, F2, 32 li XP, -2 and T, T, XP
addc U0, F0, U0 adde U1, F1, U1 adde U2, F2, U2 adde U3, T, U3
std U0, 0(RP) std U1, 8(RP) std U2, 16(RP) std U3, 24(RP)
blr EPILOGUE(_nettle_ecc_secp256r1_redc)
Regards, /Niels
nisse@lysator.liu.se (Niels Möller) writes:
nisse@lysator.liu.se (Niels Möller) writes:
I think it should be possible to reduce number of needed registers, and completely avoid using callee-save registers (load the values now in U4-U7 one at a time a bit closer to the place where they are needed in), and replace F3 with $1 in the FOLD and FOLDC macros.
Attaching a variant to do this. Passes tests with qemu, but I haven't benchmarked it on any real hardware.
Would you like to test and benchmark this on relevant real hardware, before I merged this version?
Code still below, and committed to the branch ppc-secp256-tweaks.
Regards, /Niels
C powerpc64/ecc-secp256r1-redc.asm
ifelse(` Copyright (C) 2021 Amitay Isaacs & Martin Schwenke, IBM Corporation
Based on x86_64/ecc-secp256r1-redc.asm
This file is part of GNU Nettle.
GNU Nettle is free software: you can redistribute it and/or modify it under the terms of either:
* the GNU Lesser General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version.
or
* the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
or both in parallel, as here.
GNU Nettle is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received copies of the GNU General Public License and the GNU Lesser General Public License along with this program. If not, see http://www.gnu.org/licenses/. ')
C Register usage:
define(`RP', `r4') define(`XP', `r5')
define(`F0', `r3') define(`F1', `r6') define(`F2', `r7') define(`T', `r8')
define(`U0', `r9') define(`U1', `r10') define(`U2', `r11') define(`U3', `r12')
.file "ecc-secp256r1-redc.asm"
C FOLD(x), sets (x,F2,F1,F0) <-- [(x << 192) - (x << 160) + (x << 128) + (x <<32)] define(`FOLD', ` sldi F0, $1, 32 srdi F1, $1, 32 subfc F2, F0, $1 subfe $1, F1, $1 ')
C FOLDC(x), sets (x,F2,F1,F0) <-- [((x+c) << 192) - (x << 160) + (x << 128) + (x <<32)] define(`FOLDC', ` sldi F0, $1, 32 srdi F1, $1, 32 addze T, $1 subfc F2, F0, $1 subfe $1, F1, T ')
C void ecc_secp256r1_redc (const struct ecc_modulo *p, mp_limb_t *rp, mp_limb_t *xp) .text define(`FUNC_ALIGN', `5') PROLOGUE(_nettle_ecc_secp256r1_redc)
ld U0, 0(XP) ld U1, 8(XP) ld U2, 16(XP) ld U3, 24(XP)
FOLD(U0) ld T, 32(XP) addc U1, F0, U1 adde U2, F1, U2 adde U3, F2, U3 adde U0, U0, T
FOLDC(U1) ld T, 40(XP) addc U2, F0, U2 adde U3, F1, U3 adde U0, F2, U0 adde U1, U1, T
FOLDC(U2) ld T, 48(XP) addc U3, F0, U3 adde U0, F1, U0 adde U1, F2, U1 adde U2, U2, T
FOLDC(U3) ld T, 56(XP) addc U0, F0, U0 adde U1, F1, U1 adde U2, F2, U2 adde U3, U3, T
C If carry, we need to add in C 2^256 - p = <0xfffffffe, 0xff..ff, 0xffffffff00000000, 1> li F0, 0 addze F0, F0 neg F2, F0 sldi F1, F2, 32 srdi T, F2, 32 li XP, -2 and T, T, XP
addc U0, F0, U0 adde U1, F1, U1 adde U2, F2, U2 adde U3, T, U3
std U0, 0(RP) std U1, 8(RP) std U2, 16(RP) std U3, 24(RP)
blr EPILOGUE(_nettle_ecc_secp256r1_redc)
Hi Niels,
On Tue, 2022-01-04 at 20:54 +0100, Niels Möller wrote:
nisse@lysator.liu.se (Niels Möller) writes:
nisse@lysator.liu.se (Niels Möller) writes:
I think it should be possible to reduce number of needed registers, and completely avoid using callee-save registers (load the values now in U4-U7 one at a time a bit closer to the place where they are needed in), and replace F3 with $1 in the FOLD and FOLDC macros.
Attaching a variant to do this. Passes tests with qemu, but I haven't benchmarked it on any real hardware.
Would you like to test and benchmark this on relevant real hardware, before I merged this version?
Code still below, and committed to the branch ppc-secp256-tweaks.
Compared to the current version in master branch, this version definitely improves the performance of the reduction code.
On POWER9, the reduction code shows 7% speed up when tested separately.
The improvement in P256 sign/verify is marginal. Here are the numbers from hogweed-benchmark on POWER9.
name   size  sign/ms  verify/ms
ecdsa   256  11.1013     3.5713   (master)
ecdsa   256  11.1527     3.6011   (this patch)
Amitay.
Amitay Isaacs amitay@ozlabs.org writes:
Compared to the current version in master branch, this version definitely improves the performance of the reduction code.
On POWER9, the reduction code shows 7% speed up when tested separately.
The improvement in P256 sign/verify is marginal. Here are the numbers from hogweed-benchmark on POWER9.
name   size  sign/ms  verify/ms
ecdsa   256  11.1013     3.5713   (master)
ecdsa   256  11.1527     3.6011   (this patch)
Thanks for testing. Committed to the master branch now.
Regards, /Niels
On Wed, 2021-12-01 at 22:58 +0100, Niels Möller wrote:
Amitay Isaacs amitay@ozlabs.org writes:
--- /dev/null
+++ b/powerpc64/ecc-secp256r1-redc.asm
@@ -0,0 +1,144 @@
+C powerpc64/ecc-secp256r1-redc.asm
+ifelse(`
+    Copyright (C) 2021 Amitay Isaacs & Martin Schwenke, IBM Corporation
+
+    Based on x86_64/ecc-secp256r1-redc.asm
Looks good, and it seems method follows the x86_64 version closely. I just checked in a correction and a clarification to the comments to the x86_64 version.
A few comments below.
+C Register usage:
+define(`SP', `r1')
+define(`RP', `r4') +define(`XP', `r5')
+define(`F0', `r3') +define(`F1', `r6') +define(`F2', `r7') +define(`F3', `r8')
+define(`U0', `r9') +define(`U1', `r10') +define(`U2', `r11') +define(`U3', `r12') +define(`U4', `r14') +define(`U5', `r15') +define(`U6', `r16') +define(`U7', `r17')
One could save one register by letting U7 and XP overlap, since XP isn't used after loading U7.
+ .file "ecc-secp256r1-redc.asm"
+C FOLD(x), sets (F3,F2,F1,F0) <-- [(x << 224) - (x << 192) - (x << 96)] >> 64
+define(`FOLD', `
+    sldi F2, $1, 32
+    srdi F3, $1, 32
+    li F0, 0
+    li F1, 0
+    subfc F0, F2, F0
+    subfe F1, F3, F1
I think the
  li F0, 0
  li F1, 0
  subfc F0, F2, F0
  subfe F1, F3, F1
could be replaced with
  subfic F0, F2, 0  C "negate with borrow"
  subfze F1, F3
If that is measurably faster, I can't say.
You are quick to find the exactly fitting instruction. Yes, it definitely does the same job with two fewer instructions and gives about a 1% speedup for the reduction code alone.
Another option: Since powerpc, like arm, seems to use the proper two's complement convention that "borrow" is not carry, maybe we don't need to negate to F0 and F1 at all, and instead change the later subtraction, replacing
  subfc U1, F0, U1
  subfe U2, F1, U2
  subfe U3, F2, U3
  subfe U0, F3, U0
with
  addc U1, F0, U1
  adde U2, F1, U2
  subfe U3, F2, U3
  subfe U0, F3, U0
I haven't thought that through, but it does make some sense to me. I think the arm code propagates carry through a mix of add and sub instructions in a some places. Maybe F2 needs to be incremented somewhere for this to work, but probably still cheaper. If this works, FOLD would turn into something like
  sldi F0, $1, 32
  srdi F1, $1, 32
  subfc F2, $1, F0
  addme F3, F1
(If you want to investigate this later on, that's fine too, I could merge the code with the current folding logic).
+    C If carry, we need to add in
+    C 2^256 - p = <0xfffffffe, 0xff..ff, 0xffffffff00000000, 1>
+    li F0, 0
+    addze F0, F0
+    neg F2, F0
+    sldi F1, F2, 32
+    srdi F3, F2, 32
+    li U7, -2
+    and F3, F3, U7
I think the three instructions to set F3 could be replaced with
  srdi F3, F2, 31
  sldi F3, F3, 1
Or maybe the and operation is faster than shift?
Regards, /Niels
I will continue to investigate the suggestions you have made.
Amitay.
From: Martin Schwenke martin@meltin.net
Signed-off-by: Martin Schwenke <martin@meltin.net>
Signed-off-by: Amitay Isaacs <amitay@ozlabs.org>
Signed-off-by: Alastair D'Silva <alastair@d-silva.org>
---
 powerpc64/ecc-secp384r1-modp.asm | 227 +++++++++++++++++++++++++++
 1 file changed, 227 insertions(+)
 create mode 100644 powerpc64/ecc-secp384r1-modp.asm
diff --git a/powerpc64/ecc-secp384r1-modp.asm b/powerpc64/ecc-secp384r1-modp.asm new file mode 100644 index 00000000..67791f09 --- /dev/null +++ b/powerpc64/ecc-secp384r1-modp.asm @@ -0,0 +1,227 @@ +C powerpc64/ecc-secp384r1-modp.asm + +ifelse(` + Copyright (C) 2021 Martin Schwenke, Amitay Isaacs & Alastair D´Silva, IBM Corporation + + Based on x86_64/ecc-secp256r1-redc.asm + + This file is part of GNU Nettle. + + GNU Nettle is free software: you can redistribute it and/or + modify it under the terms of either: + + * the GNU Lesser General Public License as published by the Free + Software Foundation; either version 3 of the License, or (at your + option) any later version. + + or + + * the GNU General Public License as published by the Free + Software Foundation; either version 2 of the License, or (at your + option) any later version. + + or both in parallel, as here. + + GNU Nettle is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + General Public License for more details. + + You should have received copies of the GNU General Public License and + the GNU Lesser General Public License along with this program. If + not, see http://www.gnu.org/licenses/. +') + + .file "ecc-secp384r1-modp.asm" + +C Register usage: + +define(`SP', `r1') + +define(`RP', `r4') +define(`XP', `r5') + +define(`D5', `r6') +define(`T0', `r7') +define(`T1', `r8') +define(`T2', `r9') +define(`T3', `r10') +define(`T4', `r11') +define(`T5', `r12') +define(`H0', `r14') +define(`H1', `r15') +define(`H2', `r16') +define(`H3', `r17') +define(`H4', `r18') +define(`H5', `r19') +define(`C2', `r3') +define(`C0', H5) C Overlap +define(`TMP', XP) C Overlap + + + C void ecc_secp384r1_modp (const struct ecc_modulo *m, mp_limb_t *rp, mp_limb_t *xp) + .text +define(`FUNC_ALIGN', `5') +PROLOGUE(_nettle_ecc_secp384r1_modp) + + std H0, -48(SP) + std H1, -40(SP) + std H2, -32(SP) + std H3, -24(SP) + std H4, -16(SP) + std H5, -8(SP) + + C First get top 2 limbs, which need folding twice. + C B^10 = B^6 + B^4 + 2^32 (B-1)B^4. + C We handle the terms as follow: + C + C B^6: Folded immediatly. + C + C B^4: Delayed, added in in the next folding. + C + C 2^32(B-1) B^4: Low half limb delayed until the next + C folding. Top 1.5 limbs subtracted and shifter now, resulting + C in 2.5 limbs. The low limb saved in D5, high 1.5 limbs added + C in. 
+ + ld H4, 80(XP) + ld H5, 88(XP) + C Shift right 32 bits, into H1, H0 + srdi H1, H5, 32 + sldi D5, H5, 32 + srdi H0, H4, 32 + or H0, H0, D5 + + C H1 H0 + C - H1 H0 + C -------- + C H1 H0 D5 + subfic D5, H0, 0 + subfe H0, H1, H0 + addme H1, H1 + + li C2, 0 + addc H0, H4, H0 + adde H1, H5, H1 + addze C2, C2 + + C Add in to high part + ld T1, 48(XP) + ld T2, 56(XP) + addc H0, T1, H0 + adde H1, T2, H1 + addze C2, C2 C Do C2 later + + C +1 term + ld T0, 0(XP) + ld T1, 8(XP) + ld T2, 16(XP) + ld T3, 24(XP) + ld T4, 32(XP) + ld T5, 40(XP) + ld H2, 64(XP) + ld H3, 72(XP) + addc T0, H0, T0 + adde T1, H1, T1 + adde T2, H2, T2 + adde T3, H3, T3 + adde T4, H4, T4 + adde T5, H5, T5 + li C0, 0 + addze C0, C0 + + C +B^2 term + addc T2, H0, T2 + adde T3, H1, T3 + adde T4, H2, T4 + adde T5, H3, T5 + addze C0, C0 + + C Shift left, including low half of H4 + sldi H4, H4, 32 + srdi TMP, H3, 32 + or H4, TMP, H4 + + sldi H3, H3, 32 + srdi TMP, H2, 32 + or H3, TMP, H3 + + sldi H2, H2, 32 + srdi TMP, H1, 32 + or H2, TMP, H2 + + sldi H1, H1, 32 + srdi TMP, H0, 32 + or H1, TMP, H1 + + sldi H0, H0, 32 + + C H4 H3 H2 H1 H0 0 + C - H4 H3 H2 H1 H0 + C --------------- + C H4 H3 H2 H1 H0 TMP + + subfic TMP, H0, 0 + subfe H0, H1, H0 + subfe H1, H2, H1 + subfe H2, H3, H2 + subfe H3, H4, H3 + addme H4, H4 + + addc T0, TMP, T0 + adde T1, H0, T1 + adde T2, H1, T2 + adde T3, H2, T3 + adde T4, H3, T4 + adde T5, H4, T5 + addze C0, C0 + + C Remains to add in C2 and C0 + C Set H1, H0 = (2^96 - 2^32 + 1) C0 + sldi H1, C0, 32 + subfc H0, H1, C0 + addme H1, H1 + + C Set H3, H2 = (2^96 - 2^32 + 1) C2 + sldi H3, C2, 32 + subfc H2, H3, C2 + addme H3, H3 + addc H2, C0, H2 + + li C0, 0 + addc T0, H0, T0 + adde T1, H1, T1 + adde T2, H2, T2 + adde T3, H3, T3 + adde T4, C2, T4 + adde T5, D5, T5 C Value delayed from initial folding + addze C0, C0 + + C Final unlikely carry + sldi H1, C0, 32 + subfc H0, H1, C0 + addme H1, H1 + + addc T0, H0, T0 + adde T1, H1, T1 + adde T2, C0, T2 + addze T3, T3 + addze T4, T4 + addze T5, T5 + + std T0, 0(RP) + std T1, 8(RP) + std T2, 16(RP) + std T3, 24(RP) + std T4, 32(RP) + std T5, 40(RP) + + ld H0, -48(SP) + ld H1, -40(SP) + ld H2, -32(SP) + ld H3, -24(SP) + ld H4, -16(SP) + ld H5, -8(SP) + + blr +EPILOGUE(_nettle_ecc_secp384r1_modp)
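The folding identities referred to in the comments of the patch above can be checked directly; a short Python sketch (not part of the patch), with B = 2^64 as in the rest of the thread:

B = 1 << 64
p = 2**384 - 2**128 - 2**96 + 2**32 - 1          # secp384r1 prime

# Single-limb folding: B^6 = B^2 + 2^32 (B-1) + 1   (mod p)
assert (B**6 - (B**2 + 2**32 * (B - 1) + 1)) % p == 0

# Top-two-limb folding from the comment: B^10 = B^6 + B^4 + 2^32 (B-1) B^4   (mod p)
assert (B**10 - (B**6 + B**4 + 2**32 * (B - 1) * B**4)) % p == 0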
Amitay Isaacs amitay@ozlabs.org writes:
diff --git a/powerpc64/ecc-secp384r1-modp.asm b/powerpc64/ecc-secp384r1-modp.asm new file mode 100644 index 00000000..67791f09 --- /dev/null +++ b/powerpc64/ecc-secp384r1-modp.asm @@ -0,0 +1,227 @@ +C powerpc64/ecc-secp384r1-modp.asm
This looks nice (and it seems the folding scheme is the same as for the x86_64 version). Just one minor thing,
+define(`FUNC_ALIGN', `5') +PROLOGUE(_nettle_ecc_secp384r1_modp)
+    std H0, -48(SP)
+    std H1, -40(SP)
+    std H2, -32(SP)
+    std H3, -24(SP)
+    std H4, -16(SP)
+    std H5, -8(SP)
I find it clearer to use register names rather than the m4 defines for save and restore of callee-save registers.
Regards, /Niels
On Tue, 2022-01-04 at 21:28 +0100, Niels Möller wrote:
+define(`FUNC_ALIGN', `5') +PROLOGUE(_nettle_ecc_secp384r1_modp)
+    std H0, -48(SP)
+    std H1, -40(SP)
+    std H2, -32(SP)
+    std H3, -24(SP)
+    std H4, -16(SP)
+    std H5, -8(SP)
I find it clearer to use register names rather than the m4 defines for save and restore of callee-save registers.
Here's the modified code which uses the actual registers when saving and restoring from stack.
Amitay.
From: Martin Schwenke martin@meltin.net
Signed-off-by: Martin Schwenke <martin@meltin.net>
Signed-off-by: Alastair D'Silva <alastair@d-silva.org>
---
 powerpc64/ecc-secp521r1-modp.asm | 166 +++++++++++++++++++++++++++
 1 file changed, 166 insertions(+)
 create mode 100644 powerpc64/ecc-secp521r1-modp.asm
diff --git a/powerpc64/ecc-secp521r1-modp.asm b/powerpc64/ecc-secp521r1-modp.asm new file mode 100644 index 00000000..a9d197c9 --- /dev/null +++ b/powerpc64/ecc-secp521r1-modp.asm @@ -0,0 +1,166 @@ +C powerpc64/ecc-secp521r1-modp.asm + +ifelse(` + Copyright (C) 2021 Martin Schwenke & Alastair D´Silva, IBM Corporation + + Based on x86_64/ecc-secp521r1-modp.asm + + This file is part of GNU Nettle. + + GNU Nettle is free software: you can redistribute it and/or + modify it under the terms of either: + + * the GNU Lesser General Public License as published by the Free + Software Foundation; either version 3 of the License, or (at your + option) any later version. + + or + + * the GNU General Public License as published by the Free + Software Foundation; either version 2 of the License, or (at your + option) any later version. + + or both in parallel, as here. + + GNU Nettle is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + General Public License for more details. + + You should have received copies of the GNU General Public License and + the GNU Lesser General Public License along with this program. If + not, see http://www.gnu.org/licenses/. +') + + .file "ecc-secp521r1-modp.asm" + +define(`SP', `r1') + +define(`RP', `r4') +define(`XP', `r5') + +define(`U0', `r6') +define(`U1', `r7') +define(`U2', `r8') +define(`U3', `r9') +define(`U4', `r10') +define(`U5', `r11') +define(`U6', `r12') +define(`U7', `r14') +define(`U8', `r15') +define(`U9', `r16') + +define(`T0', `r3') +define(`T1', `r17') + + + C void ecc_secp521r1_modp (const struct ecc_modulo *p, mp_limb_t *rp, mp_limb_t *xp) + .text +define(`FUNC_ALIGN', `5') +PROLOGUE(_nettle_ecc_secp521r1_modp) + + std U7, -32(SP) + std U8, -24(SP) + std U9, -16(SP) + std T1, -8(SP) + + C Read top 17 limbs, shift left 55 bits + ld U1, 72(XP) + sldi U0, U1, 55 + srdi U1, U1, 9 + + ld T0, 80(XP) + srdi U2, T0, 9 + sldi T0, T0, 55 + or U1, T0, U1 + + ld T0, 88(XP) + srdi U3, T0, 9 + sldi T0, T0, 55 + or U2, T0, U2 + + ld T0, 96(XP) + srdi U4, T0, 9 + sldi T0, T0, 55 + or U3, T0, U3 + + ld T0, 104(XP) + srdi U5, T0, 9 + sldi T0, T0, 55 + or U4, T0, U4 + + ld T0, 112(XP) + srdi U6, T0, 9 + sldi T0, T0, 55 + or U5, T0, U5 + + ld T0, 120(XP) + srdi U7, T0, 9 + sldi T0, T0, 55 + or U6, T0, U6 + + ld T0, 128(XP) + srdi U8, T0, 9 + sldi T0, T0, 55 + or U7, T0, U7 + + ld T0, 136(XP) + srdi U9, T0, 9 + sldi T0, T0, 55 + or U8, T0, U8 + + ld T0, 0(XP) + ld T1, 8(XP) + addc U0, T0, U0 + adde U1, T1, U1 + ld T0, 16(XP) + ld T1, 24(XP) + adde U2, T0, U2 + adde U3, T1, U3 + ld T0, 32(XP) + ld T1, 40(XP) + adde U4, T0, U4 + adde U5, T1, U5 + ld T0, 48(XP) + ld T1, 56(XP) + adde U6, T0, U6 + adde U7, T1, U7 + ld T0, 64(XP) + adde U8, T0, U8 + addze U9, U9 + + C Top limbs are <U9, U8>. Keep low 9 bits of 8, and fold the + C top bits (at most 65 bits). + srdi T0, U8, 9 + andi. U8, U8, 0x1ff + srdi T1, U9, 9 + sldi U9, U9, 55 + or T0, U9, T0 + + addc U0, T0, U0 + adde U1, T1, U1 + addze U2, U2 + addze U3, U3 + addze U4, U4 + addze U5, U5 + addze U6, U6 + addze U7, U7 + addze U8, U8 + + std U0, 0(RP) + std U1, 8(RP) + std U2, 16(RP) + std U3, 24(RP) + std U4, 32(RP) + std U5, 40(RP) + std U6, 48(RP) + std U7, 56(RP) + std U8, 64(RP) + + ld U7, -32(SP) + ld U8, -24(SP) + ld U9, -16(SP) + ld T1, -8(SP) + + blr +EPILOGUE(_nettle_ecc_secp521r1_modp)
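For reference, the reduction above relies on 2^521 = 1 (mod p); a Python sketch of just that identity (the double-length input width here is only illustrative, and the limb-level 55/9-bit shift handling is not modelled):

from random import getrandbits

p = 2**521 - 1                        # secp521r1 prime

for _ in range(1000):
    x = getrandbits(2 * 521)
    hi, lo = x >> 521, x % 2**521
    assert (hi + lo) % p == x % p     # fold the bits above bit 521 back in
    # hi + lo < 2^522, so one more small fold (the srdi/andi./add sequence
    # at the end of the function) leaves a 9-limb result.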
From: Martin Schwenke martin@meltin.net
Signed-off-by: Martin Schwenke <martin@meltin.net>
Signed-off-by: Alastair D'Silva <alastair@d-silva.org>
---
 powerpc64/ecc-curve25519-modp.asm | 103 ++++++++++++++++++++++++++++++
 1 file changed, 103 insertions(+)
 create mode 100644 powerpc64/ecc-curve25519-modp.asm
diff --git a/powerpc64/ecc-curve25519-modp.asm b/powerpc64/ecc-curve25519-modp.asm new file mode 100644 index 00000000..4f9f8f73 --- /dev/null +++ b/powerpc64/ecc-curve25519-modp.asm @@ -0,0 +1,103 @@ +C powerpc64/ecc-25519-modp.asm + +ifelse(` + Copyright (C) 2021 Martin Schwenke & Alastair D´Silva, IBM Corporation + + Based on x86_64/ecc-25519-modp.asm + + This file is part of GNU Nettle. + + GNU Nettle is free software: you can redistribute it and/or + modify it under the terms of either: + + * the GNU Lesser General Public License as published by the Free + Software Foundation; either version 3 of the License, or (at your + option) any later version. + + or + + * the GNU General Public License as published by the Free + Software Foundation; either version 2 of the License, or (at your + option) any later version. + + or both in parallel, as here. + + GNU Nettle is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + General Public License for more details. + + You should have received copies of the GNU General Public License and + the GNU Lesser General Public License along with this program. If + not, see http://www.gnu.org/licenses/. +') + + .file "ecc-25519-modp.asm" + +define(`SP', `r1') + +define(`RP', `r4') +define(`XP', `r5') + +define(`U0', `r6') C Overlaps unused modulo input +define(`U1', `r7') +define(`U2', `r8') +define(`U3', `r9') +define(`T0', `r10') +define(`T1', `r11') +define(`M', `r12') + +define(`UN', r3) + + C void ecc_curve25519_modp (const struct ecc_modulo *p, mp_limb_t *rp, mp_limb_t *xp) + .text +define(`FUNC_ALIGN', `5') +PROLOGUE(_nettle_ecc_curve25519_modp) + + C First fold the limbs affecting bit 255 + ld UN, 56(XP) + li M, 38 + mulhdu T1, M, UN + mulld UN, M, UN + ld U3, 24(XP) + li T0, 0 + addc U3, UN, U3 + adde T0, T1, T0 + + ld UN, 40(XP) + mulhdu U2, M, UN + mulld UN, M, UN + + addc U3, U3, U3 + adde T0, T0, T0 + srdi U3, U3, 1 C Undo shift, clear high bit + + C Fold the high limb again, together with RP[5] + li T1, 19 + mulld T0, T1, T0 + ld U0, 0(XP) + ld U1, 8(XP) + ld T1, 16(XP) + addc U0, T0, U0 + adde U1, UN, U1 + ld T0, 32(XP) + adde U2, U2, T1 + addze U3, U3 + + mulhdu T1, M, T0 + mulld T0, M, T0 + addc U0, T0, U0 + adde U1, T1, U1 + std U0, 0(RP) + std U1, 8(RP) + + ld T0, 48(XP) + mulhdu T1, M, T0 + mulld UN, M, T0 + adde U2, UN, U2 + adde U3, T1, U3 + std U2, 16(RP) + std U3, 24(RP) + + blr +EPILOGUE(_nettle_ecc_curve25519_modp)
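The constants in this patch come from 2^255 = 19 and 2^256 = 38 (mod p). A quick Python check of the basic folding (the register-level handling of the limbs affecting bit 255 in the code above is not modelled here):

from random import getrandbits

p = 2**255 - 19

assert (2**255 - 19) % p == 0
assert (2**256 - 38) % p == 0         # hence the multiplier M = 38

for _ in range(1000):
    x = getrandbits(512)              # double-length input
    hi, lo = x >> 256, x % 2**256
    assert (lo + 38 * hi) % p == x % p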
From: Martin Schwenke martin@meltin.net
Signed-off-by: Martin Schwenke <martin@meltin.net>
Signed-off-by: Amitay Isaacs <amitay@gmail.com>
---
 powerpc64/ecc-curve448-modp.asm | 174 ++++++++++++++++++++++++++++
 1 file changed, 174 insertions(+)
 create mode 100644 powerpc64/ecc-curve448-modp.asm
diff --git a/powerpc64/ecc-curve448-modp.asm b/powerpc64/ecc-curve448-modp.asm new file mode 100644 index 00000000..411c8df7 --- /dev/null +++ b/powerpc64/ecc-curve448-modp.asm @@ -0,0 +1,174 @@ +C powerpc/ecc-curve448-modp.asm + +ifelse(` + Copyright (C) 2021 Martin Schwenke & Amitay Isaacs, IBM Corporation + + Based on x86_64/ecc-curve448-modp.asm + + This file is part of GNU Nettle. + + GNU Nettle is free software: you can redistribute it and/or + modify it under the terms of either: + + * the GNU Lesser General Public License as published by the Free + Software Foundation; either version 3 of the License, or (at your + option) any later version. + + or + + * the GNU General Public License as published by the Free + Software Foundation; either version 2 of the License, or (at your + option) any later version. + + or both in parallel, as here. + + GNU Nettle is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + General Public License for more details. + + You should have received copies of the GNU General Public License and + the GNU Lesser General Public License along with this program. If + not, see http://www.gnu.org/licenses/. +') + + .file "ecc-curve448-modp.asm" + +define(`SP', `r1') + +define(`RP', `r4') +define(`XP', `r5') + +define(`X0', `r3') +define(`X1', `r9') +define(`X2', `r10') +define(`X3', `r11') +define(`X4', `r12') +define(`X5', `r14') +define(`X6', `r15') +define(`X7', `r16') +define(`T0', `r6') +define(`T1', `r7') +define(`T2', `r8') +define(`TT', `r17') + +define(`LO', `TT') C Overlap + + C void ecc_curve448_modp (const struct ecc_modulo *p, mp_limb_t *rp, mp_limb_t *xp) + .text +define(`FUNC_ALIGN', `5') +PROLOGUE(_nettle_ecc_curve448_modp) + + std X5, -32(SP) + std X6, -24(SP) + std X7, -16(SP) + std TT, -8(SP) + + C First load the values to be shifted by 32. + ld T0, 88(XP) C use for X0, X1 + ld T1, 96(XP) C use for X2 + ld T2, 104(XP) C use for X3 + ld X4, 56(XP) + ld X5, 64(XP) + ld X6, 72(XP) + ld X7, 80(XP) + + C Multiply by 2^32 + sldi X0, T0, 32 + srdi LO, T0, 32 + sldi X1, T1, 32 + or X1, X1, LO + srdi LO, T1, 32 + sldi X2, T2, 32 + or X2, X2, LO + srdi LO, T2, 32 + sldi X3, X4, 32 + or X3, X3, LO + srdi LO, X4, 32 + sldi X4, X5, 32 + or X4, X4, LO + srdi LO, X5, 32 + sldi X5, X6, 32 + or X5, X5, LO + srdi LO, X6, 32 + sldi X6, X7, 32 + or X6, X6, LO + + srdi X7, X7, 32 + + C Multiply by 2 + addc T0, T0, T0 + adde T1, T1, T1 + adde T2, T2, T2 + addze X7, X7 + + C Main additions + ld TT, 56(XP) + addc X0, TT, X0 + ld TT, 64(XP) + adde X1, TT, X1 + ld TT, 72(XP) + adde X2, TT, X2 + ld TT, 80(XP) + adde X3, TT, X3 + adde X4, T0, X4 + adde X5, T1, X5 + adde X6, T2, X6 + addze X7, X7 + + ld T0, 0(XP) + addc X0, T0, X0 + ld T1, 8(XP) + adde X1, T1, X1 + ld T2, 16(XP) + adde X2, T2, X2 + ld TT, 24(XP) + adde X3, TT, X3 + ld T0, 32(XP) + adde X4, T0, X4 + ld T1, 40(XP) + adde X5, T1, X5 + ld T2, 48(XP) + adde X6, T2, X6 + addze X7, X7 + + C X7 wraparound + sldi T0, X7, 32 + srdi T1, X7, 32 + li T2, 0 + addc X0, X7, X0 + addze X1, X1 + addze X2, X2 + adde X3, T0, X3 + adde X4, T1, X4 + addze X5, X5 + addze X6, X6 + addze T2, T2 + + C Final carry wraparound. Carry T2 > 0 only if + C X6 is zero, so carry is absorbed. 
+ sldi T0, T2, 32 + + addc X0, T2, X0 + addze X1, X1 + addze X2, X2 + adde X3, T0, X3 + addze X4, X4 + addze X5, X5 + addze X6, X6 + + std X0, 0(RP) + std X1, 8(RP) + std X2, 16(RP) + std X3, 24(RP) + std X4, 32(RP) + std X5, 40(RP) + std X6, 48(RP) + + ld X5, -32(SP) + ld X6, -24(SP) + ld X7, -16(SP) + ld TT, -8(SP) + + blr +EPILOGUE(_nettle_ecc_curve448_modp)
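Similarly for curve448, the wraparound uses 2^448 = 2^224 + 1 (mod p), and since 2^224 = 2^32 * B^3 with B = 2^64, the fold lands in the middle of a limb, which is where all the 32-bit shifts above come from. A quick Python check of the identity (only the arithmetic, not the register-level steps):

from random import getrandbits

B = 1 << 64
p = 2**448 - 2**224 - 1               # curve448 prime

assert 2**224 == 2**32 * B**3

for _ in range(1000):
    x = getrandbits(896)              # double-length input
    hi, lo = x >> 448, x % 2**448
    assert (lo + hi * (2**224 + 1)) % p == x % p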
Amitay Isaacs amitay@ozlabs.org writes:
This series of patches add the powerpc64 assembly for modp/redc functions for elliptic curves P192, P224, P256, P384, P521, X25519 and X448. It results in 15-30% performance improvements as measured on POWER9 system using hogweed-benchmark.
Nice. For testing these functions, I recommend running
while NETTLE_TEST_SEED=0 ./testsuite/ecc-mod-test ; do : ; done
and
while NETTLE_TEST_SEED=0 ./testsuite/ecc-redc-test ; do : ; done
for a few hours.
Regards, /Niels
On Sun, 2021-11-28 at 11:17 +0100, Niels Möller wrote:
Amitay Isaacs amitay@ozlabs.org writes:
This series of patches add the powerpc64 assembly for modp/redc functions for elliptic curves P192, P224, P256, P384, P521, X25519 and X448. It results in 15-30% performance improvements as measured on POWER9 system using hogweed-benchmark.
Nice. For testing these functions, I recommend running
while NETTLE_TEST_SEED=0 ./testsuite/ecc-mod-test ; do : ; done
and
while NETTLE_TEST_SEED=0 ./testsuite/ecc-redc-test ; do : ; done
for a few hours.
Regards, /Niels
I have both tests running for more than 24 hours without failure.
Is there anything else I need to do?
Thanks.
Amitay.