Re: [PATCH 3/7] ecc: Add powerpc64 assembly for ecc_256_redc

7 Dec 2021

On Wed, 2021-12-01 at 22:58 +0100, Niels Möller wrote:
...
Amitay Isaacs amitay@ozlabs.org writes:
...

--- /dev/null
+++ b/powerpc64/ecc-secp256r1-redc.asm
@@ -0,0 +1,144 @@
+C powerpc64/ecc-secp256r1-redc.asm
+ifelse(`
+   Copyright (C) 2021 Amitay Isaacs & Martin Schwenke, IBM
Corporation



+   Based on x86_64/ecc-secp256r1-redc.asm
Looks good, and it seems method follows the x86_64 version closely. I
just checked in a correction and a clarification to the comments to
the
x86_64 version.
A few comments below.
...
+C Register usage:



+define(`SP', `r1')



+define(`RP', `r4')
+define(`XP', `r5')



+define(`F0', `r3')
+define(`F1', `r6')
+define(`F2', `r7')
+define(`F3', `r8')



+define(`U0', `r9')
+define(`U1', `r10')
+define(`U2', `r11')
+define(`U3', `r12')
+define(`U4', `r14')
+define(`U5', `r15')
+define(`U6', `r16')
+define(`U7', `r17')
One could save one register by letting U7 and XP overlap, since XP
isn't
used after loading U7.
...
+       .file "ecc-secp256r1-redc.asm"



+C FOLD(x), sets (F3,F2,F1,F0)  <-- [(x << 224) - (x << 192) - (x
<< 96)] >> 64
+define(`FOLD', `
+       sldi    F2, $1, 32
+       srdi    F3, $1, 32
+       li      F0, 0
+       li      F1, 0
+       subfc   F0, F2, F0
+       subfe   F1, F3, F1
I think the
li      F0, 0
        li      F1, 0
        subfc   F0, F2, F0
        subfe   F1, F3, F1
could be replaced with
subfic  F0, F2, 0    C "negate with borrow"
        subfze  F1, F3
If that is measurably faster, I can't say.
You are quick to find the exactly fitting instruction.  Yes, it
definitely does the same job with two less instructions and gives about
1% speedup for only reduction code.
...
Another option: Since powerpc, like arm, seems to use the proper
two's
complement convention that "borrow" is not carry, maybe we don't need
to
negate to F0 and F1 at all, and instead change the later subtraction,
replacing
subfc   U1, F0, U1
        subfe   U2, F1, U2
        subfe   U3, F2, U3
        subfe   U0, F3, U0
with
addc    U1, F0, U1
        adde    U2, F1, U2
        subfe   U3, F2, U3
        subfe   U0, F3, U0
I haven't thought that through, but it does make some sense to me. I
think the arm code propagates carry through a mix of add and sub
instructions in a some places. Maybe F2 needs to be incremented
somewhere for this to work, but probably still cheaper. If this
works,
FOLD would turn into something like
sldi    F0, $1, 32
        srdi    F1, $1, 32
        subfc   F2, $1, F0
        addme   F3, F1
(If you want to investigate this later on, that's fine too, I could
merge
the code with the current folding logic).
...
+       C If carry, we need to add in
+       C 2^256 - p = <0xfffffffe, 0xff..ff, 0xffffffff00000000, 1>
+       li      F0, 0
+       addze   F0, F0
+       neg     F2, F0
+       sldi    F1, F2, 32
+       srdi    F3, F2, 32
+       li      U7, -2
+       and     F3, F3, U7
I think the three instructions to set F3 could be replaced with
srdi    F3, F2, 31
        sldi    F3, F3, 1
Or maybe the and operation is faster than shift?
Regards,
/Niels
I will continue to investigate the suggestions you have made.
Amitay.
-- 

There are two times in a man's life when he should not speculate: when
he
can't afford it, and when he can. - Mark Twain

    

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

2005

2004

2003

2002

Re: [PATCH 3/7] ecc: Add powerpc64 assembly for ecc_256_redc