On Wed, 2021-12-01 at 22:58 +0100, Niels Möller wrote:
Amitay Isaacs amitay@ozlabs.org writes:
--- /dev/null +++ b/powerpc64/ecc-secp256r1-redc.asm @@ -0,0 +1,144 @@ +C powerpc64/ecc-secp256r1-redc.asm +ifelse(` + Copyright (C) 2021 Amitay Isaacs & Martin Schwenke, IBM Corporation
+ Based on x86_64/ecc-secp256r1-redc.asm
Looks good, and it seems method follows the x86_64 version closely. I just checked in a correction and a clarification to the comments to the x86_64 version.
A few comments below.
+C Register usage:
+define(`SP', `r1')
+define(`RP', `r4') +define(`XP', `r5')
+define(`F0', `r3') +define(`F1', `r6') +define(`F2', `r7') +define(`F3', `r8')
+define(`U0', `r9') +define(`U1', `r10') +define(`U2', `r11') +define(`U3', `r12') +define(`U4', `r14') +define(`U5', `r15') +define(`U6', `r16') +define(`U7', `r17')
One could save one register by letting U7 and XP overlap, since XP isn't used after loading U7.
+ .file "ecc-secp256r1-redc.asm"
+C FOLD(x), sets (F3,F2,F1,F0) <-- [(x << 224) - (x << 192) - (x << 96)] >> 64 +define(`FOLD', ` + sldi F2, $1, 32 + srdi F3, $1, 32 + li F0, 0 + li F1, 0 + subfc F0, F2, F0 + subfe F1, F3, F1
I think the
li F0, 0 li F1, 0 subfc F0, F2, F0 subfe F1, F3, F1
could be replaced with
subfic F0, F2, 0 C "negate with borrow" subfze F1, F3
If that is measurably faster, I can't say.
You are quick to find the exactly fitting instruction. Yes, it definitely does the same job with two less instructions and gives about 1% speedup for only reduction code.
Another option: Since powerpc, like arm, seems to use the proper two's complement convention that "borrow" is not carry, maybe we don't need to negate to F0 and F1 at all, and instead change the later subtraction, replacing
subfc U1, F0, U1 subfe U2, F1, U2 subfe U3, F2, U3 subfe U0, F3, U0
with
addc U1, F0, U1 adde U2, F1, U2 subfe U3, F2, U3 subfe U0, F3, U0
I haven't thought that through, but it does make some sense to me. I think the arm code propagates carry through a mix of add and sub instructions in a some places. Maybe F2 needs to be incremented somewhere for this to work, but probably still cheaper. If this works, FOLD would turn into something like
sldi F0, $1, 32 srdi F1, $1, 32 subfc F2, $1, F0 addme F3, F1
(If you want to investigate this later on, that's fine too, I could merge the code with the current folding logic).
+ C If carry, we need to add in + C 2^256 - p = <0xfffffffe, 0xff..ff, 0xffffffff00000000, 1> + li F0, 0 + addze F0, F0 + neg F2, F0 + sldi F1, F2, 32 + srdi F3, F2, 32 + li U7, -2 + and F3, F3, U7
I think the three instructions to set F3 could be replaced with
srdi F3, F2, 31 sldi F3, F3, 1
Or maybe the and operation is faster than shift?
Regards, /Niels
I will continue to investigate the suggestions you have made.
Amitay.