I'm trying to learn a bit of ppc assembly. Below is an implementation of _chacha_core. Seems to work, when tested on gcc112.fsffrance.org (just put the file in the powerpc64 directory and reconfigure). This machine is little-endian, I haven't yet tested on big-endian.
Unfortunately I don't get any accurate benchmark numbers on that machine, but I think the speedup may be on the order of 50%. It could likely be sped up further by processing 2, 3 or 4 blocks in parallel, similar to recent improvements for arm and x86_64. I'd like to do that after the simpler single-block function is properly merged.
I'm not sure where it fits under powerpc64. The code doesn't need any cryptographic extensions, but it depends on vector instructions as well as VSX registers (for the unaligned load and store instructions). So I'd need advice both on the directory hierarchy and compile time configuration, and appropriate runtime tests for fat builds.
Comments on the code are highly appreciated! It's the first ppc code I've written, and the reference manual isn't that easy to navigate. The vector instructions seem very nice to work with, and make for a shorter QROUND than both x86_64 SSE and ARM Neon (which suffer a bit from the lack of a vector rotate instruction).
Help with additional benchmarking would also be useful.
Regards, /Niels
C powerpc64/chacha-core-internal.asm
ifelse(`
   Copyright (C) 2020 Niels Möller and Torbjörn Granlund

   This file is part of GNU Nettle.

   GNU Nettle is free software: you can redistribute it and/or
   modify it under the terms of either:

     * the GNU Lesser General Public License as published by the Free
       Software Foundation; either version 3 of the License, or (at your
       option) any later version.

   or

     * the GNU General Public License as published by the Free
       Software Foundation; either version 2 of the License, or (at your
       option) any later version.

   or both in parallel, as here.

   GNU Nettle is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
   General Public License for more details.

   You should have received copies of the GNU General Public License and
   the GNU Lesser General Public License along with this program.  If
   not, see http://www.gnu.org/licenses/.
')
C Register usage:

C Arguments
define(`DST', `r3')
define(`SRC', `r4')
define(`ROUNDS', `r5')

C Working state
define(`X0', `v0')
define(`X1', `v1')
define(`X2', `v2')
define(`X3', `v3')
define(`ROT16', `v4')
define(`ROT12', `v5')
define(`ROT8', `v6')
define(`ROT7', `v7')

C Original input state
define(`S0', `v8')
define(`S1', `v9')
define(`S2', `v10')
define(`S3', `v11')
C QROUND(X0, X1, X2, X3)
define(`QROUND', `
	C x0 += x1, x3 ^= x0, x3 lrot 16
	C x2 += x3, x1 ^= x2, x1 lrot 12
	C x0 += x1, x3 ^= x0, x3 lrot 8
	C x2 += x3, x1 ^= x2, x1 lrot 7
	vadduwm	$1, $1, $2
	vxor	$4, $4, $1
	vrlw	$4, $4, ROT16

	vadduwm	$3, $3, $4
	vxor	$2, $2, $3
	vrlw	$2, $2, ROT12

	vadduwm	$1, $1, $2
	vxor	$4, $4, $1
	vrlw	$4, $4, ROT8

	vadduwm	$3, $3, $4
	vxor	$2, $2, $3
	vrlw	$2, $2, ROT7
')
	.text
	.align 4
	C _chacha_core(uint32_t *dst, const uint32_t *src, unsigned rounds)

PROLOGUE(_nettle_chacha_core)
	li	r6, 0x10	C set up some...
	li	r7, 0x20	C ...useful...
	li	r8, 0x30	C ...offsets

	vspltisw ROT16, -16	C -16 instead of 16 actually works!
	vspltisw ROT12, 12
	vspltisw ROT8, 8
	vspltisw ROT7, 7

	lxvw4x	VSR(X0), 0, SRC
	lxvw4x	VSR(X1), r6, SRC
	lxvw4x	VSR(X2), r7, SRC
	lxvw4x	VSR(X3), r8, SRC

	vor	S0, X0, X0
	vor	S1, X1, X1
	vor	S2, X2, X2
	vor	S3, X3, X3

	srdi	ROUNDS, ROUNDS, 1
	mtctr	ROUNDS

.Loop:
	QROUND(X0, X1, X2, X3)
	C Rotate rows, to get
	C  0  1  2  3
	C  5  6  7  4  <<< 1
	C 10 11  8  9  <<< 2
	C 15 12 13 14  <<< 3
	vsldoi	X1, X1, X1, 4
	vsldoi	X2, X2, X2, 8
	vsldoi	X3, X3, X3, 12

	QROUND(X0, X1, X2, X3)

	C Inverse rotation
	vsldoi	X1, X1, X1, 12
	vsldoi	X2, X2, X2, 8
	vsldoi	X3, X3, X3, 4

	bdnz	.Loop

	vadduwm	X0, X0, S0
	vadduwm	X1, X1, S1
	vadduwm	X2, X2, S2
	vadduwm	X3, X3, S3

	stxvw4x	VSR(X0), 0, DST
	stxvw4x	VSR(X1), r6, DST
	stxvw4x	VSR(X2), r7, DST
	stxvw4x	VSR(X3), r8, DST

	blr
EPILOGUE(_nettle_chacha_core)
On Thu, Sep 24, 2020 at 3:46 PM Niels Möller nisse@lysator.liu.se wrote:
Unfortunately I don't get any accurate benchmark numbers on that machine, but I think speedup may be on the order of 50%...
Yeah, getting accurate benchmark results is difficult on the compile farm. First, you need to move the machines into performance mode, but you can't because you're not an admin. (A script like https://github.com/weidai11/cryptopp/blob/master/TestScripts/governor.sh will do it, if you are an admin.)
Second, the ISA seems to produce random looking benchmark results. I've never been able to identify good access patterns to produce consistent results. Part of this problem may be powersave mode. Part of it may be mistakes on my part.
Third, to develop somewhat consistent benchmark statistics, repeat the benchmark several times and discard the outliers. I discard both the low and the high outliers. (The low outliers may be valid, but I discard them anyway.)
Also see "GCC135/Power9 performance?", https://lists.tetaneutral.net/pipermail/cfarm-users/2020-April/000556.html. Andy Polyakov joins the conversation and provides his insights.
Jeff
I'm trying to learn a bit of ppc assembly. Below is an implementation of _chacha_core. Seems to work, when tested on gcc112.fsffrance.org (just put the file in the powerpc64 directory and reconfigure). This machine is little-endian, I haven't yet tested on big-endian.
Great work. The implementation looks fine. I like the idea of using -16 instead of 16 for rotating, because vspltisw is limited to the range -16 to 15, and vrlw picks the low-order 5 bits, which are the same for both -16 and 16. BTW, this implementation should work as is in big-endian mode without any hassle, because lxvw4x/stxvw4x are endianness-aware when loading/storing word values.
Unfortunately I don't get any accurate benchmark numbers on that machine, but I think the speedup may be on the order of 50%. It could likely be sped up further by processing 2, 3 or 4 blocks in parallel, similar to recent improvements for arm and x86_64. I'd like to do that after the simpler single-block function is properly merged.
I can benchmark the optimized core but it could take me a few days to get it done, you may want to try Unicamp Minicloud https://openpower.ic.unicamp.br/minicloud or POWER Cloud at OSU http://osuosl.org/services/powerdev Unicamp Minicloud offer good POWER instances and would approve your request in two days.
I'm not sure where it fits under powerpc64. The code doesn't need any cryptographic extensions, but it depends on vector instructions as well as VSX registers (for the unaligned load and store instructions). So I'd need advice both on the directory hierarchy and compile time configuration, and appropriate runtime tests for fat builds.
The VSX instructions were introduced in Power ISA v2.06, so since you use the VSX instructions lxvw4x/stxvw4x, the minimum processor you are targeting is POWER7. We can add a new config option like "--enable-power-vsx" to enable this optimization.
On Fri, Sep 25, 2020 at 7:43 AM Maamoun TK maamoun.tk@googlemail.com wrote:
...
I'm not sure where it fits under powerpc64. The code doesn't need any cryptographic extensions, but it depends on vector instructions as well as VSX registers (for the unaligned load and store instructions). So I'd need advice both on the directory hierarchy and compile time configuration, and appropriate runtime tests for fat builds.
The VSX instructions were introduced in Power ISA v2.06, so since you use the VSX instructions lxvw4x/stxvw4x, the minimum processor you are targeting is POWER7. We can add a new config option like "--enable-power-vsx" to enable this optimization.
I believe the 64-bit adds (vaddudm) and subtracts (vsubudm) require POWER8. POWER7 provides vector unsigned long long (and friends) and the 64-bit loads, but you need POWER8 to do something useful with them.
Or, the 64-bit adds can be performed manually using vector unsigned int with code to manage carry or borrow. It allows you to drop back to POWER4. ChaCha20 is still profitable.
typedef vector unsigned int uint32x4_p;
inline uint32x4_p VecAdd64(const uint32x4_p vec1, const uint32x4_p vec2)
{
    // The carry mask selects carries for elements 1 and 3 and sets
    // the remaining elements to 0. The result is then shifted so the
    // carried values are added to elements 0 and 2.
#if defined(NETTLE_BIG_ENDIAN)
    const uint32x4_p zero = {0, 0, 0, 0};
    const uint32x4_p mask = {0, 1, 0, 1};
#else
    const uint32x4_p zero = {0, 0, 0, 0};
    const uint32x4_p mask = {1, 0, 1, 0};
#endif

    uint32x4_p cy  = vec_addc(vec1, vec2);
    uint32x4_p res = vec_add(vec1, vec2);
    cy = vec_and(mask, cy);
    cy = vec_sld(cy, zero, 4);
    return vec_add(res, cy);
}
Here's the core of a subtract:
uint32x4_p bw  = vec_subc(vec1, vec2);
uint32x4_p res = vec_sub(vec1, vec2);
bw = vec_andc(mask, bw);
bw = vec_sld(bw, zero, 4);
return vec_sub(res, bw);
Jeff
Jeffrey Walton noloader@gmail.com writes:
I believe the 64-bit adds (vaddudm) and subtracts (vsubudm) require POWER8.
I don't think there are any 64-bit adds in my chacha code, only 32-bit, vadduwm. The chacha state is fundamentally 16 32-bit words, with operations very friendly to 4-way simd.
Using 64-bit adds might be useful for later code doing multiple blocks, for updating the counter (for the original 64-bit counter variant of chacha). Might make sense to do manual carry handling to keep it working on power7.
So it would make sense to add the code to a new directory powerpc64/p7/ ?
Regards, /Niels
Yes, it would make sense.
On Fri, Sep 25, 2020 at 10:25 AM Niels Möller nisse@lysator.liu.se wrote:
Using 64-bit adds might be useful for later code doing multiple blocks, for updating the counter (for the original 64-bit counter variant of chacha). Might make sense to do manual carry handling to keep it working on power7.
I hope I'm not crossing my wires, but doesn't the ChaCha core require a counter addition? That's where a 32-bit wrap can occur, and you need a 64-bit add to handle it correctly. That happens at x[12] and x[13] in Bernstein's source code.[1]
Track the use of the PLUSONE macro in Bernstein's code. The '!x->input[12]' is the test for wrap on a 32-bit unsigned integer.
x->input[12] = PLUSONE(x->input[12]);
if (!x->input[12]) {
  x->input[13] = PLUSONE(x->input[13]);
  /* stopping at 2^70 bytes per nonce is user's responsibility */
}
It should be easy enough to test. Start with a counter of 0xfffffff8 and encrypt a couple of [64-byte] blocks. You can use Bernstein's reference implementation to generate test vectors.[1]
Here's a hacked version of Bernstein's code that allows you to set the counter to something other than 0's: https://github.com/noloader/cryptopp-test/blob/master/ChaCha20/chacha.c. See the XXX_ctr_setup function.
There are some fundamental differences between Bernstein's ChaCha and the IETF's ChaCha used in TLS. Bernstein's ChaCha uses a 64-bit counter. The IETF's version uses a 32-bit counter, and the IETF fails to specify what happens when their 32-bit version wraps. Be sure to specify which version Nettle is providing in the docs because it leads to confusion for users.
[1] https://cr.yp.to/chacha.html and https://cr.yp.to/streamciphers/timings/estreambench/submissions/salsa20/chac....
Jeff
On Fri, Sep 25, 2020 at 11:04 AM Jeffrey Walton noloader@gmail.com wrote:
It should be easy enough to test. Start with a counter of 0xfffffff8 and encrypt a couple of [64-byte] blocks. You can use Bernstein's reference implementation to generate test vectors.[1]
My bad. Start with a counter of 0xfffffff8 and encrypt or decrypt 16*64 bytes. That will get you into the corner case.
Here's a hacked version of Bernstein's code that allows you to set the counter to something other than 0's: https://github.com/noloader/cryptopp-test/blob/master/ChaCha20/chacha.c. See the XXX_ctr_setup function.
While not obvious, setting the counter is how you seek in the ChaCha stream. It allows you to encrypt or decrypt an arbitrary 64-byte block.
Jeff
Jeffrey Walton noloader@gmail.com writes:
I hope I'm not crossing my wires, but doesn't ChaCha core require a counter addition?
Sure, but nettle's _chacha_core function (what I've implemented so far for ppc) does a single block, and doesn't modify the counter. Variants like _chacha_3core (currently implemented for ARM Neon only) need to update the counter.
There are some fundamental differences between Bernstein's ChaCha and the IETF's ChaCha used in TLS. Bernstein's ChaCha uses a 64-bit counter.
That's a bit messy, but nettle supports both variants. To use the ietf version, either use the chacha_poly1305_* aead functions, or, for chacha only, the functions chacha_set_nonce96 and chacha_crypt32.
And there are tests for 32-bit wraparound in both cases.
Regards, /Niels
Maamoun TK maamoun.tk@googlemail.com writes:
Great work. The implementation looks fine. I like the idea of using -16 instead of 16 for rotating, because vspltisw is limited to the range -16 to 15, and vrlw picks the low-order 5 bits, which are the same for both -16 and 16.
I picked up that trick from Torbjörn Granlund's code.
BTW, this implementation should work as is in big-endian mode without any hassle, because lxvw4x/stxvw4x are endianness-aware when loading/storing word values.
I've pushed it to a branch ppc-chacha-core. But it fails on big-endian powerpc64, see https://gitlab.com/gnutls/nettle/-/jobs/758348866.
And it looks like the error message from the first failing chacha test is truncated, which makes me suspect some error in function prologue or register usage, resulting in some invalid state when the function returns.
Comparing to your assembly code, I don't set FUNC_ALIGN, is that a problem?
Regards, /Niels
Writing .align explicitly instead of defining FUNC_ALIGN has no negative effects, except the function won't get alignment for big-endian mode. It looks like some additional operations are needed for big-endian mode before storing the results to the 'dst' buffer, as in chacha-core-internal.c:
#ifdef WORDS_BIGENDIAN
#define LE_SWAP32(v)				\
  ((ROTL32(8, v) & 0x00FF00FFUL)		\
   | (ROTL32(24, v) & 0xFF00FF00UL))
#else
#define LE_SWAP32(v) (v)
#endif

for (i = 0; i < _CHACHA_STATE_LENGTH; i++)
  {
    uint32_t t = x[i] + src[i];
    dst[i] = LE_SWAP32 (t);
  }
nisse@lysator.liu.se (Niels Möller) writes:
It could likely be sped up further by processing 2, 3 or 4 blocks in parallel.
I've given 2 blocks in parallel a try, but it's not quite working yet. My work-in-progress code is below.
When I test it on the gcc112 machine, it fails with an illegal instruction (SIGILL) on this line, close to function entry:
   .globl _nettle_chacha_2core
   .type _nettle_chacha_2core,%function
   .align 5
_nettle_chacha_2core:
   addis 2,12,(.TOC.-_nettle_chacha_2core)@ha
   addi 2,2,(.TOC.-_nettle_chacha_2core)@l
   .localentry _nettle_chacha_2core, .-_nettle_chacha_2core

   li r8, 0x30
   vspltisw v1, 1
=> vextractuw v1, v1, 0
I don't understand, from the manual, what's wrong with this. The intention of this piece of code is just to construct the value {1, 0, 0, 0} in one of the vector registers. Maybe there's a better way to do that?
Regards, /Niels
C powerpc64/p7/chacha-core-internal.asm
ifelse(`
   Copyright (C) 2020 Niels Möller and Torbjörn Granlund

   This file is part of GNU Nettle.

   GNU Nettle is free software: you can redistribute it and/or
   modify it under the terms of either:

     * the GNU Lesser General Public License as published by the Free
       Software Foundation; either version 3 of the License, or (at your
       option) any later version.

   or

     * the GNU General Public License as published by the Free
       Software Foundation; either version 2 of the License, or (at your
       option) any later version.

   or both in parallel, as here.

   GNU Nettle is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
   General Public License for more details.

   You should have received copies of the GNU General Public License and
   the GNU Lesser General Public License along with this program.  If
   not, see http://www.gnu.org/licenses/.
')
C Register usage:

C Arguments
define(`DST', `r3')
define(`SRC', `r4')
define(`ROUNDS', `r5')

C State, even elements in X, odd elements in Y
define(`X0', `v0')
define(`X1', `v1')
define(`X2', `v2')
define(`X3', `v3')
define(`Y0', `v4')
define(`Y1', `v5')
define(`Y2', `v6')
define(`Y3', `v7')

define(`ROT16', `v8')
define(`ROT12', `v9')
define(`ROT8', `v10')
define(`ROT7', `v11')

C Original input state
define(`S0', `v12')
define(`S1', `v13')
define(`S2', `v14')
define(`S3', `v15')
define(`S3p1', `v16')
define(`T0', `v17')
	.text
	C _chacha_2core(uint32_t *dst, const uint32_t *src, unsigned rounds)

define(`FUNC_ALIGN', `5')
PROLOGUE(_nettle_chacha_2core)
	li	r8, 0x30	C offset for x3
	vspltisw X1, 1		C {1,1,...,1}
	vextractuw X1, X1, 0	C {1,0,...,0}

	lxvw4x	VSR(X3), r8, SRC

	vnegw	X0, X1
	vcmpequw Y3, X3, X0
	vand	Y3, Y3, X1	C Counter carry out
	vsldoi	Y3, Y3, Y3, 4
	vor	Y3, Y3, X1

.Lshared_entry:
	vadduwm	Y3, Y3, X3

	li	r6, 0x10	C set up some...
	li	r7, 0x20	C ...useful...
	lxvw4x	VSR(X0), 0, SRC
	lxvw4x	VSR(X1), r6, SRC
	lxvw4x	VSR(X2), r7, SRC

	vor	S0, X0, X0
	vor	S1, X1, X1
	vor	S2, X2, X2
	vor	S3, X3, X3
	vor	S3p1, Y3, Y3

	vmrgow	Y0, X0, X0	C  1  1  3  3
	vmrgew	X0, X0, X0	C  0  0  2  2
	vmrgow	Y1, X1, X1	C  5  5  7  7
	vmrgew	X1, X1, X1	C  4  4  6  6
	vmrgow	Y2, X2, X2	C  9  9 11 11
	vmrgew	X2, X2, X2	C  8  8 10 10
	vmrgow	Y3, X3, X3	C 13 13 15 15
	vmrgew	X3, X3, X3	C 12 12 14 14

	vspltisw ROT16, -16	C -16 instead of 16 actually works!
	vspltisw ROT12, 12
	vspltisw ROT8, 8
	vspltisw ROT7, 7
	srdi	ROUNDS, ROUNDS, 1
	mtctr	ROUNDS

.Loop:
	C Register layout (A is first block, B is second block)
	C
	C X0:  A0  B0  A2  B2  Y0:  A1  B1  A3  B3
	C X1:  A4  B4  A6  B6  Y1:  A5  B5  A7  B7
	C X2:  A8  B8 A10 B10  Y2:  A9  B9 A11 B11
	C X3: A12 B12 A14 B14  Y3: A13 B13 A15 B15
	vadduwm	X0, X0, X1
	vadduwm	Y0, Y0, Y1
	vxor	X3, X3, X0
	vxor	Y3, Y3, Y0
	vrlw	X3, X3, ROT16
	vrlw	Y3, Y3, ROT16

	vadduwm	X2, X2, X3
	vadduwm	Y2, Y2, Y3
	vxor	X1, X1, X2
	vxor	Y1, Y1, Y2
	vrlw	X1, X1, ROT12
	vrlw	Y1, Y1, ROT12

	vadduwm	X0, X0, X1
	vadduwm	Y0, Y0, Y1
	vxor	X3, X3, X0
	vxor	Y3, Y3, Y0
	vrlw	X3, X3, ROT8
	vrlw	Y3, Y3, ROT8

	vadduwm	X2, X2, X3
	vadduwm	Y2, Y2, Y3
	vxor	X1, X1, X2
	vxor	Y1, Y1, Y2
	vrlw	X1, X1, ROT7
	vrlw	Y1, Y1, ROT7

	vsldoi	X1, X1, X1, 8
	vsldoi	X2, X2, X2, 8
	vsldoi	Y2, Y2, Y2, 8
	vsldoi	X3, X3, X3, 8

	C Register layout:
	C X0:  A0  B0  A2  B2  Y0:  A1  B1  A3  B3
	C Y1:  A5  B5  A7  B7  X1:  A6  B6  A4  B4  (X1 swapped)
	C X2: A10 B10  A8  B8  Y2: A11 B11  A9  B9  (X2, Y2 swapped)
	C Y3: A15 B15 A13 B13  X3: A12 B12 A14 B14  (X3 swapped)

	vadduwm	X0, X0, Y1
	vadduwm	Y0, Y0, X1
	vxor	Y3, Y3, X0
	vxor	X3, X3, Y0
	vrlw	Y3, Y3, ROT16
	vrlw	X3, X3, ROT16

	vadduwm	X2, X2, Y3
	vadduwm	Y2, Y2, X3
	vxor	Y1, Y1, X2
	vxor	X1, X1, Y2
	vrlw	Y1, Y1, ROT12
	vrlw	X1, X1, ROT12

	vadduwm	X0, X0, Y1
	vadduwm	Y0, Y0, X1
	vxor	Y3, Y3, X0
	vxor	X3, X3, Y0
	vrlw	Y3, Y3, ROT8
	vrlw	X3, X3, ROT8

	vadduwm	X2, X2, Y3
	vadduwm	Y2, Y2, X3
	vxor	Y1, Y1, X2
	vxor	X1, X1, Y2
	vrlw	Y1, Y1, ROT7
	vrlw	X1, X1, ROT7

	vsldoi	X1, X1, X1, 8
	vsldoi	X2, X2, X2, 8
	vsldoi	Y2, Y2, Y2, 8
	vsldoi	X3, X3, X3, 8
	bdnz	.Loop

	vmrghw	T0, X0, Y0
	vmrglw	Y0, X0, Y0

	vmrghw	X0, X1, Y1
	vmrglw	Y1, X1, Y1

	vmrghw	X1, X2, Y2
	vmrglw	Y2, X2, Y2

	vmrghw	X2, X3, Y3
	vmrglw	Y3, X3, Y3

	vadduwm	T0, T0, S0
	vadduwm	Y0, Y0, S0
	vadduwm	X0, X0, S1
	vadduwm	Y1, Y1, S1
	vadduwm	X1, X1, S2
	vadduwm	Y2, Y2, S2
	vadduwm	X2, X2, S3
	vadduwm	Y3, Y3, S3p1

	stxvw4x	VSR(T0), 0, DST
	stxvw4x	VSR(X0), r6, DST
	stxvw4x	VSR(X1), r7, DST
	stxvw4x	VSR(X2), r8, DST

	addi	DST, DST, 64

	stxvw4x	VSR(Y0), 0, DST
	stxvw4x	VSR(Y1), r6, DST
	stxvw4x	VSR(Y2), r7, DST
	stxvw4x	VSR(Y3), r8, DST
	blr
EPILOGUE(_nettle_chacha_2core)
define(`FUNC_ALIGN', `5')
PROLOGUE(_nettle_chacha_2core32)
	li	r8, 0x30	C offset for x3
	vspltisw Y3, 1		C {1,1,...,1}
	vextractuw Y3, Y3, 0	C {1,0,...,0}
	lxvw4x	VSR(X3), r8, SRC
	b	.Lshared_entry
EPILOGUE(_nettle_chacha_2core32)

	.data
	.align 4
.Lcount1:
	.int 1,0,0,0
On Fri, Nov 20, 2020 at 3:40 PM Niels Möller nisse@lysator.liu.se wrote:
When I test it on the gcc112 machine, it fails with an illegal instruction (SIGILL) on this line, close to function entry:
   li r8, 0x30
   vspltisw v1, 1
=> vextractuw v1, v1, 0
I don't understand, from the manual, what's wrong with this. The intention of this piece of code is just to construct the value {1, 0, 0, 0} in one of the vector registers. Maybe there's a better way to do that?
vextractuw is a Power9 instruction and gcc112 is a Power8 system. The processor does not support the instruction.
gcc135 is a Power9 system.
Thanks, David
On Fri, Nov 20, 2020 at 3:40 PM Niels Möller nisse@lysator.liu.se wrote:
=> vextractuw v1, v1, 0
I don't understand, from the manual, what's wrong with this. The intention of this piece of code is just to construct the value {1, 0, 0, 0} in one of the vector registers. Maybe there's a better way to do that?
GCC112 is a POWER8 machine. According to the POWER manual, vextractuw is a POWER9 instruction.
POWER8 manual: https://openpowerfoundation.org/?resource_lib=power8-processor-users-manual POWER9 manual: https://openpowerfoundation.org/?resource_lib=power9-processor-users-manual
Jeff
Jeffrey Walton noloader@gmail.com writes:
GCC112 is a POWER8 machine. According to the POWER manual, vextractuw is a POWER9 instruction.
POWER8 manual: https://openpowerfoundation.org/?resource_lib=power8-processor-users-manual POWER9 manual: https://openpowerfoundation.org/?resource_lib=power9-processor-users-manual
Ooops. I was reading a document titled "Power ISA(tm) Version 3.1". There are changebars indicating changes from version 3.0, which I wasn't paying much attention to. Which POWER version does ISA Version 3.0 correspond to?
I would like to target Power 7 for the chacha code.
Maamoun TK maamoun.tk@googlemail.com writes:
The cheapest replacement I can think of:
vspltisw ZERO, 0		C 0x00000000000000000000000000000000
vspltisw ONE, 1			C 0x00000001000000010000000100000001
vsldoi   ONE, ONE, ZERO, 12	C 0x00000001000000000000000000000000
Thanks, I'll try that.
Regards, /Niels
Please don't target Power7. Please target Power9, or at least Power8.
The PPC64LE Linux ABI specifies Power8 as the minimum ISA.
Power ISA 2.07 is Power8. ISA 3.0 is Power9. ISA 3.1 is Power10.
Thanks, David
On Sat, Nov 21, 2020 at 10:20 AM David Edelsohn dje.gcc@gmail.com wrote:
Please don't target Power7. Please target Power9, or at least Power8.
The PPC64LE Linux ABI specifies Power8 as the minimum ISA.
Power ISA 2.07 is Power8. ISA 3.0 is Power9. ISA 3.1 is Power10.
Small nit... PowerMac G4's and G5's still have a strong following. There's a lot of activity on Debian's PowerPC list.
The G4's and G5's provide Altivec acceleration and the old gcc compiler even accepts -mcpu=power4.
ChaCha is a simple algorithm that benefits from Altivec, even when you manage the 64-bit additions/carries in a 32x4 vector arrangement.
Jeff
On Sat, Nov 21, 2020 at 10:57 AM Jeffrey Walton noloader@gmail.com wrote:
On Sat, Nov 21, 2020 at 10:20 AM David Edelsohn dje.gcc@gmail.com wrote:
Please don't target Power7. Please target Power9, or at least Power8.
The PPC64LE Linux ABI specifies Power8 as the minimum ISA.
Power ISA 2.07 is Power8. ISA 3.0 is Power9. ISA 3.1 is Power10.
Small nit... PowerMac G4's and G5's still have a strong following. There's a lot of activity on Debian's PowerPC list.
The G4's and G5's provide Altivec acceleration and the old gcc compiler even accepts -mcpu=power4.
ChaCha is a simple algorithm that benefits from Altivec, even when you manage the 64-bit additions/carries in a 32x4 vector arrangement.
Small nit: G4 and G5 Macs are not Power7. If an implementation of a cipher targets Power7, it still can use ISA instructions not supported by PowerMacs. If you want to provide an additional implementation for pure Altivec, that's fine.
There is a vocal group of Debian PowerPC users. I greatly appreciate support and advocacy. But the number of actual users is very small. And it's highly unlikely that those users will run the ChaCha cipher in production. The ChaCha implementation is new work, not maintenance of existing support.
If Niels wants to implement an optimized version of a cipher on Power that will be useful in production environments and applied in global businesses, I would recommend that he target Power9. A new, high-performance implementation will be deployed on new systems for new applications or new versions of applications.
Thanks, David
On Sat, Nov 21, 2020 at 11:23 AM David Edelsohn dje.gcc@gmail.com wrote:
Small nit: G4 and G5 Macs are not Power7. If an implementation of a cipher targets Power7, it still can use ISA instructions not supported by PowerMacs. If you want to provide an additional implementation for pure Altivec, that's fine.
Correct.
There is a vocal group of Debian PowerPC users. I greatly appreciate support and advocacy. But the number of actual users is very small. And it's highly unlikely that those users will run the ChaCha cipher in production. The ChaCha implementation is new work, not maintenance of existing support.
If Niels wants to implement an optimized version of a cipher on Power that will be useful in production environments and applied in global businesses, I would recommend that he target Power9. A new, high-performance implementation will be deployed on new systems for new applications or new versions of applications.
When you said the library should not target POWER7, and only target POWER8 and POWER9, I took that to mean the library should not target POWER7 and below.
Altivec and POWER4 is a fine target given the user base. It will even run on POWER7.
An Altivec version of ChaCha is an easy implementation. There are no pain points in implementing it.
Jeff
On Sat, Nov 21, 2020 at 11:32 AM Jeffrey Walton noloader@gmail.com wrote:
On Sat, Nov 21, 2020 at 11:23 AM David Edelsohn dje.gcc@gmail.com wrote:
On Sat, Nov 21, 2020 at 10:57 AM Jeffrey Walton noloader@gmail.com wrote:
On Sat, Nov 21, 2020 at 10:20 AM David Edelsohn dje.gcc@gmail.com wrote:
Please don't target Power7. Please target Power9, or at least Power8.
The PPC64LE Linux ABI specifies Power8 as the minimum ISA.
Power ISA 2.07 is Power8. ISA 3.0 is Power9. ISA 3.1 is Power10.
Small nit... PowerMac G4's and G5's still have a strong following. There's a lot of activity on Debian's PowerPC list.
Nettle can target any processors and ISA levels that it wishes. Niels wrote:
I would like to target Power 7 for the chacha code.
I responded that Power9 (or at least Power8) would be preferred. If Niels wants the implementation to impact production deployments and increase the use of Nettle for cryptography on Power systems, I recommend that he target a more recent level of the ISA. He can target Power7, and Power4, and pure Altivec as well.
Thanks, David
David Edelsohn dje.gcc@gmail.com writes:
I responded that Power9 (or at least Power8) would be preferred. If Niels wants the implementation to impact production deployments and increase the use of Nettle for cryptography on Power systems, I recommend that he target a more recent level of the ISA. He can target Power7, and Power4, and pure Altivec as well.
The basic chacha code I added some months ago uses altivec instructions, and the instructions lxvw4x and stxvw4x (with vsr registers) for load and store, to make it easier to work with data that is only 32-bit aligned.
I put that code under the powerpc64/p7/ directory, under the belief that the code should work fine for all Power7 and later (with the caveat that I don't know to which degree altivec is an optional feature).
It may also be relevant to note that with the current configure script, no Power assembly is used unconditionally by default; it has to be enabled either explicitly with configure arguments, or based on runtime checks, if configured with --enable-fat.
That means that the name of the powerpc64/p7/ directory doesn't matter much technically (I would be fine with renaming it to, e.g., altivec/). But I got the impression from the list discussion that p7/ was reasonable.
And my intention is that the improved chacha code should target the same processor flavors as the existing more basic implementation. So I need to replace the use of the vextractuw instruction (which isn't used in the most performance critical part of the function).
Regards, /Niels
On Fri, Nov 20, 2020 at 10:40 PM Niels Möller nisse@lysator.liu.se wrote:
nisse@lysator.liu.se (Niels Möller) writes:

The intention of this piece of code is just to construct the value {1, 0, 0, 0} in one of the vector registers. Maybe there's a better way to do that?
The cheapest replacement I can think of:
	vspltisw ZERO, 0	C 0x00000000000000000000000000000000
	vspltisw ONE, 1		C 0x00000001000000010000000100000001
	vsldoi	ONE, ONE, ZERO, 12	C 0x00000001000000000000000000000000
regards, Mamone
Niels Möller nisse@lysator.liu.se writes:
It could likely be sped up further by processing 2, 3 or 4 blocks in parallel.
I've given 2 blocks in parallel a try, but not quite working yet. My work-in-progress code below.
I've got it into working shape now, at least for little-endian. See https://git.lysator.liu.se/nettle/nettle/-/blob/ppc-chacha-2core/powerpc64/p...
Next steps:
1. Fix it to work also for big-endian,
2. Wire it up for fat builds.
3. Try out if 4-way gives additional speedup.
Benchmarking is appreciated. Compare the master branch to the ppc-chacha-2core branch, configured with --enable-power-altivec, and run ./examples/nettle-benchmark chacha.
Regards, /Niels
Thank you for your work.
On POWER9 I got the following benchmark result:
./configured:
  chacha encrypt 308.58
  chacha decrypt 325.87
./configured --enable-power-altivec "master branch":
  chacha encrypt 342.15
  chacha decrypt 356.24
./configured --enable-power-altivec "ppc-chacha-2core":
  chacha encrypt 648.97
  chacha decrypt 648.00
It's gotten better with every further optimization on the core, great work.
regards, Mamone
Maamoun TK maamoun.tk@googlemail.com writes:
On POWER9 I got the following benchmark result:
./configured:
  chacha encrypt 308.58
  chacha decrypt 325.87
./configured --enable-power-altivec "master branch":
  chacha encrypt 342.15
  chacha decrypt 356.24
./configured --enable-power-altivec "ppc-chacha-2core":
  chacha encrypt 648.97
  chacha decrypt 648.00
It's gotten better with every further optimization on the core, great work.
Nice. So almost a factor 2 speedup from doing 2 blocks in parallel. I wonder if one can get close to another factor of two by going to 4 blocks. I hope to get the time to try that out, it should be fairly easy. (And if that does work out fine, maybe the code to do only 2 blocks could be removed).
Regards, /Niels
On Wed, Nov 25, 2020 at 3:22 AM Niels Möller nisse@lysator.liu.se wrote:
Botan and Crypto++ use 4x blocks. They usually hit about the same benchmark numbers.
For Crypto++ on GCC112, mixed message sizes:
* ChaCha20: 1200 MB/s, 2.9 cpb
* ChaCha8: 2370 MB/s, 1.5 cpb
On an antique PowerMac G5:
* ChaCha20: 400 MB/s, 4.9 cpb
* ChaCha8: 725 MB/s, 2.6 cpb
Bernstein's results are at https://bench.cr.yp.to/results-stream.html. He's showing 9 cpb on a 2006 IBM PowerPC. His implementation has a lot of opportunities for improvement. Also see https://cr.yp.to/streamciphers/timings/estreambench/submissions/salsa20/chac....
Jeff
Niels Möller nisse@lysator.liu.se writes:
I've got it into working shape now, at least for little-endian. See https://git.lysator.liu.se/nettle/nettle/-/blob/ppc-chacha-2core/powerpc64/p...
Next steps:
Fix it to work also for big-endian,
Wire it up for fat builds.
Done, pushed to the ppc-chacha-2core branch. (I see no obstacles to merging it to the master branch).
Regards, /Niels
Niels Möller nisse@lysator.liu.se writes:
- Try out if 4-way gives additional speedup.
Below code seems to work (but is not yet a drop-in replacement, since it needs some wireup in chacha-crypt.c, and the 32-bit counter variant and BE swapping are not yet implemented). Seems to give almost a factor of 2 speedup over chacha_2core. In theory, it could give slightly more than a factor of 2, since all data shuffling between qrounds (the vsldoi instructions in the chacha_2core.asm main loop) has been eliminated.
Questions:
1. Does the save and restore of registers look correct? I checked the ABI spec, and the intention is to use the part of the 288-byte "Protected zone" below the stack pointer.
2. The use of the QR macro means that there's no careful instruction-level interleaving of independent instructions. Do you think it's beneficial to do manual interleaving (like in chacha_2core.asm), or can it be left to the out-of-order execution logic to sort it out and execute instructions in parallel?
3. Is there any clever way to construct the vector {0,1,2,3} in a register, instead of loading it from memory?
Regards, /Niels
C powerpc64/chacha-4core.asm
ifelse(` Copyright (C) 2020 Niels Möller and Torbjörn Granlund This file is part of GNU Nettle.
GNU Nettle is free software: you can redistribute it and/or modify it under the terms of either:
* the GNU Lesser General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version.
or
* the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
or both in parallel, as here.
GNU Nettle is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received copies of the GNU General Public License and the GNU Lesser General Public License along with this program. If not, see http://www.gnu.org/licenses/. ')
C Register usage:
define(`SP', `r1')
define(`TOCP', `r2')

C Arguments
define(`DST', `r3')
define(`SRC', `r4')
define(`ROUNDS', `r5')
C Working state in v0,...,v15
define(`ROT16', v16)
define(`ROT12', v17)
define(`ROT8', v18)
define(`ROT7', v19)

C During the loop, used to save the original values for last 4 words
C of each block. Also used as temporaries for transpose.
define(`T0', `v20')
define(`T1', `v21')
define(`T2', `v22')
define(`T3', `v23')

C Main loop for round
define(`QR',`
	vadduwm	$1, $1, $2
	vxor	$4, $4, $1
	vrlw	$4, $4, ROT16
	vadduwm	$3, $3, $4
	vxor	$2, $2, $3
	vrlw	$2, $2, ROT12
	vadduwm	$1, $1, $2
	vxor	$4, $4, $1
	vrlw	$4, $4, ROT8
	vadduwm	$3, $3, $4
	vxor	$2, $2, $3
	vrlw	$2, $2, ROT7
')
define(`TRANSPOSE',`
	vmrghw	T0, $1, $3	C A0 A2 B0 B2
	vmrghw	T1, $2, $4	C A1 A3 B1 B3
	vmrglw	T2, $1, $3	C C0 C2 D0 D2
	vmrglw	T3, $2, $4	C C1 C3 D1 D3

	vmrghw	$1, T0, T1	C A0 A1 A2 A3
	vmrglw	$2, T0, T1	C B0 B1 B2 B3
	vmrghw	$3, T2, T3	C C0 C1 C2 C3
	vmrglw	$4, T2, T3	C D0 D1 D2 D3
')
C _chacha_4core(uint32_t *dst, const uint32_t *src, unsigned rounds)

define(`FUNC_ALIGN', `5')
PROLOGUE(_nettle_chacha_4core)
	li	r6, 0x10	C set up some...
	li	r7, 0x20	C ...useful...
	li	r8, 0x30	C ...offsets

	addi	r1, r1, -0x40	C Save callee-save registers
	stvx	v20, 0, r1
	stvx	v21, r6, r1
	stvx	v22, r7, r1
	stvx	v23, r8, r1

	vspltisw ROT16, -16	C -16 instead of 16 actually works!
	vspltisw ROT12, 12
	vspltisw ROT8, 8
	vspltisw ROT7, 7
	C Load state while splatting it, incrementing "pos" fields as we go
	lxvw4x	VSR(v0), 0, SRC		C "expa ..."
	lxvw4x	VSR(v4), r6, SRC	C key
	lxvw4x	VSR(v8), r7, SRC	C key
	lxvw4x	VSR(v12), r8, SRC	C cnt and nonce

	vspltw	v1, v0, 1
	vspltw	v2, v0, 2
	vspltw	v3, v0, 3
	vspltw	v0, v0, 0
	vspltw	v5, v4, 1
	vspltw	v6, v4, 2
	vspltw	v7, v4, 3
	vspltw	v4, v4, 0
	vspltw	v9, v8, 1
	vspltw	v10, v8, 2
	vspltw	v11, v8, 3
	vspltw	v8, v8, 0
	vspltw	v13, v12, 1
	vspltw	v14, v12, 2
	vspltw	v15, v12, 3
	vspltw	v12, v12, 0
	ld	r9, .Lcnts@got(r2)
	lxvw4x	VSR(T0), 0, r9	C increments
	vaddcuw	T1, v12, T0	C compute carry-out
	vadduwm	v12, v12, T0	C low adds
	vadduwm	v13, v13, T1	C apply carries

	C Save all 4x4 of the last words.
	vor	T0, v12, v12	C save pos field until...
	vor	T1, v13, v13	C ...after rounds
	vor	T2, v14, v14
	vor	T3, v15, v15

	srdi	ROUNDS, ROUNDS, 1
	mtctr	ROUNDS
.Loop:
	QR(v0, v4, v8, v12)
	QR(v1, v5, v9, v13)
	QR(v2, v6, v10, v14)
	QR(v3, v7, v11, v15)
	QR(v0, v5, v10, v15)
	QR(v1, v6, v11, v12)
	QR(v2, v7, v8, v13)
	QR(v3, v4, v9, v14)
	bdnz	.Loop
	C Add in saved original words, including counters, before
	C transpose.
	vadduwm	v12, v12, T0
	vadduwm	v13, v13, T1
	vadduwm	v14, v14, T2
	vadduwm	v15, v15, T3

	TRANSPOSE(v0, v1, v2, v3)
	TRANSPOSE(v4, v5, v6, v7)
	TRANSPOSE(v8, v9, v10, v11)
	TRANSPOSE(v12, v13, v14, v15)

	lxvw4x	VSR(T0), 0, SRC
	lxvw4x	VSR(T1), r6, SRC
	lxvw4x	VSR(T2), r7, SRC
	vadduwm	v0, v0, T0
	vadduwm	v1, v1, T0
	vadduwm	v2, v2, T0
	vadduwm	v3, v3, T0

	vadduwm	v4, v4, T1
	vadduwm	v5, v5, T1
	vadduwm	v6, v6, T1
	vadduwm	v7, v7, T1

	vadduwm	v8, v8, T2
	vadduwm	v9, v9, T2
	vadduwm	v10, v10, T2
	vadduwm	v11, v11, T2
	stxvw4x	VSR(v0), 0, DST
	stxvw4x	VSR(v4), r6, DST
	stxvw4x	VSR(v8), r7, DST
	stxvw4x	VSR(v12), r8, DST

	addi	DST, DST, 64

	stxvw4x	VSR(v1), 0, DST
	stxvw4x	VSR(v5), r6, DST
	stxvw4x	VSR(v9), r7, DST
	stxvw4x	VSR(v13), r8, DST

	addi	DST, DST, 64

	stxvw4x	VSR(v2), 0, DST
	stxvw4x	VSR(v6), r6, DST
	stxvw4x	VSR(v10), r7, DST
	stxvw4x	VSR(v14), r8, DST

	addi	DST, DST, 64

	stxvw4x	VSR(v3), 0, DST
	stxvw4x	VSR(v7), r6, DST
	stxvw4x	VSR(v11), r7, DST
	stxvw4x	VSR(v15), r8, DST
	C Restore callee-save registers
	lvx	v20, 0, r1
	lvx	v21, r6, r1
	lvx	v22, r7, r1
	lvx	v23, r8, r1
	addi	r1, r1, 0x40

	blr
EPILOGUE(_nettle_chacha_4core)

	.section .rodata
	ALIGN(16)
.Lcnts:
	.long 0, 1, 2, 3	C increments
On Mon, Nov 30, 2020 at 12:37 PM Niels Möller nisse@lysator.liu.se wrote:
Niels Möller nisse@lysator.liu.se writes:
- Does the save and restore of registers look correct? I checked the abi spec, and the intention is to use the part of the 288 byte "Protected zone" below the stack pointer.
There are requirements that should be applied when modifying the stack pointer register; I will add the needed rules from https://refspecs.linuxfoundation.org/ELF/ppc64/PPC-elf64abi-1.9.html:
- The stack pointer shall maintain quadword alignment.
- The stack pointer shall point to the first word of the lowest allocated stack frame, the "back chain" word. The stack shall grow downward, that is, toward lower addresses. The first word of the stack frame shall always point to the previously allocated stack frame (toward higher addresses), except for the first stack frame, which shall have a back chain of 0 (NULL).
- The stack pointer shall be decremented and the back chain updated atomically using one of the "Store Double Word with Update" instructions, so that the stack pointer always points to the beginning of a linked list of stack frames.
So to modify r1 you have to allocate an additional 8 bytes in the stack to store the old value of r1. The register store sequence will look like:
	li	r6, 0x10	C set up some...
	li	r7, 0x20	C ...useful...
	li	r8, 0x30	C ...offsets
	li	r9, 0x40	C ...offsets

	stdu	r1, -0x50(r1)	C Save callee-save registers
	stvx	v20, r6, r1
	stvx	v21, r7, r1
	stvx	v22, r8, r1
	stvx	v23, r9, r1
note that the allocated size is rounded up to a multiple of 16 bytes, so that quadword stack alignment is maintained.
and the register restore sequence will look like:
	lvx	v20, r6, r1
	lvx	v21, r7, r1
	lvx	v22, r8, r1
	lvx	v23, r9, r1
	addi	r1, r1, 0x50
BTW, since no function is called while the stack frame is modified, I think it's fine not to follow the rules and keep the store and restore sequences as they are, without any modification.
2. The use of the QR macro means that there's no careful instruction-level interleaving of independent instructions. Do you think it's beneficial to do manual interleaving (like in chacha_2core.asm), or can it be left to the out-of-order execution logic to sort it out and execute instructions in parallel?
You'll get performance benefits by interleaving the independent instructions in this case, I can estimate the increase of performance around 20%-30%.
- Is there any clever way to construct the vector {0,1,2,3} in a register, instead of loading it from memory?
I can think of this method:
	li	r10, 0
	lvsl	T0, 0, r10	C 0x000102030405060708090A0B0C0D0E0F
	vupkhsb	T0, T0		C 0x00000001000200030004000500060007
	vupkhsh	T0, T0		C 0x00000000000000010000000200000003
regards, Mamone
On Mon, Nov 30, 2020 at 10:07 PM Maamoun TK maamoun.tk@googlemail.com wrote:
BTW since there is no function called while the register of the stack frame is modified, I think it's fine to not follow the rules and keep the store and restore sequences as are without any modification.
I'm thinking about what could happen if an exception is raised while the stack frame is modified incorrectly: the exception handler will try to look at the calling function, but it can't get the previous state of the stack pointer because the stack pointer doesn't point to it, and that will mess up the exception handling procedure. So we can't ignore the rules whatsoever, and we have to modify the stack frame correctly.
regards, Mamone
Maamoun TK maamoun.tk@googlemail.com writes:
I'm thinking what could happen if an exception raised while the stack frame is modified incorrectly, the exception handler will try to look at the calling function but it can't get the previous state of stack pointer because the stack pointer doesn't point to it and that will mess the exception handling procedure. So we can't ignore the rules whatsoever and we have to modify the stack frame correctly.
Hmm. I agree just lowering the stack pointer sounds a bit questionable. But if we use some other register to point into the protected zone, we should be fine? E.g.,
	addi	r10, r1, -0x40	C Save callee-save registers
	stvx	v20, 0, r10
	stvx	v21, r6, r10
	stvx	v22, r7, r10
	stvx	v23, r8, r10
Regards, /Niels
On Mon, Nov 30, 2020 at 10:56 PM Niels Möller nisse@lysator.liu.se wrote:
Hmm. I agree just lowering the stack pointer sounds a bit questionable. But if we use some other register to point into the protected zone, we should be fine? E.g.,
	addi	r10, r1, -0x40	C Save callee-save registers
	stvx	v20, 0, r10
	stvx	v21, r6, r10
	stvx	v22, r7, r10
	stvx	v23, r8, r10
This is totally fine.
regards, Mamone
Maamoun TK maamoun.tk@googlemail.com writes:
This is totally fine.
I changed it to do this (and it looks like you use the protected zone as a save area also in the new aes code).
How portable is this, do all relevant operating systems support storing data below the stack pointer?
Regards, /Niels
On Tue, Dec 1, 2020 at 8:02 PM Niels Möller nisse@lysator.liu.se wrote:
How portable is this, do all relevant operating systems support storing data below the stack pointer?
I need to investigate this.
regards, Mamone
On Wed, Dec 2, 2020 at 9:41 AM Maamoun TK maamoun.tk@googlemail.com wrote:
On Tue, Dec 1, 2020 at 8:02 PM Niels Möller nisse@lysator.liu.se wrote:
How portable is this, do all relevant operating systems support storing data below the stack pointer?
I need to investigate this.
It's dependent upon the ABI.
Thanks, David
I can't find a document other than the 64-bit ELF v2 ABI specification https://openpowerfoundation.org/?resource_lib=64-bit-elf-v2-abi-specificatio... which says it's safe to use the 288-byte volatile storage below the stack pointer to hold saved registers and local variables. However, I wrote a C file for the test and disassembled the compiled binary on ELFv1, ELFv2, and AIX. All of them hold the saved registers right below the stack pointer. Furthermore, same as we did, the compiler tries to avoid modifying the stack pointer register when possible. The prologue of the tested binary looks like this:

	std	r30, -16(r1)
	std	r31, -8(r1)
	li	r0, -80
	stvx	v28, r1, r0
	li	r0, -64
	stvx	v29, r1, r0
	li	r0, -48
	stvx	v30, r1, r0
	li	r0, -32
	stvx	v31, r1, r0
regards, Mamone
Hi, Maamoun
I thought that you were asking in general. All PowerPC ABIs, except the original 32-bit ELF ABI, allow a red zone below the stack pointer. For other architectures, one needs to check each ABI.
Thanks, David
David Edelsohn dje.gcc@gmail.com writes:
I thought that you were asking in general. All PowerPC ABI, except the original 32 bit ELF ABI, allow a red zone below the stack pointer. For other architectures, one needs to check each ABI.
Do any of you know what ABI was used on Macs with 64-bit PowerPC processors (starting from https://en.wikipedia.org/wiki/Power_Mac_G5, if I understand it correctly)? Probably not worth much effort to support these, but it would be good to at least know if the new assembly files are compatible with that ABI or not.
Regards, /Niels
Apple Darwin on PPC has its own ABI.
The Power Mac G5 processor (PPC970) supported the initial Altivec ISA.
Thanks, David
Niels Möller nisse@lysator.liu.se writes:
Below code seems to work (but is not yet a drop-in replacement, since it needs some wireup in chacha.crypt.c, and 32-bit counter variant and BE swapping not yet implemented).
I fixed these issues, as well as fat build support. Pushed to the branch ppc-chacha-4core. Seems to work fine (but the issue with a possibly bad use of the stack pointer is not yet fixed). Please try it out.
Regards, /Niels
On POWER9 I get the following benchmark with "./configure --enable-power-altivec":

chacha encrypt 763.57
chacha decrypt 780.64
regards, Mamone
On Mon, Nov 30, 2020 at 11:18 PM Maamoun TK maamoun.tk@googlemail.com wrote:
I got this result using the ppc-chacha-2core branch on the same machine:

chacha encrypt 565.79
chacha decrypt 582.10
Maamoun TK maamoun.tk@googlemail.com writes:
I got this result using ppc-chacha-2core branch on same machine:
chacha encrypt 565.79
chacha decrypt 582.10
Thanks for testing! That's a nice speedup, but a bit less than the factor of two I was hoping for. Maybe better interleaving can help.
BTW, the chacha_2core code is merged to the master branch now.
Regards, /Niels
Maamoun TK maamoun.tk@googlemail.com writes:
I got this result using ppc-chacha-2core branch on same machine:
chacha encrypt 565.79
chacha decrypt 582.10
I've tried running the benchmark on gcc135, and that gives me much more consistent values than gcc112. The 2-way code (currently on the master branch) gives 686 MByte/s. The 4-way code you tried gives 958 MByte/s. I then replaced the inner loop with a version with better interleaving, written by Torbjörn Granlund (just pushed to the branch). That gives 1225 MByte/s.
And for reference, the plain C implementation gives 363 MByte/s.
Regards, /Niels