The last patch follows the C implementation but I just figured out a decent
way to do it.
---
 powerpc64/p7/chacha-core-internal.asm | 22 +++++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/powerpc64/p7/chacha-core-internal.asm b/powerpc64/p7/chacha-core-internal.asm
index 33c721c1..76ca0d45 100644
--- a/powerpc64/p7/chacha-core-internal.asm
+++ b/powerpc64/p7/chacha-core-internal.asm
@@ -53,6 +53,10 @@ define(`S1', `v9')
 define(`S2', `v10')
 define(`S3', `v11')
 
+C Big-endian working state
+define(`LE_MASK', `v12')
+define(`LE_TEMP', `v13')
+
 C QROUND(X0, X1, X2, X3)
 define(`QROUND', `
 	C x0 += x1, x3 ^= x0, x3 lrot 16
@@ -77,10 +81,18 @@ define(`QROUND', `
 	vrlw	$2, $2, ROT7
 ')
 
+C LE_SWAP32(X0, X1, X2, X3)
+define(`LE_SWAP32', `IF_BE(`
+	vperm	X0, X0, X0, LE_MASK
+	vperm	X1, X1, X1, LE_MASK
+	vperm	X2, X2, X2, LE_MASK
+	vperm	X3, X3, X3, LE_MASK
+')')
+
 	.text
-	.align 4
 	C _chacha_core(uint32_t *dst, const uint32_t *src, unsigned rounds)
+define(`FUNC_ALIGN', `5')
 PROLOGUE(_nettle_chacha_core)
 	li	r6, 0x10	C set up some...
@@ -91,6 +103,12 @@ PROLOGUE(_nettle_chacha_core)
 	vspltisw ROT12, 12
 	vspltisw ROT8, 8
 	vspltisw ROT7, 7
+IF_BE(`
+	li	r9, 0
+	lvsl	LE_MASK, r9, r9
+	vspltisb LE_TEMP, 0x03
+	vxor	LE_MASK, LE_MASK, LE_TEMP
+')
 
 	lxvw4x	VSR(X0), 0, SRC
 	lxvw4x	VSR(X1), r6, SRC
@@ -131,6 +149,8 @@ PROLOGUE(_nettle_chacha_core)
 	vadduwm	X2, X2, S2
 	vadduwm	X3, X3, S3
 
+	LE_SWAP32(X0, X1, X2, X3)
+
 	stxvw4x	VSR(X0), 0, DST
 	stxvw4x	VSR(X1), r6, DST
 	stxvw4x	VSR(X2), r7, DST
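For readers comparing with the C code: ChaCha output is defined as a sequence of little-endian 32-bit words, so on a big-endian machine each word of the final state must be byte-reversed before being stored, which is what the added LE_SWAP32 macro does with vperm. A portable C sketch of the same operation (helper names are illustrative, not Nettle's API):

```c
#include <stdint.h>

/* Byte-reverse one 32-bit word -- the per-word effect of the vperm
   with LE_MASK in the patch above. */
static uint32_t swap32(uint32_t x)
{
  return (x << 24) | ((x & 0xff00u) << 8)
       | ((x >> 8) & 0xff00u) | (x >> 24);
}

/* Scalar equivalent of LE_SWAP32 over the 16-word ChaCha state. */
static void le_swap32_state(uint32_t state[16])
{
  for (int i = 0; i < 16; i++)
    state[i] = swap32(state[i]);
}
```

On little-endian machines IF_BE expands to nothing and no swap is needed, which is why the assembly guards the whole macro body.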
Maamoun TK <maamoun.tk@googlemail.com> writes:

> The last patch follows the C implementation but I just figured out a
> decent way to do it.

Thanks! Applied, and pushed on the ppc-chacha-core branch for testing.
(Had to apply it semi-manually, since the file to patch indents using
TAB, and those were replaced by spaces in the emailed patch).
> +IF_BE(`
> +	li	r9, 0
> +	lvsl	LE_MASK, r9, r9
> +	vspltisb LE_TEMP, 0x03
> +	vxor	LE_MASK, LE_MASK, LE_TEMP
> +')
I think this deserves some comments, on what goes into the register in each step. Clever that the endian conversion corresponds to xoring the byte indices with 3.
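The trick can be checked directly: XOR-ing a byte index with 3 flips its low two bits, mapping byte b of each 4-byte word to byte 3-b of the same word, i.e. exactly a per-word byte reversal. A small sketch (function names are illustrative only):

```c
/* lvsl with sh = 0 yields the identity byte permutation 0,1,...,15.
   XOR-ing each index with 3 (what vxor with a splat of 3 does) maps
   0,1,2,3 -> 3,2,1,0 within every word. */
static int le_index(int i)
{
  return i ^ 3;
}

/* Reference: byte (3 - b) of the same word, computed without the
   xor trick, for comparison. */
static int le_index_ref(int i)
{
  return 4 * (i / 4) + (3 - i % 4);
}
```

The two agree for all sixteen byte positions because 4*(i/4) is i with its low two bits cleared, and for a 2-bit value b, b ^ 3 equals 3 - b.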
Regards, /Niels
nisse@lysator.liu.se (Niels Möller) writes:

> Maamoun TK <maamoun.tk@googlemail.com> writes:
>
>> The last patch follows the C implementation but I just figured out a
>> decent way to do it.
>
> Thanks! Applied, and pushed on the ppc-chacha-core branch for testing.
> (Had to apply it semi-manually, since the file to patch indents using
> TAB, and those were replaced by spaces in the emailed patch).

And tests seem to pass also on big-endian. Nice!
Regards, /Niels
Sure. According to Power ISA 2.07: "The lvsl and lvsr instructions can
be used to create the permute control vector to be used by a subsequent
vperm instruction."

So the lvsl and lvsr instructions check the 'sh' value in order to fill
the vector register. If 'sh' is 0, the vector register is populated as
follows:

    0x000102030405060708090A0B0C0D0E0F

This can be done using the following instructions:

    li	r9, 0
    lvsl	LE_MASK, r9, r9

Now we xor each byte with 3 using these instructions:

    vspltisb LE_TEMP, 0x03
    vxor	LE_MASK, LE_MASK, LE_TEMP

The value of the vector register is now:

    0x03020100070605040B0A09080F0E0D0C

If this mask is used in a vperm instruction, each word in the source
vector is byte-reversed, so in big-endian mode every word of the result
is stored in the destination buffer in little-endian order, which is
what LE_SWAP32 is meant to do.
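The sequence above can be modeled in scalar C. In the model below (helper names are hypothetical, not Nettle's), vperm selects result byte i as byte mask[i] & 0x1F of the 32-byte concatenation of its two source registers; LE_SWAP32 passes the same register twice:

```c
#include <stdint.h>
#include <string.h>

/* Scalar model of big-endian vperm: dst[i] = cat[mask[i] & 0x1F],
   where cat is the concatenation of the two 16-byte sources
   (here the same register twice, as in LE_SWAP32). */
static void vperm_model(uint8_t dst[16], const uint8_t src[16],
                        const uint8_t mask[16])
{
  uint8_t cat[32];
  memcpy(cat, src, 16);
  memcpy(cat + 16, src, 16);
  for (int i = 0; i < 16; i++)
    dst[i] = cat[mask[i] & 0x1F];
}

/* Model of the mask setup: lvsl with sh = 0 gives bytes 0x00..0x0F;
   vspltisb 3 followed by vxor then xors every byte with 3. */
static void make_le_mask(uint8_t mask[16])
{
  for (int i = 0; i < 16; i++)
    mask[i] = (uint8_t)(i ^ 3);
}
```

Feeding the mask 0x03020100070605040B0A09080F0E0D0C into vperm_model byte-reverses each 32-bit word of the source, matching the description above.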
On Mon, Sep 28, 2020 at 8:32 PM Niels Möller <nisse@lysator.liu.se> wrote:

> I think this deserves some comments, on what goes into the register in
> each step. Clever that the endian conversion corresponds to xoring the
> byte indices with 3.
Maamoun TK <maamoun.tk@googlemail.com> writes:

> Sure. According to Power ISA 2.07: "The lvsl and lvsr instructions can
> be used to create the permute control vector to be used by a subsequent
> vperm instruction."
>
> So the lvsl and lvsr instructions check the 'sh' value in order to fill
> the vector register. If 'sh' is 0, the vector register is populated as
> follows:
>
>     0x000102030405060708090A0B0C0D0E0F
>
> This can be done using the following instructions:
>
>     li	r9, 0
>     lvsl	LE_MASK, r9, r9
>
> Now we xor each byte with 3 using these instructions:
>
>     vspltisb LE_TEMP, 0x03
>     vxor	LE_MASK, LE_MASK, LE_TEMP
>
> The value of the vector register is now:
>
>     0x03020100070605040B0A09080F0E0D0C
Thanks. I've added some comments about this.
I've also extended the fat setup to check for altivec, using the logic
  hwcap = getauxval(AT_HWCAP);
  ...
  /* We also need VSX instructions, mainly for load and store. */
  features->have_altivec
    = ((hwcap & (PPC_FEATURE_HAS_ALTIVEC | PPC_FEATURE_HAS_VSX))
       == (PPC_FEATURE_HAS_ALTIVEC | PPC_FEATURE_HAS_VSX));
For now, gnu/linux only, patches to get detection working also on freebsd and aix welcome (I think needed fixes will be close to trivial, but I have no easy way to test, and I don't want to commit untested code).
For non-fat builds, the new code is disabled by default, with a configure option --enable-power-altivec.
And I've merged the changes to the master branch. I have some
work-in-progress code to do 2 or 4 chacha blocks in parallel, but I'm
not sure when I will get that into working shape.
Regards, /Niels
nettle-bugs@lists.lysator.liu.se