On Fri, Nov 20, 2020 at 3:40 PM Niels Möller nisse@lysator.liu.se wrote:
nisse@lysator.liu.se (Niels Möller) writes:
It could likely be sped up further by processing 2, 3 or 4 blocks in parallel.
I've given 2 blocks in parallel a try, but it's not quite working yet. My work-in-progress code is below.
When I test it on the gcc112 machine, it fails with an illegal instruction (SIGILL) on this line, close to function entry:
        .globl _nettle_chacha_2core
        .type _nettle_chacha_2core,%function
        .align 5
_nettle_chacha_2core:
        addis 2,12,(.TOC.-_nettle_chacha_2core)@ha
        addi 2,2,(.TOC.-_nettle_chacha_2core)@l
        .localentry _nettle_chacha_2core, .-_nettle_chacha_2core

        li r8, 0x30
        vspltisw v1, 1
=>      vextractuw v1, v1, 0
I don't understand, from the manual, what's wrong with this. The intention of this piece of code is just to construct the value {1, 0, 0, 0} in one of the vector registers. Maybe there's a better way to do that?
vextractuw is a Power9 (ISA 3.0) instruction and gcc112 is a Power8 system. The processor does not support the instruction, hence the SIGILL.
gcc135 is a Power9 system.
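As for a way to build {1, 0, 0, 0} that also runs on Power8, one possibility (an untested sketch, not a verified fix) is to splat 1 and 0 into two registers with vspltisw and combine them with vsldoi, both of which are base VMX instructions. The C model below only mimics the documented semantics of those two instructions, using the ISA's big-endian byte and word numbering. The names model_vspltisw, model_vsldoi, model_word and build_one_zero_zero_zero are made up for illustration, the register numbers in the comments are arbitrary, and which word should hold the 1 on a little-endian system depends on how the rest of the code loads its data.

```c
#include <stdint.h>
#include <string.h>

/* A 128-bit vector register modelled as 16 bytes; byte 0 is the
   leftmost (most significant) byte, as in the ISA's numbering. */
typedef struct { uint8_t b[16]; } vec128;

/* Model of "vspltisw vD, SIMM": the sign-extended 5-bit immediate
   (caller must pass -16..15) is copied into all four words. */
vec128 model_vspltisw(int simm)
{
  vec128 r;
  uint32_t w = (uint32_t) (int32_t) simm;
  for (int i = 0; i < 4; i++) {
    r.b[4*i]     = (uint8_t) (w >> 24);
    r.b[4*i + 1] = (uint8_t) (w >> 16);
    r.b[4*i + 2] = (uint8_t) (w >> 8);
    r.b[4*i + 3] = (uint8_t) w;
  }
  return r;
}

/* Model of "vsldoi vD, vA, vB, SHB": bytes SHB..SHB+15 of the
   32-byte concatenation vA || vB. */
vec128 model_vsldoi(vec128 a, vec128 b, unsigned shb)
{
  uint8_t cat[32];
  vec128 r;
  memcpy(cat, a.b, 16);
  memcpy(cat + 16, b.b, 16);
  memcpy(r.b, cat + (shb & 15), 16);
  return r;
}

/* Read word element i (0 = leftmost) of the modelled register. */
uint32_t model_word(vec128 v, unsigned i)
{
  return ((uint32_t) v.b[4*i] << 24) | ((uint32_t) v.b[4*i + 1] << 16)
       | ((uint32_t) v.b[4*i + 2] << 8) | (uint32_t) v.b[4*i + 3];
}

/* Hypothetical sequence: shifting {1,1,1,1} || {0,0,0,0} left by
   12 bytes keeps the last 1 and pulls in three zero words. */
void build_one_zero_zero_zero(uint32_t out[4])
{
  vec128 ones  = model_vspltisw(1);          /* vspltisw v1, 1 */
  vec128 zeros = model_vspltisw(0);          /* vspltisw v0, 0 */
  vec128 r = model_vsldoi(ones, zeros, 12);  /* vsldoi v1, v1, v0, 12 */
  for (unsigned i = 0; i < 4; i++)
    out[i] = model_word(r, i);
}
```

Swapping the operands and using a shift of 4 (vsldoi v1, v0, v1, 4) would instead put the 1 in the rightmost word, which may be what a little-endian build wants.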
Thanks, David