On Fri, Nov 20, 2020 at 3:40 PM Niels Möller nisse@lysator.liu.se wrote:
nisse@lysator.liu.se (Niels Möller) writes:
It could likely be sped up further by processing 2, 3 or 4 blocks in parallel.
I've given 2 blocks in parallel a try, but it's not quite working yet. My work-in-progress code is below.
When I test it on the gcc112 machine, it fails with an illegal instruction (SIGILL) on this line, close to function entry:
	.globl _nettle_chacha_2core
	.type _nettle_chacha_2core,%function
	.align 5
_nettle_chacha_2core:
	addis 2,12,(.TOC.-_nettle_chacha_2core)@ha
	addi 2,2,(.TOC.-_nettle_chacha_2core)@l
	.localentry _nettle_chacha_2core, .-_nettle_chacha_2core

	li r8, 0x30
	vspltisw v1, 1
=>	vextractuw v1, v1, 0
I don't understand, from the manual, what's wrong with this. The intention of this piece of code is just to construct the value {1, 0, 0, 0} in one of the vector registers. Maybe there's a better way to do that?
GCC112 is a POWER8 machine. According to the POWER manual, vextractuw is a POWER9 (ISA 3.0) instruction, so a POWER8 core does not recognize it and raises SIGILL.
POWER8 manual: https://openpowerfoundation.org/?resource_lib=power8-processor-users-manual
POWER9 manual: https://openpowerfoundation.org/?resource_lib=power9-processor-users-manual
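
As for building the constant without vextractuw, one possibility (untested on real hardware, so treat it as a sketch) is to splat 1 and 0 into two registers with vspltisw and then combine them with vsldoi, both of which are available on POWER8. The C below is only a software model of those two instructions, written to check the byte arithmetic; the type vreg and the helpers word and one_in_last_word are names made up for this illustration and are not part of nettle. Bytes and words are numbered left to right as in the ISA description, so which end counts as "element 0" on a little-endian system is an assumption you would need to verify.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Software model of a 128-bit vector register. b[0] is the leftmost
   (most significant) byte, following the ISA's numbering. */
typedef struct { uint8_t b[16]; } vreg;

/* Model of "vspltisw VRT, SIM": replicate the sign-extended
   immediate into all four 32-bit words. */
static vreg vspltisw(int sim) {
  vreg r;
  uint32_t w = (uint32_t)(int32_t)sim;
  for (int i = 0; i < 4; i++) {
    r.b[4 * i + 0] = (uint8_t)(w >> 24);
    r.b[4 * i + 1] = (uint8_t)(w >> 16);
    r.b[4 * i + 2] = (uint8_t)(w >> 8);
    r.b[4 * i + 3] = (uint8_t)w;
  }
  return r;
}

/* Model of "vsldoi VRT, VRA, VRB, SHB": concatenate VRA || VRB and
   take the 16 bytes starting at byte offset SHB. */
static vreg vsldoi(vreg a, vreg b, unsigned shb) {
  uint8_t cat[32];
  vreg r;
  assert(shb <= 15);          /* SHB is a 4-bit field */
  memcpy(cat, a.b, 16);
  memcpy(cat + 16, b.b, 16);
  memcpy(r.b, cat + shb, 16);
  return r;
}

/* Read word i, counting from the left end of the register. */
static uint32_t word(vreg v, int i) {
  return ((uint32_t)v.b[4 * i] << 24) | ((uint32_t)v.b[4 * i + 1] << 16) |
         ((uint32_t)v.b[4 * i + 2] << 8) | (uint32_t)v.b[4 * i + 3];
}

/* Hypothetical helper modelling the sequence
       vspltisw v1, 1
       vspltisw v0, 0
       vsldoi   v1, v0, v1, 4
   which leaves 1 in the rightmost word and 0 in the other three. */
static vreg one_in_last_word(void) {
  vreg ones = vspltisw(1);
  vreg zero = vspltisw(0);
  return vsldoi(zero, ones, 4);
}
```

In this model, one_in_last_word() yields the words 0, 0, 0, 1 from left to right, while vsldoi(ones, zero, 12) puts the 1 in the leftmost word instead. Which of the two matches {1, 0, 0, 0} in your code depends on how the register is later stored or permuted.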
Jeff