Re: PPC chacha

20 Nov 2020

On Fri, Nov 20, 2020 at 3:40 PM Niels Möller nisse@lysator.liu.se wrote:
...
nisse@lysator.liu.se (Niels Möller) writes:
...
It could likely be sped up further by processing 2, 3 or 4 blocks in
parallel.
I've given 2 blocks in parallel a try, but it's not quite working yet. My
work-in-progress code is below.
When I test it on the gcc112 machine, it fails with an illegal
instruction (SIGILL) on this line, close to function entry:
    .globl _nettle_chacha_2core
    .type _nettle_chacha_2core,%function
    .align 5
_nettle_chacha_2core:
    addis 2,12,(.TOC.-_nettle_chacha_2core)@ha
    addi 2,2,(.TOC.-_nettle_chacha_2core)@l
    .localentry _nettle_chacha_2core, .-_nettle_chacha_2core
    li      r8, 0x30
    vspltisw v1, 1
=>  vextractuw v1, v1, 0
I don't understand, from the manual, what's wrong with this. The
intention of this piece of code is just to construct the value {1, 0, 0,
0} in one of the vector registers. Maybe there's a better way to do
that?
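
As a rough illustration of one way to build {1, 0, 0, 0} using only instructions that predate POWER9, the C sketch below models the register effect of vspltisw followed by vsldoi (both are older VMX/AltiVec instructions). This is not code from the thread: the `vreg` type and the `model_*` helpers are hypothetical names, bytes are numbered big-endian as in the Power ISA, and which word counts as "element 0" on a little-endian build is left as an open assumption.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* A 128-bit vector register modelled as 16 bytes; byte 0 is the most
   significant byte (Power ISA big-endian numbering). Hypothetical type. */
typedef struct { uint8_t b[16]; } vreg;

/* Model of vspltisw: sign-extend a 5-bit immediate and copy it into all
   four 32-bit words. */
static vreg model_vspltisw(int simm)
{
  vreg r;
  uint32_t w;
  int i;

  assert(simm >= -16 && simm <= 15);
  w = (uint32_t)(int32_t)simm;
  for (i = 0; i < 4; i++) {
    r.b[4 * i + 0] = (uint8_t)(w >> 24);
    r.b[4 * i + 1] = (uint8_t)(w >> 16);
    r.b[4 * i + 2] = (uint8_t)(w >> 8);
    r.b[4 * i + 3] = (uint8_t)w;
  }
  return r;
}

/* Model of vsldoi: concatenate a || b (32 bytes) and keep the 16 bytes
   starting at byte offset sh. */
static vreg model_vsldoi(vreg a, vreg b, unsigned sh)
{
  uint8_t cat[32];
  vreg r;

  assert(sh < 16);
  memcpy(cat, a.b, 16);
  memcpy(cat + 16, b.b, 16);
  memcpy(r.b, cat + sh, 16);
  return r;
}

/* Read word i (0 = leftmost) as a big-endian 32-bit value. */
static uint32_t model_word(vreg v, unsigned i)
{
  assert(i < 4);
  return ((uint32_t)v.b[4 * i] << 24) | ((uint32_t)v.b[4 * i + 1] << 16)
       | ((uint32_t)v.b[4 * i + 2] << 8) | (uint32_t)v.b[4 * i + 3];
}

/* Corresponds to the sequence:
     vspltisw v1, 1
     vspltisw v0, 0
     vsldoi   v1, v1, v0, 12
   The last word of the all-ones vector moves to the front, and the
   remaining three words are filled from the zero vector. */
static vreg make_one_zero_zero_zero(void)
{
  vreg ones = model_vspltisw(1);
  vreg zeros = model_vspltisw(0);
  return model_vsldoi(ones, zeros, 12);
}
```

In this model, `make_one_zero_zero_zero()` yields a register whose leftmost word is 1 and whose other three words are 0. Whether that matches the lane the counter increment needs would still have to be checked against the actual load and store layout in the assembly.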
GCC112 is a POWER8 machine. According to the POWER manual, vextractuw
is a POWER9 instruction, which would explain the SIGILL.
POWER8 manual: https://openpowerfoundation.org/?resource_lib=power8-processor-users-manual
POWER9 manual: https://openpowerfoundation.org/?resource_lib=power9-processor-users-manual
Jeff
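
Since the failure comes down to running a POWER9 instruction on a POWER8 CPU, a runtime check can guard such code paths. The sketch below is only an assumption about how that might look on Linux: it tests the ISA 3.0 bit in the AT_HWCAP2 auxiliary vector value. The bit value mirrors PPC_FEATURE2_ARCH_3_00 from the kernel's powerpc `<asm/cputable.h>`; the macro and function names here are hypothetical, and the constant is repeated locally so the snippet compiles on any host.

```c
#include <stdbool.h>

/* Assumed to match PPC_FEATURE2_ARCH_3_00 in Linux's powerpc
   <asm/cputable.h>; redefined under a hypothetical name so this sketch
   does not depend on powerpc-only headers. */
#define SKETCH_HWCAP2_ARCH_3_00 0x00800000UL

/* Returns true when the given AT_HWCAP2 value advertises Power ISA 3.0
   (POWER9), the level at which vextractuw is available. */
static bool hwcap2_has_isa_3_00(unsigned long hwcap2)
{
  return (hwcap2 & SKETCH_HWCAP2_ARCH_3_00) != 0;
}

/* Hypothetical dispatch helper: pick the POWER9 path only when the CPU
   reports ISA 3.0, otherwise fall back to a POWER8-safe sequence. */
static const char *choose_code_path(unsigned long hwcap2)
{
  return hwcap2_has_isa_3_00(hwcap2) ? "power9" : "power8-safe";
}
```

On Linux/powerpc the argument would typically come from `getauxval(AT_HWCAP2)` in `<sys/auxv.h>`. On a POWER8 machine like the one described for gcc112, the ISA 3.0 bit should be clear, so the fallback path would be chosen.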
