Re: PPC chacha

30 Nov 2020

On Mon, Nov 30, 2020 at 12:37 PM Niels Möller nisse@lysator.liu.se wrote:
...
Niels Möller nisse@lysator.liu.se writes:

Does the save and restore of registers look correct? I checked the
abi spec, and the intention is to use the part of the 288 byte
"Protected zone" below the stack pointer.

There are requirements that must be met when modifying the stack pointer
register. Here are the relevant rules from
https://refspecs.linuxfoundation.org/ELF/ppc64/PPC-elf64abi-1.9.html
- The stack pointer shall maintain quadword alignment.
- The stack pointer shall point to the first word of the lowest allocated
stack frame, the "back chain" word. The stack shall grow downward, that is,
toward lower addresses. The first word of the stack frame shall always
point to the previously allocated stack frame (toward higher addresses),
except for the first stack frame, which shall have a back chain of 0 (NULL).
- The stack pointer shall be decremented and the back chain updated
atomically using one of the "Store Double Word with Update" instructions,
so that the stack pointer always points to the beginning of a linked list
of stack frames.
So to modify r1, you have to allocate an additional 8 bytes on the stack to
store the old value of r1. The register store sequence will look like:
        li      r6, 0x10        C set up some...
        li      r7, 0x20        C ...useful...
        li      r8, 0x30        C ...offsets
        li      r9, 0x40        C ...offsets
        stdu    r1, -0x50(r1)   C Allocate frame, store back chain
        stvx    v20, r6, r1     C Save callee-save registers
        stvx    v21, r7, r1
        stvx    v22, r8, r1
        stvx    v23, r9, r1
Note that the allocated size is rounded up to a multiple of 16 bytes, so
that quadword stack alignment is maintained.

The register restore sequence will look like:
        lvx     v20, r6, r1
        lvx     v21, r7, r1
        lvx     v22, r8, r1
        lvx     v23, r9, r1
        addi    r1, r1, 0x50
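
To make the frame arithmetic explicit, here is a small C sketch of how the
0x50 frame size and the 0x10..0x40 slot offsets fall out of the rules above.
The helper names (round_up_quadword, vreg_slot_offset, frame_size) are made
up for illustration and are not part of nettle. The sketch assumes the back
chain doubleword sits at offset 0 and that the vector slots start at the next
quadword boundary, since stvx and lvx ignore the low four bits of the address.

```c
#include <assert.h>
#include <stddef.h>

#define BACK_CHAIN_SIZE 8   /* doubleword written by stdu at 0(r1) */
#define VREG_SIZE       16  /* one 128-bit vector register */
#define QUADWORD        16  /* required stack pointer alignment */

/* Round n up to the next multiple of 16. */
static size_t round_up_quadword(size_t n)
{
    return (n + QUADWORD - 1) & ~(size_t)(QUADWORD - 1);
}

/* Offset from the new r1 of the i-th saved vector register (i = 0 for v20).
   Slots start at the first quadword boundary after the back chain. */
static size_t vreg_slot_offset(size_t i)
{
    return round_up_quadword(BACK_CHAIN_SIZE) + i * VREG_SIZE;
}

/* Number of bytes to subtract from r1 when saving nregs vector registers. */
static size_t frame_size(size_t nregs)
{
    assert(nregs > 0);
    return round_up_quadword(vreg_slot_offset(nregs));
}
```

With four registers, frame_size(4) gives 0x50, and vreg_slot_offset(0) through
vreg_slot_offset(3) give 0x10, 0x20, 0x30 and 0x40, matching the values loaded
into r6..r9.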
BTW, since no function is called while the stack pointer register is
modified, I think it's fine not to follow these rules and to keep the store
and restore sequences as they are, without any modification.
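
For comparison, a rough C sketch of the bookkeeping when r1 is left untouched
and the registers are saved in the 288-byte protected zone instead. The
negative offsets used here (-0x10, -0x20, ...) are only an assumed layout for
illustration, not necessarily the ones in the patch, and the function names
are hypothetical.

```c
#include <assert.h>

#define PROTECTED_ZONE 288  /* bytes below r1 that a leaf function may use */
#define VREG_SIZE      16   /* one 128-bit vector register */

/* Offset relative to r1 of the i-th saved register (i = 0 gives -16).
   Since r1 is quadword aligned, every such slot is quadword aligned too. */
static long protected_zone_offset(long i)
{
    assert(i >= 0);
    return -(i + 1) * VREG_SIZE;
}

/* Nonzero if nregs vector registers fit in the protected zone. */
static int fits_in_protected_zone(long nregs)
{
    return nregs * VREG_SIZE <= PROTECTED_ZONE;
}
```

Four registers need 64 bytes, well within the zone. This only holds as long as
the function stays a leaf, because a callee is free to overwrite that area.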
2. The use of the QR macro means that there's no careful
   instruction-level interleaving of independent instructions. Do you
   think it's beneficial to do manual interleaving (like in
   chacha_2core.asm), or can it be left to the out-of-order execution
   logic to sort it out and execute instructions in parallel?
You'll get a performance benefit from interleaving the independent
instructions in this case. I estimate the performance increase at
around 20%-30%.
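
To illustrate what interleaving means here, a scalar C model of the ChaCha
quarter round, next to a version that alternates the steps of two independent
quarter rounds. This is only a sketch of the instruction ordering, not the
vector assembly itself: a C compiler is free to reschedule it, and it says
nothing about the actual speedup. The names qr and qr_interleaved are made up
for this example.

```c
#include <stdint.h>

static uint32_t rotl32(uint32_t x, unsigned n)
{
    return (x << n) | (x >> (32 - n));
}

/* One ChaCha quarter round on s[0..3] = (a, b, c, d). Every step depends
   on the previous one, so there is no parallelism inside a single chain. */
static void qr(uint32_t s[4])
{
    s[0] += s[1]; s[3] ^= s[0]; s[3] = rotl32(s[3], 16);
    s[2] += s[3]; s[1] ^= s[2]; s[1] = rotl32(s[1], 12);
    s[0] += s[1]; s[3] ^= s[0]; s[3] = rotl32(s[3], 8);
    s[2] += s[3]; s[1] ^= s[2]; s[1] = rotl32(s[1], 7);
}

/* The same work on two independent states x and y (which must not alias),
   with the steps alternated so that adjacent operations never depend on
   each other. */
static void qr_interleaved(uint32_t x[4], uint32_t y[4])
{
    x[0] += x[1];             y[0] += y[1];
    x[3] ^= x[0];             y[3] ^= y[0];
    x[3] = rotl32(x[3], 16);  y[3] = rotl32(y[3], 16);

    x[2] += x[3];             y[2] += y[3];
    x[1] ^= x[2];             y[1] ^= y[2];
    x[1] = rotl32(x[1], 12);  y[1] = rotl32(y[1], 12);

    x[0] += x[1];             y[0] += y[1];
    x[3] ^= x[0];             y[3] ^= y[0];
    x[3] = rotl32(x[3], 8);   y[3] = rotl32(y[3], 8);

    x[2] += x[3];             y[2] += y[3];
    x[1] ^= x[2];             y[1] ^= y[2];
    x[1] = rotl32(x[1], 7);   y[1] = rotl32(y[1], 7);
}
```

Both functions compute the same values. The interleaved form just gives the
processor two independent dependency chains to overlap, without relying on the
out-of-order window to find them.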
...

Is there any clever way to construct the vector {0,1,2,3} in a
register, instead of loading it from memory?

I can think of this method:
        li      r10, 0
        lvsl    T0, 0, r10      C 0x000102030405060708090A0B0C0D0E0F
        vupkhsb T0, T0          C 0x00000001000200030004000500060007
        vupkhsh T0, T0          C 0x00000000000000010000000200000003
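
As a sanity check of the three steps, a small C model of what each
instruction does to the register contents. It uses big-endian element
numbering (element 0 is the leftmost byte) and does not address little-endian
mode. The model_* and build_0123 names are hypothetical helpers, not real
intrinsics.

```c
#include <stdint.h>

/* Model of lvsl with an effective address whose low four bits are zero:
   the result is the byte sequence 0x00, 0x01, ..., 0x0F. */
static void model_lvsl_zero(uint8_t out[16])
{
    for (int i = 0; i < 16; i++)
        out[i] = (uint8_t)i;
}

/* Model of vupkhsb: sign-extend the eight high-order (leftmost) bytes
   to halfwords. */
static void model_vupkhsb(const uint8_t in[16], int16_t out[8])
{
    for (int i = 0; i < 8; i++)
        out[i] = (int8_t)in[i];
}

/* Model of vupkhsh: sign-extend the four high-order (leftmost) halfwords
   to words. */
static void model_vupkhsh(const int16_t in[8], int32_t out[4])
{
    for (int i = 0; i < 4; i++)
        out[i] = in[i];
}

/* Chain the three steps, as in the li/lvsl/vupkhsb/vupkhsh sequence. */
static void build_0123(int32_t out[4])
{
    uint8_t bytes[16];
    int16_t halves[8];

    model_lvsl_zero(bytes);
    model_vupkhsb(bytes, halves);
    model_vupkhsh(halves, out);
}
```

Sign extension is harmless here because all the bytes involved are below 0x80,
so the four resulting words are 0, 1, 2 and 3.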
regards,
Mamone
