On Mon, Nov 30, 2020 at 12:37 PM Niels Möller nisse@lysator.liu.se wrote:
Niels Möller nisse@lysator.liu.se writes:
- Does the save and restore of registers look correct? I checked the abi spec, and the intention is to use the part of the 288 byte "Protected zone" below the stack pointer.
There are requirements that must be met when modifying the stack pointer register. Here are the relevant rules from https://refspecs.linuxfoundation.org/ELF/ppc64/PPC-elf64abi-1.9.html
- The stack pointer shall maintain quadword alignment.
- The stack pointer shall point to the first word of the lowest allocated stack frame, the "back chain" word. The stack shall grow downward, that is, toward lower addresses. The first word of the stack frame shall always point to the previously allocated stack frame (toward higher addresses), except for the first stack frame, which shall have a back chain of 0 (NULL).
- The stack pointer shall be decremented and the back chain updated atomically using one of the "Store Double Word with Update" instructions, so that the stack pointer always points to the beginning of a linked list of stack frames.
So to modify r1 you have to allocate an additional 8 bytes on the stack to hold the back chain (the old value of r1). The register store sequence would look like:
    li r6, 0x10           C set up some...
    li r7, 0x20           C ...useful...
    li r8, 0x30           C ...offsets
    li r9, 0x40           C ...offsets

    stdu r1, -0x50(r1)    C Save callee-save registers
    stvx v20, r6, r1
    stvx v21, r7, r1
    stvx v22, r8, r1
    stvx v23, r9, r1
Note that the allocated size is rounded up to a multiple of 16 bytes so that quadword stack alignment is maintained.
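To make the 0x50 figure concrete, here is a small C sketch of the size calculation. The helper name frame_size is made up for illustration and is not part of Nettle; it only assumes the 16-byte alignment rule quoted above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of Nettle): round a stack frame size
   up to the next multiple of 16 bytes (quadword alignment). */
static size_t
frame_size(size_t save_area, size_t back_chain)
{
    size_t raw = save_area + back_chain;
    return (raw + 15) & ~(size_t) 15;
}

/* Four 16-byte vector registers (v20-v23) plus the 8-byte back chain
   word give 72 bytes, which rounds up to 80 = 0x50. */
static void
check_frame_size(void)
{
    assert(frame_size(4 * 16, 8) == 0x50);
}
```

With this layout the back chain sits at offset 0 and the vector registers at offsets 0x10 to 0x40, so each vector slot is itself 16-byte aligned.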
And the register restore sequence would look like:
    lvx v20, r6, r1
    lvx v21, r7, r1
    lvx v22, r8, r1
    lvx v23, r9, r1
    addi r1, r1, 0x50
BTW, since no function is called while the stack pointer is modified, I think it's fine not to follow these rules and to keep the store and restore sequences as they are, without any modification.
2. The use of the QR macro means that there's no careful instruction-level interleaving of independent instructions. Do you think it's beneficial to do manual interleaving (like in chacha_2core.asm), or can it be left to the out-of-order execution logic to sort it out and execute instructions in parallel?
You'll get a performance benefit from interleaving the independent instructions in this case; I estimate the gain at around 20%-30%.
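To illustrate what interleaving means here, below is a rough C model, not the actual assembly. The names qr and qr2_interleaved are invented for this sketch. It runs the standard ChaCha quarter round on two independent sets of words, alternating the steps so that neighbouring operations never depend on each other. In C the compiler is free to reorder these anyway, so this only shows the dependency structure, not the speedup.

```c
#include <assert.h>
#include <stdint.h>

#define ROTL32(n, x) (((x) << (n)) | ((x) >> (32 - (n))))

/* Standard ChaCha quarter round on x[0..3] = (a, b, c, d).
   Every step depends on the one before it. */
static void
qr(uint32_t x[4])
{
    x[0] += x[1]; x[3] = ROTL32(16, x[3] ^ x[0]);
    x[2] += x[3]; x[1] = ROTL32(12, x[1] ^ x[2]);
    x[0] += x[1]; x[3] = ROTL32(8,  x[3] ^ x[0]);
    x[2] += x[3]; x[1] = ROTL32(7,  x[1] ^ x[2]);
}

/* Hypothetical interleaved variant: the same quarter round applied to
   two independent states, with the steps alternated so that adjacent
   operations have no data dependency on each other. */
static void
qr2_interleaved(uint32_t x[4], uint32_t y[4])
{
    x[0] += x[1];                    y[0] += y[1];
    x[3] = ROTL32(16, x[3] ^ x[0]);  y[3] = ROTL32(16, y[3] ^ y[0]);
    x[2] += x[3];                    y[2] += y[3];
    x[1] = ROTL32(12, x[1] ^ x[2]);  y[1] = ROTL32(12, y[1] ^ y[2]);
    x[0] += x[1];                    y[0] += y[1];
    x[3] = ROTL32(8, x[3] ^ x[0]);   y[3] = ROTL32(8, y[3] ^ y[0]);
    x[2] += x[3];                    y[2] += y[3];
    x[1] = ROTL32(7, x[1] ^ x[2]);   y[1] = ROTL32(7, y[1] ^ y[2]);
}
```

Both forms compute the same values; the interleaved one just gives the processor two independent dependency chains to work on at the same time.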
- Is there any clever way to construct the vector {0,1,2,3} in a register, instead of loading it from memory?
I can think of this method:
    li r10,0
    lvsl T0,0,r10      C 0x000102030405060708090A0B0C0D0E0F
    vupkhsb T0,T0      C 0x00000001000200030004000500060007
    vupkhsh T0,T0      C 0x00000000000000010000000200000003
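As a sanity check on the values in those comments, here is a small C model of the three vector instructions. It is a sketch of their documented semantics only: the vec128 type and the model_* names are invented for illustration, and the register is modelled as 16 bytes in big-endian element order.

```c
#include <assert.h>
#include <stdint.h>

/* A 128-bit vector register modelled as 16 bytes, b[0] being the most
   significant byte. */
typedef struct { uint8_t b[16]; } vec128;

/* lvsl for an effective address whose low 4 bits are sh:
   byte i of the result is sh + i. */
static vec128
model_lvsl(unsigned sh)
{
    vec128 r;
    for (int i = 0; i < 16; i++)
        r.b[i] = (uint8_t) ((sh & 15) + i);
    return r;
}

/* vupkhsb: sign-extend the 8 high-order bytes to 8 halfwords. */
static vec128
model_vupkhsb(vec128 a)
{
    vec128 r;
    for (int i = 0; i < 8; i++) {
        r.b[2 * i] = (a.b[i] & 0x80) ? 0xff : 0x00;
        r.b[2 * i + 1] = a.b[i];
    }
    return r;
}

/* vupkhsh: sign-extend the 4 high-order halfwords to 4 words. */
static vec128
model_vupkhsh(vec128 a)
{
    vec128 r;
    for (int i = 0; i < 4; i++) {
        uint8_t fill = (a.b[2 * i] & 0x80) ? 0xff : 0x00;
        r.b[4 * i] = fill;
        r.b[4 * i + 1] = fill;
        r.b[4 * i + 2] = a.b[2 * i];
        r.b[4 * i + 3] = a.b[2 * i + 1];
    }
    return r;
}

/* Read 32-bit word element i (big-endian element order). */
static uint32_t
model_word(vec128 a, int i)
{
    return ((uint32_t) a.b[4 * i] << 24) | ((uint32_t) a.b[4 * i + 1] << 16)
         | ((uint32_t) a.b[4 * i + 2] << 8) | (uint32_t) a.b[4 * i + 3];
}

/* The sequence from the mail: lvsl with address 0, then two unpacks. */
static vec128
model_iota(void)
{
    return model_vupkhsh(model_vupkhsb(model_lvsl(0)));
}
```

Calling model_iota() and reading the four words with model_word gives 0, 1, 2, 3, matching the last comment in the assembly above.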
regards, Mamone