Re: [PATCH v2 1/2] powerpc64: Add optimized assembly for sha256-compress-n

7 May 2024

      On Sun, 2024-05-05 at 16:10 +0200, Niels Möller wrote:
...
Eric Richter erichte@linux.ibm.com writes:
...
This patch introduces an optimized powerpc64 assembly
implementation for
sha256-compress-n. This takes advantage of the vshasigma
instruction, as
well as unrolling loops to best take advantage of running
instructions
in parallel.
Thanks. I'm now having a closer read of the assembly code. Comments
below.
...
+C ROUND(A B C D E F G H R EXT)
+define(`ROUND', `

vadduwm	VT1, VK, IV($9)               C VT1: k+W
vadduwm	VT4, $8, VT1                  C VT4: H+k+W

lxvw4x	VSR(VK), TK, K                C Load Key
addi	TK, TK, 4	              C Increment Pointer

to next key

vadduwm	VT2, $4, $8	              C VT2: H+D
vadduwm	VT2, VT2, VT1                 C VT2:

H+D+k+W

vshasigmaw	SIGE, $5, 1, 0b1111   C Sigma(E)  Se
vshasigmaw	SIGA, $1, 1, 0        C Sigma(A)  Sa

vxor	VT3, $2, $3                   C VT3: b^c
vsel	VT0, $7, $6, $5	              C VT0: Ch.
vsel	VT3, $3, $1, VT3              C VT3: Maj(a,b,c)

vadduwm	VT4, VT4, VT0                 C VT4: Hkw +

Ch.

vadduwm	VT3, VT3, VT4                 C VT3: HkW +

Ch. + Maj.

vadduwm	VT0, VT0, VT2                 C VT0: Ch. +

DHKW

vadduwm	$8, SIGE, SIGA                C Anext: Se
Sa
vadduwm	$4, VT0, SIGE                 C Dnext: Ch.
DHKW + Se
vadduwm	$8, $8, VT3                   C Anext:

Se+Sa+HkW+Ch.+Maj.

C Schedule (data) for 16th round in future
C Extend W[i]
ifelse(`$10', `1', `
vshasigmaw	SIGE, IV($9 + 14), 0, 0b1111

vshasigmaw	SIGA, IV($9 + 1), 0, 0b0000

vadduwm		IV($9), IV($9), SIGE

vadduwm		IV($9), IV($9), SIGA

vadduwm		IV($9), IV($9), IV($9 + 9)

')

+')
I think it would be a bit simpler to take out the extend logic to its
own macro.
...
+define(`EXTENDROUND',	`ROUND($1, $2, $3, $4, $5, $6, $7, $8, $9,
1)')
If you do that, then you would define

  define(`EXTENDROUND',	`ROUND($1, $2, $3, $4, $5, $6, $7, $8, $9)
EXTEND($9)')
(In other related code, input expansion is done at the beginning of a
round iteration rather than at the end, but doing at the end like you
do
may be better scheduling).
Makes sense, I'll move that extend logic into its own macro.
You are correct, the expansion logic was moved to the end of the round
for an improvement to scheduling on the CPU. The vshasigma instructions
take more cycles and are scheduled on a different unit than the other
arithmetic operations. This allows those to work in parallel with the
beginning of the next round, as there are no dependent registers until
the next vshasigma instructions in-round.
...
...
+define(`LOAD', `

IF_BE(`lxvw4x	VSR(IV($1)), 0, INPUT')
IF_LE(`
lxvd2x	VSR(IV($1)), 0, INPUT

vperm	IV($1), IV($1), IV($1), VT0

')
addi	INPUT, INPUT, 4

+')

+define(`DOLOADS', `

IF_LE(`DATA_LOAD_VEC(VT0, .load_swap, T1)')

Could you have a dedicated register for the permutation constant, and
load it only once at function entry? If you have general registers to
spare, it could also make sense to use, e.g., three registers for the
contant values 16, 32, 48, and use for indexing. Then you don't need
to
update the INPUT pointer as often, and you can use the same constants
for other load/store sequences as well.
There are plenty of GPRs to spare, I will test and bench a few options
for using more GPRs as indexes.
As for VRs, unfortunately the current implementation uses all 32 VRs:
 16 for W[i]
 8 for state
 7 for round arithmetic (two of these specifically for sigma, to avoid
a dependency bubble)
 1 for storing the key constant K
That said, I'm going to experiment with some VSX instructions to see if
it is possible to spill over certain operations into VSRs, without
needing an explicit copy back from VSR to VR.
...
...

LOAD(0)
LOAD(1)
LOAD(2)
LOAD(3)

...
+PROLOGUE(_nettle_sha256_compress_n)

cmpwi	0, NUMBLOCKS, 0
ble	0, .done
mtctr	NUMBLOCKS

C Store non-volatile registers
subi	SP, SP, 64+(12*16)
std	T0,	24(SP)
std	T1,	16(SP)
std	COUNT,	8(SP)

For save/restore of registers, I prefer to use the register names,
not
the defined symbols. And T0, T1, COUNT are defined to use r7, r8,
r10,
which *are* volatile, right?
Ah yep, good catch!
...
Does the data stored fit in the 288 byte "protected zone"? If so,
probably best to not modify the stack pointer.
At the moment it should as I'm currently moving the stack pointer 256
bytes. With the removal of the volatile GPRs and the addition of new
index GPRs, I think it should still fit in the 288 byte zone. I will
include this in the next version if it fits.
...
...

li	T0, 32
stvx	v20, 0, SP
subi	T0, T0, 16
stvx	v21, T0, SP

Here it would also help a bit to allocate constants 16, 32, 48 in
registers.
...

subi	T0, T0, 16
stvx	v22, T0, SP
subi	T0, T0, 16
stvx	v23, T0, SP
subi	T0, T0, 16
stvx	v24, T0, SP
subi	T0, T0, 16
stvx	v25, T0, SP
subi	T0, T0, 16
stvx	v26, T0, SP
subi	T0, T0, 16
stvx	v27, T0, SP
subi	T0, T0, 16
stvx	v28, T0, SP
subi	T0, T0, 16
stvx	v29, T0, SP
subi	T0, T0, 16
stvx	v30, T0, SP
subi	T0, T0, 16
stvx	v31, T0, SP

C Load state values
li	T0, 16
lxvw4x	VSR(VSA), 0, STATE	C VSA contains A,B,C,D
lxvw4x	VSR(VSE), T0, STATE	C VSE contains E,F,G,H

+.loop:

li	TK, 0
lxvw4x	VSR(VK), TK, K
addi	TK, TK, 4

DOLOADS

C "permute" state from VSA containing A,B,C,D into

VSA,VSB,VSC,VSD
Can you give a bit more detail on this permutation? Does the main
round
operations only use 32 bits each from the state registers? There's no
reasonable way to use a more compact representation?
Correct, state registers VSA, VSB, VSC, etc only use the first 32-bits
of their respective vector register for arithmetic in a round (64-bits
in the logically identical sha512).
Unfortunately compacting state values A,B,C,D into a single VR
introduces some performance problems when trying to do steps like:
   vsel   VT3, $3, $1, VT3  C VT3: Maj(a,b,c)
as this would require additional instructions extracting A,B,C into
three VRs.
So this is largely a scalar implementation using vector instructions,
as there is no scalar equivalent to vshasigma.
That said, this "permutation" allows the loading of four state values
in a single load instruction rather than four separate loads, with the
tradeoff of needing more arithmetic instructions to permute them into
place. I think this should end up being faster (combined with parallel
scheduling of the non-dependent vsldoi shifts), but I will run some
tests to confirm this.
...
...

vsldoi	VSB, VSA, VSA, 4
vsldoi	VSF, VSE, VSE, 4

vsldoi	VSC, VSA, VSA, 8
vsldoi	VSG, VSE, VSE, 8

vsldoi	VSD, VSA, VSA, 12
vsldoi	VSH, VSE, VSE, 12

EXTENDROUNDS
EXTENDROUNDS
EXTENDROUNDS
NOEXTENDROUNDS

C Reload initial state from stack
li	T0, 16
lxvw4x	VSR(VT0), 0, STATE	C VSA contains A,B,C,D
lxvw4x	VSR(VT1), T0, STATE	C VSE contains E,F,G,H

C Repack VSA,VSB,VSC,VSD into VSA,VSE for storing
vmrghw	VSA, VSA, VSB
vmrghw	VSC, VSC, VSD
vmrghw	VSE, VSE, VSF
vmrghw	VSG, VSG, VSH

xxmrghd	VSR(VSA), VSR(VSA), VSR(VSC)
xxmrghd	VSR(VSE), VSR(VSE), VSR(VSG)

vadduwm	VSA, VSA, VT0
vadduwm	VSE, VSE, VT1

It seems unfortunate to have to do this conversion for each iteration
of
the loop, it would be nice if state could be converted to the
most efficient form before entering the loop, and not converted back
until after loop exit. But we probably don't have enoguh registers to
keep the old state exploded into many registers. And load/store of
exploded state doesn't seem that attractive either.
The necessary load/store per loop iteration is probably the worst
offender for performance, as the merges can be scheduled in parallel.
This is where I'd like to see if we can make the best use of VSRs to
temporarily hold the old state, as a VR <-> VSR move *should* be faster
than a load/store to memory.
...
...

li	T0, 16
stxvw4x	VSR(VSA), 0, STATE
stxvw4x	VSR(VSE), T0, STATE

bdnz	.loop

Regards,
/Niels Möller
Thanks for the review!
- Eric

2026

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

2005

2004

2003

2002

Re: [PATCH v2 1/2] powerpc64: Add optimized assembly for sha256-compress-n