Eric Richter erichte@linux.ibm.com writes:
Dropping p8 support allows the use of the lxvb16x instruction, whose result does not need to be permuted; however, that too is only a negligible performance improvement, at the cost of dropping a whole CPU set. So I see a few options:
A) leave as-is, and consider storing the permute mask in a VSX register
B) drop p8 support and use lxvb16x
C) have a compile-time switch to use the permute on p8, and the single instruction on p9 and up.
I'd say leave as is (unless we find some way to get a spare vector register).
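To make the tradeoff concrete, here is a minimal sketch of the load-side difference (untested; lxvb16x is ISA 3.0, i.e., p9 and up, and I'm borrowing the IV, TC0, INPUT and VT0 names from the patch):

	C p8, little-endian: doubleword load followed by a byte permute,
	C which needs the swap mask kept live in VT0
	lxvd2x VSR(IV(0)), TC0, INPUT
	vperm IV(0), IV(0), IV(0), VT0

	C p9 and up: single byte-reversed load, no mask register needed
	lxvb16x VSR(IV(0)), TC0, INPUT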
v3:
- use protected zone instead of allocating stack space
- add GPR constants for multiples of 4 for the loads
  - around +3.4 MB/s for sha256 update
- move extend logic to its own macro called by EXTENDROUND
- use 8 VSX registers to store previous state instead of the stack
  - around +11.0 MB/s for sha256 update
I think I'd be happy to merge this version, and do any incremental improvements on top of that. Some comments below:
+C ROUND(A B C D E F G H R EXT)
+define(`ROUND', `
+	vadduwm VT1, VK, IV($9)		C VT1: k+W
+	vadduwm VT4, $8, VT1		C VT4: H+k+W
+	lxvw4x VSR(VK), TK, K		C Load Key
+	addi TK, TK, 4			C Increment Pointer to next key
+	vadduwm VT2, $4, $8		C VT2: H+D
+	vadduwm VT2, VT2, VT1		C VT2: H+D+k+W
Could the above two instructions be changed to
	vadduwm VT2, VT4, $4		C Should be the same, (H+k+W) + D
(which would need one less register)? I realize there's a slight change in the dependency chain. Do you know how many cycles one of these rounds takes, and what the bottleneck is? (I would guess either latency of the dependency chain between rounds, throughput of one of the execution units, or the instruction issue rate.)
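I.e., the round would end with something like this (untested sketch, same register names as in the patch):

	vadduwm VT1, VK, IV($9)		C VT1: k+W
	vadduwm VT4, $8, VT1		C VT4: H+k+W
	lxvw4x VSR(VK), TK, K		C Load Key
	addi TK, TK, 4			C Increment Pointer to next key
	vadduwm VT2, VT4, $4		C VT2: (H+k+W)+D, reusing VT4,
					C replacing the two vadduwm above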
+define(`LOAD', `
+	IF_BE(`lxvw4x VSR(IV($1)), m4_unquote(TC`'eval(($1 % 4) * 4)), INPUT')
+	IF_LE(`
+	lxvd2x VSR(IV($1)), m4_unquote(TC`'eval(($1 % 4) * 4)), INPUT
+	vperm IV($1), IV($1), IV($1), VT0
+	')
+')

+define(`DOLOADS', `
+	IF_LE(`DATA_LOAD_VEC(VT0, .load_swap, T1)')
+	LOAD(0)
+	LOAD(1)
+	LOAD(2)
+	LOAD(3)
If you pass the right TCx register as argument to the load macro, you don't need the m4 eval thing, which could make it a bit more readable, imo.
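Something like this (untested; I'm assuming TC0/TC4/TC8/TC12 are the register names the eval() currently expands to):

	define(`LOAD', `
		C $2 is the index register, passed explicitly
		IF_BE(`lxvw4x VSR(IV($1)), $2, INPUT')
		IF_LE(`
		lxvd2x VSR(IV($1)), $2, INPUT
		vperm IV($1), IV($1), IV($1), VT0
		')
	')

	define(`DOLOADS', `
		IF_LE(`DATA_LOAD_VEC(VT0, .load_swap, T1)')
		LOAD(0, TC0)
		LOAD(1, TC4)
		LOAD(2, TC8)
		LOAD(3, TC12)
	')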
+	C Store non-volatile registers
+	li T0, -8
+	li T1, -24
+	stvx v20, T0, SP
+	stvx v21, T1, SP
+	subi T0, T0, 32
+	subi T1, T1, 32
This could probably be arranged with fewer instructions by having one register that is decremented as we move down in the guard area, and registers with constant values for indexing.
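For example (untested sketch; since stvx ignores the low four bits of the effective address, the explicit 16-byte offsets below match what the current -8/-24 offsets round down to):

	li T1, -16
	subi T0, SP, 16
	stvx v20, 0, T0		C at SP-16
	stvx v21, T1, T0	C at SP-32
	subi T0, T0, 32
	stvx v22, 0, T0		C at SP-48
	stvx v23, T1, T0	C at SP-64
	C ...continue decrementing T0 for v24-v31

That's three instructions per pair of stores instead of four.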
+	C Reload initial state from VSX registers
+	xxlor VSR(VT0), VSXA, VSXA
+	xxlor VSR(VT1), VSXB, VSXB
+	xxlor VSR(VT2), VSXC, VSXC
+	xxlor VSR(VT3), VSXD, VSXD
+	xxlor VSR(VT4), VSXE, VSXE
+	xxlor VSR(SIGA), VSXF, VSXF
+	xxlor VSR(SIGE), VSXG, VSXG
+	xxlor VSR(VK), VSXH, VSXH
+	vadduwm VSA, VSA, VT0
+	vadduwm VSB, VSB, VT1
+	vadduwm VSC, VSC, VT2
+	vadduwm VSD, VSD, VT3
+	vadduwm VSE, VSE, VT4
+	vadduwm VSF, VSF, SIGA
+	vadduwm VSG, VSG, SIGE
+	vadduwm VSH, VSH, VK
It's a pity that there seems to be no useful xxadd* instruction. Do you need all eight temporary registers, or would you get the same speed doing just four at a time, i.e., 4 xxlor instructions, 4 vadduwm, 4 xxlor, 4 vadduwm? Is there no alias "xxmov" or the like that could be used instead of xxlor?
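I.e., something along these lines (untested; whether it's as fast depends on how well the two halves schedule):

	xxlor VSR(VT0), VSXA, VSXA
	xxlor VSR(VT1), VSXB, VSXB
	xxlor VSR(VT2), VSXC, VSXC
	xxlor VSR(VT3), VSXD, VSXD
	vadduwm VSA, VSA, VT0
	vadduwm VSB, VSB, VT1
	vadduwm VSC, VSC, VT2
	vadduwm VSD, VSD, VT3
	C reuse the same four temporaries for the second half
	xxlor VSR(VT0), VSXE, VSXE
	xxlor VSR(VT1), VSXF, VSXF
	xxlor VSR(VT2), VSXG, VSXG
	xxlor VSR(VT3), VSXH, VSXH
	vadduwm VSE, VSE, VT0
	vadduwm VSF, VSF, VT1
	vadduwm VSG, VSG, VT2
	vadduwm VSH, VSH, VT3

which would leave SIGA, SIGE and VK untouched here.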
Thanks for the update! /Niels