On Fri, 2024-06-07 at 14:08 +0200, Niels Möller wrote:
Eric Richter <erichte@linux.ibm.com> writes:
+C ROUND(A B C D E F G H R EXT)
+define(`ROUND', `
- vadduwm VT1, VK, IV($9) C VT1: k+W
- vadduwm VT4, $8, VT1 C VT4: H+k+W
- lxvw4x VSR(VK), TK, K C Load Key
- addi TK, TK, 4 C Increment Pointer to next key
- vadduwm VT2, $4, $8 C VT2: H+D
- vadduwm VT2, VT2, VT1 C VT2: H+D+k+W
Could the above two instructions be changed to
vadduwm VT2, VT4, $4 C Should be the same,(H+k+W) + D
(which would need one less register)? I realize there's a slight change in the dependency chain. Do you know how many cycles one of these rounds takes, and what's the bottleneck (I would guess either the latency of the dependency chain between rounds, the throughput of one of the execution units, or the instruction issue rate)?
Theoretically it should be about 10 cycles per round, but the actual measured performance doesn't quite hit that due to various quirks with scheduling.
With this change, I'm getting about a +1 MB/s gain on the 256-byte hmac case, but a slight loss of speed everywhere else.
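For the record, the combined sequence I measured is roughly the following (untested beyond the benchmark above; same temporaries as in the quoted hunk):

	vadduwm VT1, VK, IV($9)	C VT1: k+W
	vadduwm VT4, $8, VT1	C VT4: H+k+W
	vadduwm VT2, VT4, $4	C VT2: (H+k+W)+D, replaces the H+D / +k+W pair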
+define(`LOAD', `
- IF_BE(`lxvw4x VSR(IV($1)), m4_unquote(TC`'eval(($1 % 4) * 4)), INPUT')
- IF_LE(`
- lxvd2x VSR(IV($1)), m4_unquote(TC`'eval(($1 % 4) * 4)), INPUT
- vperm IV($1), IV($1), IV($1), VT0
- ')
+')
+define(`DOLOADS', `
- IF_LE(`DATA_LOAD_VEC(VT0, .load_swap, T1)')
- LOAD(0)
- LOAD(1)
- LOAD(2)
- LOAD(3)
If you pass the right TCx register as an argument to the load macro, you don't need the m4 eval thing, which could make it a bit more readable, imo.
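If I'm reading the suggestion right, that would look something like this (untested sketch; TC0/TC4/TC8/TC12 are the index registers the eval expands to today):

define(`LOAD', `
	IF_BE(`lxvw4x VSR(IV($1)), $2, INPUT')
	IF_LE(`
	lxvd2x VSR(IV($1)), $2, INPUT
	vperm IV($1), IV($1), IV($1), VT0
	')
')

define(`DOLOADS', `
	IF_LE(`DATA_LOAD_VEC(VT0, .load_swap, T1)')
	LOAD(0, TC0)
	LOAD(1, TC4)
	LOAD(2, TC8)
	LOAD(3, TC12)
')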
- C Store non-volatile registers
- li T0, -8
- li T1, -24
- stvx v20, T0, SP
- stvx v21, T1, SP
- subi T0, T0, 32
- subi T1, T1, 32
This could probably be arranged with fewer instructions by having one register that is decremented as we move down in the guard area, and registers with constant values for indexing.
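If I follow, that would be roughly (untested sketch; one moving pointer into the guard area plus a constant -16 offset):

	li	T1, -16		C constant offset for the second store of each pair
	subi	T0, SP, 16	C moving pointer, decremented as we go down
	stvx	v20, 0, T0	C SP-16
	stvx	v21, T1, T0	C SP-32
	subi	T0, T0, 32
	stvx	v22, 0, T0	C SP-48
	stvx	v23, T1, T0	C SP-64
	C ...and so on for the remaining non-volatile vector registers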
- C Reload initial state from VSX registers
- xxlor VSR(VT0), VSXA, VSXA
- xxlor VSR(VT1), VSXB, VSXB
- xxlor VSR(VT2), VSXC, VSXC
- xxlor VSR(VT3), VSXD, VSXD
- xxlor VSR(VT4), VSXE, VSXE
- xxlor VSR(SIGA), VSXF, VSXF
- xxlor VSR(SIGE), VSXG, VSXG
- xxlor VSR(VK), VSXH, VSXH
- vadduwm VSA, VSA, VT0
- vadduwm VSB, VSB, VT1
- vadduwm VSC, VSC, VT2
- vadduwm VSD, VSD, VT3
- vadduwm VSE, VSE, VT4
- vadduwm VSF, VSF, SIGA
- vadduwm VSG, VSG, SIGE
- vadduwm VSH, VSH, VK
It's a pity that there seem to be no useful xxadd* instructions. Do you need all eight temporary registers, or would you get the same speed doing just four at a time, i.e., 4 xxlor instructions, 4 vadduwm, 4 xxlor, 4 vadduwm? There's no alias "xxmov" or the like that could be used instead of xxlor?
Unfortunately, most of the VSX instructions (particularly those in the p8 ISA) are for floating-point operations; using them in this way is a bit of a hack. I'll test four at a time, but performance will likely be similar unless the xxlor instructions are issued on a different unit.
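The four-at-a-time variant would look roughly like this (untested, just the interleaving described above with VT0-VT3 reused):

	xxlor	VSR(VT0), VSXA, VSXA
	xxlor	VSR(VT1), VSXB, VSXB
	xxlor	VSR(VT2), VSXC, VSXC
	xxlor	VSR(VT3), VSXD, VSXD
	vadduwm	VSA, VSA, VT0
	vadduwm	VSB, VSB, VT1
	vadduwm	VSC, VSC, VT2
	vadduwm	VSD, VSD, VT3
	xxlor	VSR(VT0), VSXE, VSXE
	xxlor	VSR(VT1), VSXF, VSXF
	xxlor	VSR(VT2), VSXG, VSXG
	xxlor	VSR(VT3), VSXH, VSXH
	vadduwm	VSE, VSE, VT0
	vadduwm	VSF, VSF, VT1
	vadduwm	VSG, VSG, VT2
	vadduwm	VSH, VSH, VT3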
I'm not aware of an xxmov/xxmr extended mnemonic, but this could always be macroed instead for clarity.
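For example (untested; "xxmr" here is just an m4 macro name, not an assembler mnemonic):

C Move a VSX register: xxmr(dst, src) expands to xxlor dst, src, src
define(`xxmr', `xxlor $1, $2, $2')

	xxmr(VSR(VT0), VSXA)
	xxmr(VSR(VT1), VSXB)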
Thanks for the update! /Niels
Thanks for merging! I'll have a clean-up patch up soon, hopefully with the SHA512 implementation as well.