Originally, the 16 input words were loaded with 16 individual vector load
instructions. This has the side effect that the last three loads read 1, 2,
or 3 words past the end of the input data.
Fix the overread by replacing the unnecessary overlapped reads with shifts.
As a consequence, the constant registers for offsets 4, 8, and 12 can be
removed, which also gains about 1-2% in performance.
Signed-off-by: Eric Richter <erichte(a)linux.ibm.com>
---
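For readers less familiar with the vector instructions involved, here is a
rough, self-contained C model of the data movement described in the commit
message above (all names are hypothetical; this is not the Nettle code). The
old code issued four 16-byte loads per block at p+0, p+4, p+8 and p+12, so on
the final block the last three loads read 4, 8 or 12 bytes past the end of the
input. The new code issues a single in-bounds load and derives the other three
vectors with vsldoi, which with identical source operands is a left rotation
by the given number of bytes. The sketch only models where the bytes come
from; which lanes of the rotated vectors actually matter is left to the
surrounding round code.

#include <stdint.h>
#include <string.h>

typedef struct { uint8_t b[16]; } vec16;   /* toy stand-in for a vector reg */

/* One in-bounds 16-byte load (the lxvw4x at offset 0). */
static vec16 load16(const uint8_t *p)
{
  vec16 v;
  memcpy(v.b, p, 16);
  return v;
}

/* vsldoi v,v,v,n with identical operands: rotate the 16 bytes left by n. */
static vec16 rot_left_bytes(vec16 v, unsigned n)
{
  vec16 r;
  for (unsigned i = 0; i < 16; i++)
    r.b[i] = v.b[(i + n) % 16];
  return r;
}

/* Old: three further loads at p+4, p+8, p+12, overreading on the last block.
   New: derive those three vectors from the single in-bounds load instead. */
static void load_block(const uint8_t *p, vec16 iv[4])
{
  iv[0] = load16(p);
  iv[1] = rot_left_bytes(iv[0], 4);
  iv[2] = rot_left_bytes(iv[0], 8);
  iv[3] = rot_left_bytes(iv[0], 12);
}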
So this was supposed to be a simple update, removing those eight unnecessary
instructions. However, removing the GPR store instructions introduced a
performance regression of roughly -5% -- more performance lost than gained in
the previous version.
After a lot of trial and error, I've landed on this version, which includes
yet another align, this time before the register storage section. I tried
placing it in other locations and changing the align value; this is the most
performant version I could find through brute force. Moving the load-immediate
for TC16 to before the register save section marginally improved performance
(<1%) and, less technically, looks cleaner than doing:
ALIGN(16)
li TC16, 16
ALIGN(16)
lxvw4x
lxvw4x
When I have more time, I still intend to do a deeper dive and try to find
out exactly why these aligns help or hurt performance. At minimum I'd like to
send a patch with better comments documenting the logic, so future
implementations can learn from it.
v2:
- removed store/restore of now unused r14/r15
- moved the initial load of TC16 to be with other load immediates
- changed .align to use the ALIGN macro
- added an ALIGN before the register saving stores
- added comments to ALIGN
powerpc64/p8/sha256-compress-n.asm | 44 +++++++++++-------------------
1 file changed, 16 insertions(+), 28 deletions(-)
diff --git a/powerpc64/p8/sha256-compress-n.asm b/powerpc64/p8/sha256-compress-n.asm
index e08ae132..75666deb 100644
--- a/powerpc64/p8/sha256-compress-n.asm
+++ b/powerpc64/p8/sha256-compress-n.asm
@@ -44,10 +44,7 @@ define(`T1', `r8')
define(`TK', `r9')
define(`COUNT', `r10')
define(`TC0', `0') C Index instructions allow literal 0 instead of a GPR
-define(`TC4', `r11')
-define(`TC8', `r12')
-define(`TC12', `r14')
-define(`TC16', `r15')
+define(`TC16', `r11')
C State registers
define(`VSA', `v0')
@@ -187,24 +184,24 @@ define(`LOAD', `
define(`DOLOADS', `
IF_LE(`DATA_LOAD_VEC(VT0, .load_swap, T1)')
LOAD(0, TC0)
- LOAD(1, TC4)
- LOAD(2, TC8)
- LOAD(3, TC12)
+ vsldoi IV(1), IV(0), IV(0), 4
+ vsldoi IV(2), IV(0), IV(0), 8
+ vsldoi IV(3), IV(0), IV(0), 12
addi INPUT, INPUT, 16
LOAD(4, TC0)
- LOAD(5, TC4)
- LOAD(6, TC8)
- LOAD(7, TC12)
+ vsldoi IV(5), IV(4), IV(4), 4
+ vsldoi IV(6), IV(4), IV(4), 8
+ vsldoi IV(7), IV(4), IV(4), 12
addi INPUT, INPUT, 16
LOAD(8, TC0)
- LOAD(9, TC4)
- LOAD(10, TC8)
- LOAD(11, TC12)
+ vsldoi IV(9), IV(8), IV(8), 4
+ vsldoi IV(10), IV(8), IV(8), 8
+ vsldoi IV(11), IV(8), IV(8), 12
addi INPUT, INPUT, 16
LOAD(12, TC0)
- LOAD(13, TC4)
- LOAD(14, TC8)
- LOAD(15, TC12)
+ vsldoi IV(13), IV(12), IV(12), 4
+ vsldoi IV(14), IV(12), IV(12), 8
+ vsldoi IV(15), IV(12), IV(12), 12
addi INPUT, INPUT, 16
')
@@ -216,6 +213,8 @@ PROLOGUE(_nettle_sha256_compress_n)
C Store non-volatile registers
+ ALIGN(16) C Appears necessary for optimal stores
+ li TC16, 16
li T0, -16
li T1, -32
stvx v20, T0, SP
@@ -240,15 +239,8 @@ PROLOGUE(_nettle_sha256_compress_n)
subi T1, T1, 32
stvx v30, T0, SP
stvx v31, T1, SP
- subi T0, T0, 32
- subi T1, T1, 32
- stdx r14, T0, SP
- stdx r15, T1, SP
- li TC4, 4
- li TC8, 8
- li TC12, 12
- li TC16, 16
+ ALIGN(16) C Appears necessary for optimal loads
C Load state values
lxvw4x VSR(VSA), 0, STATE C VSA contains A,B,C,D
@@ -345,10 +337,6 @@ PROLOGUE(_nettle_sha256_compress_n)
subi T1, T1, 32
lvx v30, T0, SP
lvx v31, T1, SP
- subi T0, T0, 32
- subi T1, T1, 32
- ldx r14, T0, SP
- ldx r15, T1, SP
.done:
mr r3, INPUT
--
2.46.0
Originally, the 16 input words were loaded with 16 individual vector load
instructions. This has the side effect that the last three loads read 1, 2,
or 3 words past the end of the input data.
Fix the overread by replacing the unnecessary overlapped reads with shifts.
As a consequence, the constant registers for offsets 4, 8, and 12 can be
removed, which also gains about 1-2% in performance.
Signed-off-by: Eric Richter <erichte(a)linux.ibm.com>
---
Some additional comments, since I agree that the loads should be fixed, but
a bit of a performance anomaly appeared:
I thought I had tested this exact method in a previous iteration of the
patches, and saw either no change or a minor performance hit. This time
around I am seeing a slight performance improvement:
(This is output from a script that compares 10-iteration benchmark runs; the
values in parentheses are the difference from just changing the loads to
shifts.)
alg/mode                    min            avg            max
====================== ============== ============== ==============
sha256 update          460.64 (+2.45) 467.99 (+6.63) 468.89 (+7.13)
hmac-sha256 64 bytes   126.39 (+0.94) 126.86 (+0.14) 127.42 (+0.08)
hmac-sha256 256 bytes  272.14 (+2.97) 273.07 (+2.07) 273.60 (+1.44)
hmac-sha256 1024 bytes 391.52 (+2.00) 393.17 (+2.03) 395.58 (+3.13)
hmac-sha256 4096 bytes 440.61 (+3.22) 442.46 (+4.50) 444.15 (+5.73)
hmac-sha256 single msg 454.75 (+1.23) 457.51 (+3.85) 459.70 (+5.91)
However, removing the load-immediates for the now unused TC registers hurts
performance by about the amount gained from using the shifts:
Adding this change:
- li TC4, 4
- li TC8, 8
- li TC12, 12
li TC16, 16
yields the following performance values (parens are difference against the
code that did not remove the li's but did use shifts):
alg/mode                    min            avg            max
====================== ============== ============== ==============
sha256 update          462.47 (+1.83) 462.75 (-5.24) 462.87 (-6.02)
hmac-sha256 64 bytes   126.01 (-0.38) 126.69 (-0.17) 127.00 (-0.42)
hmac-sha256 256 bytes  270.13 (-2.01) 271.62 (-1.44) 272.99 (-0.61)
hmac-sha256 1024 bytes 390.79 (-0.73) 391.84 (-1.33) 393.04 (-2.54)
hmac-sha256 4096 bytes 438.60 (-2.01) 438.83 (-3.63) 439.30 (-4.85)
hmac-sha256 single msg 454.52 (-0.23) 454.63 (-2.88) 454.76 (-4.94)
If I add an align after the TC16 load:
li TC16, 16
+ .align 4
C Load state values
lxvw4x VSR(VSA), 0, STATE C VSA contains A,B,C,D
lxvw4x VSR(VSE), TC16, STATE C VSE contains E,F,G,H
performance comes back to about where it was before removing the TC
load-immediates (parens show the difference against the previous run):
alg/mode                    min            avg            max
====================== ============== ============== ==============
sha256 update          468.29 (+5.82) 468.74 (+5.99) 468.86 (+5.99)
hmac-sha256 64 bytes   126.51 (+0.50) 127.26 (+0.57) 127.78 (+0.78)
hmac-sha256 256 bytes  273.17 (+3.04) 274.51 (+2.89) 276.24 (+3.25)
hmac-sha256 1024 bytes 393.27 (+2.48) 395.18 (+3.34) 397.87 (+4.83)
hmac-sha256 4096 bytes 441.08 (+2.48) 443.21 (+4.38) 445.09 (+5.79)
hmac-sha256 single msg 454.41 (-0.11) 458.30 (+3.67) 460.82 (+6.06)
I suspect that with four li instructions, those are issued 4x in parallel, and
the subsequent (slower) lxvw4x instructions are then queued 2x. With the other
three li instructions removed, the remaining li is queued with the first
lxvw4x but not the second, causing a stall: the second lxvw4x has to wait for
the li + lxvw4x pair ahead of it, since it depends on the li (which sets TC16)
completing first.
This is only my running theory, but unfortunately I don't currently have
the bandwidth to do a deep dive for a <2% performance mystery (that said,
I do intend to return to this when I get the chance, I suspect other areas
of the code could also use a similar level of scrutiny). Do you have any
thoughts or advice on this, or is the proposed approach acceptable for now?
Additional note: I also tried rearranging the LOAD macros with the shifts, as
well as moving around the requisite byte-swap vperms, but did not see any
performance benefit. Doing the load, vperm, shift, addi in that order appears
to be the fastest.
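To make that ordering concrete, here is a minimal self-contained C sketch of
the per-block sequence (load, byte swap, shifts, pointer advance). All names
are hypothetical and the vperm permutation is not reproduced; the byte swap is
assumed here to be a per-32-bit-word reversal applied only on little-endian,
which is what the .load_swap constant feeds in the real code.

#include <stdint.h>
#include <string.h>

typedef struct { uint8_t b[16]; } vec16;   /* toy stand-in for a vector reg */

/* vsldoi v,v,v,n analogue: rotate the 16 bytes left by n. */
static vec16 rot_left_bytes(vec16 v, unsigned n)
{
  vec16 r;
  for (unsigned i = 0; i < 16; i++)
    r.b[i] = v.b[(i + n) % 16];
  return r;
}

/* Assumed model of the byte-swap vperm: reverse the bytes of each word. */
static vec16 bswap_words(vec16 v)
{
  vec16 r;
  for (unsigned w = 0; w < 4; w++)
    for (unsigned i = 0; i < 4; i++)
      r.b[4*w + i] = v.b[4*w + 3 - i];
  return r;
}

/* Per-block order reported fastest: load, vperm, shifts, advance INPUT. */
static void do_loads(const uint8_t *input, vec16 iv[16], int little_endian)
{
  for (unsigned blk = 0; blk < 4; blk++) {
    vec16 v;
    memcpy(v.b, input, 16);                  /* lxvw4x (in bounds)      */
    if (little_endian)
      v = bswap_words(v);                    /* vperm byte swap         */
    iv[4*blk + 0] = v;
    iv[4*blk + 1] = rot_left_bytes(v, 4);    /* vsldoi by 4, 8, 12      */
    iv[4*blk + 2] = rot_left_bytes(v, 8);
    iv[4*blk + 3] = rot_left_bytes(v, 12);
    input += 16;                             /* addi INPUT, INPUT, 16   */
  }
}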
powerpc64/p8/sha256-compress-n.asm | 33 +++++++++++++-----------------
1 file changed, 14 insertions(+), 19 deletions(-)
diff --git a/powerpc64/p8/sha256-compress-n.asm b/powerpc64/p8/sha256-compress-n.asm
index e08ae132..d3416e2a 100644
--- a/powerpc64/p8/sha256-compress-n.asm
+++ b/powerpc64/p8/sha256-compress-n.asm
@@ -44,10 +44,7 @@ define(`T1', `r8')
define(`TK', `r9')
define(`COUNT', `r10')
define(`TC0', `0') C Index instructions allow literal 0 instead of a GPR
-define(`TC4', `r11')
-define(`TC8', `r12')
-define(`TC12', `r14')
-define(`TC16', `r15')
+define(`TC16', `r11')
C State registers
define(`VSA', `v0')
@@ -187,24 +184,24 @@ define(`LOAD', `
define(`DOLOADS', `
IF_LE(`DATA_LOAD_VEC(VT0, .load_swap, T1)')
LOAD(0, TC0)
- LOAD(1, TC4)
- LOAD(2, TC8)
- LOAD(3, TC12)
+ vsldoi IV(1), IV(0), IV(0), 4
+ vsldoi IV(2), IV(0), IV(0), 8
+ vsldoi IV(3), IV(0), IV(0), 12
addi INPUT, INPUT, 16
LOAD(4, TC0)
- LOAD(5, TC4)
- LOAD(6, TC8)
- LOAD(7, TC12)
+ vsldoi IV(5), IV(4), IV(4), 4
+ vsldoi IV(6), IV(4), IV(4), 8
+ vsldoi IV(7), IV(4), IV(4), 12
addi INPUT, INPUT, 16
LOAD(8, TC0)
- LOAD(9, TC4)
- LOAD(10, TC8)
- LOAD(11, TC12)
+ vsldoi IV(9), IV(8), IV(8), 4
+ vsldoi IV(10), IV(8), IV(8), 8
+ vsldoi IV(11), IV(8), IV(8), 12
addi INPUT, INPUT, 16
LOAD(12, TC0)
- LOAD(13, TC4)
- LOAD(14, TC8)
- LOAD(15, TC12)
+ vsldoi IV(13), IV(12), IV(12), 4
+ vsldoi IV(14), IV(12), IV(12), 8
+ vsldoi IV(15), IV(12), IV(12), 12
addi INPUT, INPUT, 16
')
@@ -245,10 +242,8 @@ PROLOGUE(_nettle_sha256_compress_n)
stdx r14, T0, SP
stdx r15, T1, SP
- li TC4, 4
- li TC8, 8
- li TC12, 12
li TC16, 16
+ .align 4
C Load state values
lxvw4x VSR(VSA), 0, STATE C VSA contains A,B,C,D
--
2.46.0