On Sun, May 23, 2021 at 10:52 AM Niels Möller nisse@lysator.liu.se wrote:
Maamoun TK maamoun.tk@googlemail.com writes:
This patch optimizes the SHA-1 compress function for the arm64 architecture by taking advantage of the SHA-1 instructions of the Armv8 crypto extension. The SHA-1 instructions:
SHA1C: SHA1 hash update (choose)
SHA1H: SHA1 fixed rotate
SHA1M: SHA1 hash update (majority)
SHA1P: SHA1 hash update (parity)
SHA1SU0: SHA1 schedule update 0
SHA1SU1: SHA1 schedule update 1
Can you add this brief summary of instructions as a comment in the asm file?
Done! I'll attach a patch at the end of the message that performs slightly better as well.
Algorithm         mode            Mbyte/s
sha1              update          800.80
openssl sha1      update          849.17
hmac-sha1         64 bytes        166.10
hmac-sha1         256 bytes       409.24
hmac-sha1         1024 bytes      636.98
hmac-sha1         4096 bytes      739.20
hmac-sha1         single msg      775.67
Benchmark on gcc117 instance of CFarm before applying the patch:
Algorithm         mode            Mbyte/s
sha1              update          214.16
openssl sha1      update          849.44
Benchmark on gcc117 instance of CFarm after applying the patch:

Algorithm         mode            Mbyte/s
sha1              update          795.57
openssl sha1      update          849.25
Great speedup! Any idea why openssl is still slightly faster?
Sure. The OpenSSL implementation uses a loop inside the SHA1 update function, which eliminates the constant initialization and the state loading/storing for each block, while nettle does that for every block iteration.
+define(`TMP0', `v21') +define(`TMP1', `v22')
Not sure I understand how these are used, but it looks like the TMP variables are used in some way for the message expansion state? E.g., TMP0 is assigned in the code for rounds 0-3, and this value is used in the code for rounds 8-11. Other implementations don't need extra state for this, but just modify the 16 message words in-place.
Modifying the message words in-place will change the value used by the 'sha1su0' and 'sha1su1' instructions. According to the ARM® A64 Instruction Set Architecture:

SHA1SU0 <Vd>.4S, <Vn>.4S, <Vm>.4S
    <Vd> is the name of the SIMD&FP source and destination register

SHA1SU1 <Vd>.4S, <Vn>.4S
    <Vd> is the name of the SIMD&FP source and destination register
So using a TMP variable is necessary here. I can't think of any replacement; let me know how the other implementations handle this case.
It would be nice to either make the TMP registers more temporary (i.e.,
no round depends on the value in these registers from previous rounds), keeping needed state only in the MSG variables, or rename them to give a better hint on how they're used.
Done! Yields a slight performance increase, btw.
+C void nettle_sha1_compress(uint32_t *state, const uint8_t *input)
+PROLOGUE(nettle_sha1_compress)
- C Initialize constants
- mov w2,#0x7999
- movk w2,#0x5A82,lsl #16
- dup CONST0.4s,w2
- mov w2,#0xEBA1
- movk w2,#0x6ED9,lsl #16
- dup CONST1.4s,w2
- mov w2,#0xBCDC
- movk w2,#0x8F1B,lsl #16
- dup CONST2.4s,w2
- mov w2,#0xC1D6
- movk w2,#0xCA62,lsl #16
- dup CONST3.4s,w2
Maybe it would be clearer or more efficient to load these from memory? Not sure if there's a nice and concise way to load the four 32-bit values into a 128-bit register, and then copy/duplicate them into the four const registers.
We can load all the constants (including duplicate values) from memory with one instruction. The issue is how to get the data address properly for every supported ABI! So far I've seen solutions with multiple paths for different ABIs, which I don't really like; the easiest solution is to define the data in the .text section to make sure the address is near enough to be loaded with one instruction. Do you want me to do that?
- C Load message
- ld1 {MSG0.16b,MSG1.16b,MSG2.16b,MSG3.16b},[INPUT]
- C Reverse for little endian
- rev32 MSG0.16b,MSG0.16b
- rev32 MSG1.16b,MSG1.16b
- rev32 MSG2.16b,MSG2.16b
- rev32 MSG3.16b,MSG3.16b
How does this work on big-endian? The ld1 with .16b is endian-neutral (according to the README), which means we always get the wrong order there, and then we do unconditional byteswapping? Maybe add a comment. Not sure if it's worth the effort to make it work differently (ld1 .4s on big-endian)? It's going to be a pretty small fraction of the per-block processing.
We had an intensive discussion about that in the GCM patch. The short story: this patch should work well for both endianness modes. However, it's not the same way the GCM patch handles the endianness variation; to follow the GCM patch's way we can do:
	C Load message
	ld1	{MSG0.4s,MSG1.4s,MSG2.4s,MSG3.4s},[INPUT]

	C Reverse for little endian
IF_LE(`
	rev32	MSG0.16b,MSG0.16b
	rev32	MSG1.16b,MSG1.16b
	rev32	MSG2.16b,MSG2.16b
	rev32	MSG3.16b,MSG3.16b
')
regards, Mamone
---
 arm64/crypto/sha1-compress.asm | 93 +++++++++++++++++++++++-------------------
 1 file changed, 50 insertions(+), 43 deletions(-)

diff --git a/arm64/crypto/sha1-compress.asm b/arm64/crypto/sha1-compress.asm
index f261c93d..9f7d9f37 100644
--- a/arm64/crypto/sha1-compress.asm
+++ b/arm64/crypto/sha1-compress.asm
@@ -30,6 +30,15 @@ ifelse(`
    not, see http://www.gnu.org/licenses/.
')

+C This implementation uses the SHA-1 instructions of Armv8 crypto
+C extension.
+C SHA1C: SHA1 hash update (choose)
+C SHA1H: SHA1 fixed rotate
+C SHA1M: SHA1 hash update (majority)
+C SHA1P: SHA1 hash update (parity)
+C SHA1SU0: SHA1 schedule update 0
+C SHA1SU1: SHA1 schedule update 1
+
 .file "sha1-compress.asm"
 .arch armv8-a+crypto

@@ -53,8 +62,7 @@ define(`ABCD_SAVED', `v17')
 define(`E0', `v18')
 define(`E0_SAVED', `v19')
 define(`E1', `v20')
-define(`TMP0', `v21')
-define(`TMP1', `v22')
+define(`TMP', `v21')

 C void nettle_sha1_compress(uint32_t *state, const uint8_t *input)

@@ -92,140 +100,139 @@ PROLOGUE(nettle_sha1_compress)
 	rev32	MSG2.16b,MSG2.16b
 	rev32	MSG3.16b,MSG3.16b

-	add	TMP0.4s,MSG0.4s,CONST0.4s
-	add	TMP1.4s,MSG1.4s,CONST0.4s
-
 	C Rounds 0-3
+	add	TMP.4s,MSG0.4s,CONST0.4s
 	sha1h	SFP(E1),SFP(ABCD)
-	sha1c	QFP(ABCD),SFP(E0),TMP0.4s
-	add	TMP0.4s,MSG2.4s,CONST0.4s
+	sha1c	QFP(ABCD),SFP(E0),TMP.4s
 	sha1su0	MSG0.4s,MSG1.4s,MSG2.4s

 	C Rounds 4-7
+	add	TMP.4s,MSG1.4s,CONST0.4s
 	sha1h	SFP(E0),SFP(ABCD)
-	sha1c	QFP(ABCD),SFP(E1),TMP1.4s
-	add	TMP1.4s,MSG3.4s,CONST0.4s
+	sha1c	QFP(ABCD),SFP(E1),TMP.4s
 	sha1su1	MSG0.4s,MSG3.4s
 	sha1su0	MSG1.4s,MSG2.4s,MSG3.4s

 	C Rounds 8-11
+	add	TMP.4s,MSG2.4s,CONST0.4s
 	sha1h	SFP(E1),SFP(ABCD)
-	sha1c	QFP(ABCD),SFP(E0),TMP0.4s
-	add	TMP0.4s,MSG0.4s,CONST0.4s
+	sha1c	QFP(ABCD),SFP(E0),TMP.4s
 	sha1su1	MSG1.4s,MSG0.4s
 	sha1su0	MSG2.4s,MSG3.4s,MSG0.4s

 	C Rounds 12-15
+	add	TMP.4s,MSG3.4s,CONST0.4s
 	sha1h	SFP(E0),SFP(ABCD)
-	sha1c	QFP(ABCD),SFP(E1),TMP1.4s
-	add	TMP1.4s,MSG1.4s,CONST1.4s
+	sha1c	QFP(ABCD),SFP(E1),TMP.4s
 	sha1su1	MSG2.4s,MSG1.4s
 	sha1su0	MSG3.4s,MSG0.4s,MSG1.4s

 	C Rounds 16-19
+	add	TMP.4s,MSG0.4s,CONST0.4s
 	sha1h	SFP(E1),SFP(ABCD)
-	sha1c	QFP(ABCD),SFP(E0),TMP0.4s
-	add	TMP0.4s,MSG2.4s,CONST1.4s
+	sha1c	QFP(ABCD),SFP(E0),TMP.4s
 	sha1su1	MSG3.4s,MSG2.4s
 	sha1su0	MSG0.4s,MSG1.4s,MSG2.4s

 	C Rounds 20-23
+	add	TMP.4s,MSG1.4s,CONST1.4s
 	sha1h	SFP(E0),SFP(ABCD)
-	sha1p	QFP(ABCD),SFP(E1),TMP1.4s
-	add	TMP1.4s,MSG3.4s,CONST1.4s
+	sha1p	QFP(ABCD),SFP(E1),TMP.4s
 	sha1su1	MSG0.4s,MSG3.4s
 	sha1su0	MSG1.4s,MSG2.4s,MSG3.4s

 	C Rounds 24-27
+	add	TMP.4s,MSG2.4s,CONST1.4s
 	sha1h	SFP(E1),SFP(ABCD)
-	sha1p	QFP(ABCD),SFP(E0),TMP0.4s
-	add	TMP0.4s,MSG0.4s,CONST1.4s
+	sha1p	QFP(ABCD),SFP(E0),TMP.4s
 	sha1su1	MSG1.4s,MSG0.4s
 	sha1su0	MSG2.4s,MSG3.4s,MSG0.4s

 	C Rounds 28-31
+	add	TMP.4s,MSG3.4s,CONST1.4s
 	sha1h	SFP(E0),SFP(ABCD)
-	sha1p	QFP(ABCD),SFP(E1),TMP1.4s
-	add	TMP1.4s,MSG1.4s,CONST1.4s
+	sha1p	QFP(ABCD),SFP(E1),TMP.4s
 	sha1su1	MSG2.4s,MSG1.4s
 	sha1su0	MSG3.4s,MSG0.4s,MSG1.4s

 	C Rounds 32-35
+	add	TMP.4s,MSG0.4s,CONST1.4s
 	sha1h	SFP(E1),SFP(ABCD)
-	sha1p	QFP(ABCD),SFP(E0),TMP0.4s
-	add	TMP0.4s,MSG2.4s,CONST2.4s
+	sha1p	QFP(ABCD),SFP(E0),TMP.4s
 	sha1su1	MSG3.4s,MSG2.4s
 	sha1su0	MSG0.4s,MSG1.4s,MSG2.4s

 	C Rounds 36-39
+	add	TMP.4s,MSG1.4s,CONST1.4s
 	sha1h	SFP(E0),SFP(ABCD)
-	sha1p	QFP(ABCD),SFP(E1),TMP1.4s
-	add	TMP1.4s,MSG3.4s,CONST2.4s
+	sha1p	QFP(ABCD),SFP(E1),TMP.4s
 	sha1su1	MSG0.4s,MSG3.4s
 	sha1su0	MSG1.4s,MSG2.4s,MSG3.4s

 	C Rounds 40-43
+	add	TMP.4s,MSG2.4s,CONST2.4s
 	sha1h	SFP(E1),SFP(ABCD)
-	sha1m	QFP(ABCD),SFP(E0),TMP0.4s
-	add	TMP0.4s,MSG0.4s,CONST2.4s
+	sha1m	QFP(ABCD),SFP(E0),TMP.4s
 	sha1su1	MSG1.4s,MSG0.4s
 	sha1su0	MSG2.4s,MSG3.4s,MSG0.4s

 	C Rounds 44-47
+	add	TMP.4s,MSG3.4s,CONST2.4s
 	sha1h	SFP(E0),SFP(ABCD)
-	sha1m	QFP(ABCD),SFP(E1),TMP1.4s
-	add	TMP1.4s,MSG1.4s,CONST2.4s
+	sha1m	QFP(ABCD),SFP(E1),TMP.4s
 	sha1su1	MSG2.4s,MSG1.4s
 	sha1su0	MSG3.4s,MSG0.4s,MSG1.4s

 	C Rounds 48-51
+	add	TMP.4s,MSG0.4s,CONST2.4s
 	sha1h	SFP(E1),SFP(ABCD)
-	sha1m	QFP(ABCD),SFP(E0),TMP0.4s
-	add	TMP0.4s,MSG2.4s,CONST2.4s
+	sha1m	QFP(ABCD),SFP(E0),TMP.4s
 	sha1su1	MSG3.4s,MSG2.4s
 	sha1su0	MSG0.4s,MSG1.4s,MSG2.4s

 	C Rounds 52-55
+	add	TMP.4s,MSG1.4s,CONST2.4s
 	sha1h	SFP(E0),SFP(ABCD)
-	sha1m	QFP(ABCD),SFP(E1),TMP1.4s
-	add	TMP1.4s,MSG3.4s,CONST3.4s
+	sha1m	QFP(ABCD),SFP(E1),TMP.4s
 	sha1su1	MSG0.4s,MSG3.4s
 	sha1su0	MSG1.4s,MSG2.4s,MSG3.4s

 	C Rounds 56-59
+	add	TMP.4s,MSG2.4s,CONST2.4s
 	sha1h	SFP(E1),SFP(ABCD)
-	sha1m	QFP(ABCD),SFP(E0),TMP0.4s
-	add	TMP0.4s,MSG0.4s,CONST3.4s
+	sha1m	QFP(ABCD),SFP(E0),TMP.4s
 	sha1su1	MSG1.4s,MSG0.4s
 	sha1su0	MSG2.4s,MSG3.4s,MSG0.4s

 	C Rounds 60-63
+	add	TMP.4s,MSG3.4s,CONST3.4s
 	sha1h	SFP(E0),SFP(ABCD)
-	sha1p	QFP(ABCD),SFP(E1),TMP1.4s
-	add	TMP1.4s,MSG1.4s,CONST3.4s
+	sha1p	QFP(ABCD),SFP(E1),TMP.4s
 	sha1su1	MSG2.4s,MSG1.4s
 	sha1su0	MSG3.4s,MSG0.4s,MSG1.4s

 	C Rounds 64-67
+	add	TMP.4s,MSG0.4s,CONST3.4s
 	sha1h	SFP(E1),SFP(ABCD)
-	sha1p	QFP(ABCD),SFP(E0),TMP0.4s
-	add	TMP0.4s,MSG2.4s,CONST3.4s
+	sha1p	QFP(ABCD),SFP(E0),TMP.4s
 	sha1su1	MSG3.4s,MSG2.4s
 	sha1su0	MSG0.4s,MSG1.4s,MSG2.4s

 	C Rounds 68-71
+	add	TMP.4s,MSG1.4s,CONST3.4s
 	sha1h	SFP(E0),SFP(ABCD)
-	sha1p	QFP(ABCD),SFP(E1),TMP1.4s
-	add	TMP1.4s,MSG3.4s,CONST3.4s
+	sha1p	QFP(ABCD),SFP(E1),TMP.4s
 	sha1su1	MSG0.4s,MSG3.4s

 	C Rounds 72-75
+	add	TMP.4s,MSG2.4s,CONST3.4s
 	sha1h	SFP(E1),SFP(ABCD)
-	sha1p	QFP(ABCD),SFP(E0),TMP0.4s
+	sha1p	QFP(ABCD),SFP(E0),TMP.4s

 	C Rounds 76-79
+	add	TMP.4s,MSG3.4s,CONST3.4s
 	sha1h	SFP(E0),SFP(ABCD)
-	sha1p	QFP(ABCD),SFP(E1),TMP1.4s
+	sha1p	QFP(ABCD),SFP(E1),TMP.4s

 	C Combine state
 	add	E0.4s,E0.4s,E0_SAVED.4s