This patch optimizes the SHA1 compress function for the arm64 architecture by taking advantage of the SHA-1 instructions of the Armv8 crypto extension. The SHA-1 instructions:
SHA1C: SHA1 hash update (choose)
SHA1H: SHA1 fixed rotate
SHA1M: SHA1 hash update (majority)
SHA1P: SHA1 hash update (parity)
SHA1SU0: SHA1 schedule update 0
SHA1SU1: SHA1 schedule update 1
The patch is based on sha1-arm.c - ARMv8 SHA extensions using C intrinsics - of the repository https://github.com/noloader/SHA-Intrinsics by Jeffrey Walton.
The patch passes the nettle testsuite, and the benchmark numbers improve considerably, but the performance of the overall sha1 hash function still doesn't surpass the corresponding OpenSSL numbers.
Benchmark on gcc117 instance of CFarm before applying the patch:

Algorithm     mode        Mbyte/s
sha1          update       214.16
openssl sha1  update       849.44
hmac-sha1     64 bytes      61.69
hmac-sha1     256 bytes    131.50
hmac-sha1     1024 bytes   185.20
hmac-sha1     4096 bytes   204.55
hmac-sha1     single msg   210.97
Benchmark on gcc117 instance of CFarm after applying the patch:

Algorithm     mode        Mbyte/s
sha1          update       795.57
openssl sha1  update       849.25
hmac-sha1     64 bytes     167.65
hmac-sha1     256 bytes    408.24
hmac-sha1     1024 bytes   636.68
hmac-sha1     4096 bytes   739.42
hmac-sha1     single msg   775.89
---
 arm64/crypto/sha1-compress.asm | 245 +++++++++++++++++++++++++++++++++++++++++
 arm64/machine.m4               |   7 ++
 2 files changed, 252 insertions(+)
 create mode 100644 arm64/crypto/sha1-compress.asm
diff --git a/arm64/crypto/sha1-compress.asm b/arm64/crypto/sha1-compress.asm
new file mode 100644
index 00000000..bb3f1d35
--- /dev/null
+++ b/arm64/crypto/sha1-compress.asm
@@ -0,0 +1,245 @@
+C arm64/crypto/sha1-compress.asm
+
+ifelse(`
+   Copyright (C) 2021 Mamone Tarsha
+
+   Based on sha1-arm.c - ARMv8 SHA extensions using C intrinsics of
+   repository https://github.com/noloader/SHA-Intrinsics
+   sha1-arm.c is written and placed in public domain by Jeffrey Walton,
+   based on code from ARM, and by Johannes Schneiders, Skip
+   Hovsmith and Barry O'Rourke for the mbedTLS project.
+
+   This file is part of GNU Nettle.
+
+   GNU Nettle is free software: you can redistribute it and/or
+   modify it under the terms of either:
+
+     * the GNU Lesser General Public License as published by the Free
+       Software Foundation; either version 3 of the License, or (at your
+       option) any later version.
+
+   or
+
+     * the GNU General Public License as published by the Free
+       Software Foundation; either version 2 of the License, or (at your
+       option) any later version.
+
+   or both in parallel, as here.
+
+   GNU Nettle is distributed in the hope that it will be useful,
+   but WITHOUT ANY WARRANTY; without even the implied warranty of
+   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+   General Public License for more details.
+
+   You should have received copies of the GNU General Public License and
+   the GNU Lesser General Public License along with this program.  If
+   not, see http://www.gnu.org/licenses/.
+')
+
+.file "sha1-compress.asm"
+.arch armv8-a+crypto
+
+.text
+
+C Register usage:
+
+define(`STATE', `x0')
+define(`INPUT', `x1')
+
+define(`CONST0', `v0')
+define(`CONST1', `v1')
+define(`CONST2', `v2')
+define(`CONST3', `v3')
+define(`MSG0', `v4')
+define(`MSG1', `v5')
+define(`MSG2', `v6')
+define(`MSG3', `v7')
+define(`ABCD', `v16')
+define(`ABCD_SAVED', `v17')
+define(`E0', `v18')
+define(`E0_SAVED', `v19')
+define(`E1', `v20')
+define(`TMP0', `v21')
+define(`TMP1', `v22')
+
+C void nettle_sha1_compress(uint32_t *state, const uint8_t *input)
+
+PROLOGUE(nettle_sha1_compress)
+    C Initialize constants
+    mov w2,#0x7999
+    movk w2,#0x5A82,lsl #16
+    dup CONST0.4s,w2
+    mov w2,#0xEBA1
+    movk w2,#0x6ED9,lsl #16
+    dup CONST1.4s,w2
+    mov w2,#0xBCDC
+    movk w2,#0x8F1B,lsl #16
+    dup CONST2.4s,w2
+    mov w2,#0xC1D6
+    movk w2,#0xCA62,lsl #16
+    dup CONST3.4s,w2
+
+    C Load state
+    add x2,STATE,#16
+    movi E0.4s,#0
+    ld1 {ABCD.4s},[STATE]
+    ld1 {E0.s}[0],[x2]
+
+    C Save state
+    mov ABCD_SAVED.16b,ABCD.16b
+    mov E0_SAVED.16b,E0.16b
+
+    C Load message
+    ld1 {MSG0.16b,MSG1.16b,MSG2.16b,MSG3.16b},[INPUT]
+
+    C Reverse for little endian
+    rev32 MSG0.16b,MSG0.16b
+    rev32 MSG1.16b,MSG1.16b
+    rev32 MSG2.16b,MSG2.16b
+    rev32 MSG3.16b,MSG3.16b
+
+    add TMP0.4s,MSG0.4s,CONST0.4s
+    add TMP1.4s,MSG1.4s,CONST0.4s
+
+    C Rounds 0-3
+    sha1h SFP(E1),SFP(ABCD)
+    sha1c QFP(ABCD),SFP(E0),TMP0.4s
+    add TMP0.4s,MSG2.4s,CONST0.4s
+    sha1su0 MSG0.4s,MSG1.4s,MSG2.4s
+
+    C Rounds 4-7
+    sha1h SFP(E0),SFP(ABCD)
+    sha1c QFP(ABCD),SFP(E1),TMP1.4s
+    add TMP1.4s,MSG3.4s,CONST0.4s
+    sha1su1 MSG0.4s,MSG3.4s
+    sha1su0 MSG1.4s,MSG2.4s,MSG3.4s
+
+    C Rounds 8-11
+    sha1h SFP(E1),SFP(ABCD)
+    sha1c QFP(ABCD),SFP(E0),TMP0.4s
+    add TMP0.4s,MSG0.4s,CONST0.4s
+    sha1su1 MSG1.4s,MSG0.4s
+    sha1su0 MSG2.4s,MSG3.4s,MSG0.4s
+
+    C Rounds 12-15
+    sha1h SFP(E0),SFP(ABCD)
+    sha1c QFP(ABCD),SFP(E1),TMP1.4s
+    add TMP1.4s,MSG1.4s,CONST1.4s
+    sha1su1 MSG2.4s,MSG1.4s
+    sha1su0 MSG3.4s,MSG0.4s,MSG1.4s
+
+    C Rounds 16-19
+    sha1h SFP(E1),SFP(ABCD)
+    sha1c QFP(ABCD),SFP(E0),TMP0.4s
+    add TMP0.4s,MSG2.4s,CONST1.4s
+    sha1su1 MSG3.4s,MSG2.4s
+    sha1su0 MSG0.4s,MSG1.4s,MSG2.4s
+
+    C Rounds 20-23
+    sha1h SFP(E0),SFP(ABCD)
+    sha1p QFP(ABCD),SFP(E1),TMP1.4s
+    add TMP1.4s,MSG3.4s,CONST1.4s
+    sha1su1 MSG0.4s,MSG3.4s
+    sha1su0 MSG1.4s,MSG2.4s,MSG3.4s
+
+    C Rounds 24-27
+    sha1h SFP(E1),SFP(ABCD)
+    sha1p QFP(ABCD),SFP(E0),TMP0.4s
+    add TMP0.4s,MSG0.4s,CONST1.4s
+    sha1su1 MSG1.4s,MSG0.4s
+    sha1su0 MSG2.4s,MSG3.4s,MSG0.4s
+
+    C Rounds 28-31
+    sha1h SFP(E0),SFP(ABCD)
+    sha1p QFP(ABCD),SFP(E1),TMP1.4s
+    add TMP1.4s,MSG1.4s,CONST1.4s
+    sha1su1 MSG2.4s,MSG1.4s
+    sha1su0 MSG3.4s,MSG0.4s,MSG1.4s
+
+    C Rounds 32-35
+    sha1h SFP(E1),SFP(ABCD)
+    sha1p QFP(ABCD),SFP(E0),TMP0.4s
+    add TMP0.4s,MSG2.4s,CONST2.4s
+    sha1su1 MSG3.4s,MSG2.4s
+    sha1su0 MSG0.4s,MSG1.4s,MSG2.4s
+
+    C Rounds 36-39
+    sha1h SFP(E0),SFP(ABCD)
+    sha1p QFP(ABCD),SFP(E1),TMP1.4s
+    add TMP1.4s,MSG3.4s,CONST2.4s
+    sha1su1 MSG0.4s,MSG3.4s
+    sha1su0 MSG1.4s,MSG2.4s,MSG3.4s
+
+    C Rounds 40-43
+    sha1h SFP(E1),SFP(ABCD)
+    sha1m QFP(ABCD),SFP(E0),TMP0.4s
+    add TMP0.4s,MSG0.4s,CONST2.4s
+    sha1su1 MSG1.4s,MSG0.4s
+    sha1su0 MSG2.4s,MSG3.4s,MSG0.4s
+
+    C Rounds 44-47
+    sha1h SFP(E0),SFP(ABCD)
+    sha1m QFP(ABCD),SFP(E1),TMP1.4s
+    add TMP1.4s,MSG1.4s,CONST2.4s
+    sha1su1 MSG2.4s,MSG1.4s
+    sha1su0 MSG3.4s,MSG0.4s,MSG1.4s
+
+    C Rounds 48-51
+    sha1h SFP(E1),SFP(ABCD)
+    sha1m QFP(ABCD),SFP(E0),TMP0.4s
+    add TMP0.4s,MSG2.4s,CONST2.4s
+    sha1su1 MSG3.4s,MSG2.4s
+    sha1su0 MSG0.4s,MSG1.4s,MSG2.4s
+
+    C Rounds 52-55
+    sha1h SFP(E0),SFP(ABCD)
+    sha1m QFP(ABCD),SFP(E1),TMP1.4s
+    add TMP1.4s,MSG3.4s,CONST3.4s
+    sha1su1 MSG0.4s,MSG3.4s
+    sha1su0 MSG1.4s,MSG2.4s,MSG3.4s
+
+    C Rounds 56-59
+    sha1h SFP(E1),SFP(ABCD)
+    sha1m QFP(ABCD),SFP(E0),TMP0.4s
+    add TMP0.4s,MSG0.4s,CONST3.4s
+    sha1su1 MSG1.4s,MSG0.4s
+    sha1su0 MSG2.4s,MSG3.4s,MSG0.4s
+
+    C Rounds 60-63
+    sha1h SFP(E0),SFP(ABCD)
+    sha1p QFP(ABCD),SFP(E1),TMP1.4s
+    add TMP1.4s,MSG1.4s,CONST3.4s
+    sha1su1 MSG2.4s,MSG1.4s
+    sha1su0 MSG3.4s,MSG0.4s,MSG1.4s
+
+    C Rounds 64-67
+    sha1h SFP(E1),SFP(ABCD)
+    sha1p QFP(ABCD),SFP(E0),TMP0.4s
+    add TMP0.4s,MSG2.4s,CONST3.4s
+    sha1su1 MSG3.4s,MSG2.4s
+    sha1su0 MSG0.4s,MSG1.4s,MSG2.4s
+
+    C Rounds 68-71
+    sha1h SFP(E0),SFP(ABCD)
+    sha1p QFP(ABCD),SFP(E1),TMP1.4s
+    add TMP1.4s,MSG3.4s,CONST3.4s
+    sha1su1 MSG0.4s,MSG3.4s
+
+    C Rounds 72-75
+    sha1h SFP(E1),SFP(ABCD)
+    sha1p QFP(ABCD),SFP(E0),TMP0.4s
+
+    C Rounds 76-79
+    sha1h SFP(E0),SFP(ABCD)
+    sha1p QFP(ABCD),SFP(E1),TMP1.4s
+
+    C Combine state
+    add E0.4s,E0.4s,E0_SAVED.4s
+    add ABCD.4s,ABCD.4s,ABCD_SAVED.4s
+
+    C Store state
+    st1 {ABCD.4s},[STATE]
+    st1 {E0.s}[0],[x2]
+
+    ret
+EPILOGUE(nettle_sha1_compress)
diff --git a/arm64/machine.m4 b/arm64/machine.m4
index e69de29b..7df62bcc 100644
--- a/arm64/machine.m4
+++ b/arm64/machine.m4
@@ -0,0 +1,7 @@
+C Get 32-bit floating-point register from vector register
+C SFP(VR)
+define(`SFP',``s'substr($1,1,len($1))')
+
+C Get 128-bit floating-point register from vector register
+C QFP(VR)
+define(`QFP',``q'substr($1,1,len($1))')
Hi Maamoun, you added the standard GNU license to these files, but the repository you mention has no license at all (red flag), and looking at the code it points to, on which these files are "based", the current license is ASL 2.0.
How much are your patches "based" on the SHA-Intrinsic source?
The perf improvement is great btw.
Simo.
On Fri, 2021-05-14 at 08:45 +0300, Maamoun TK wrote:
On Fri, May 14, 2021 at 3:42 PM Simo Sorce simo@redhat.com wrote:
you added the standard GNU License to these files, but the repository you mention has no license at all (red flag), and looking at the code it points to on which these files are "based" the current license is ASL 2.0
How much are your patches "based" on the SHA-Intrinsic source?
I've written the patch from scratch while keeping in mind how to use the SHA-1 instructions of Arm64 crypto extension from sha1-arm.c in Jeffrey's repository. I've Cced Jeffrey in the main message to get his input on this patch.
regards, Maamoun
Maamoun TK maamoun.tk@googlemail.com writes:
I've written the patch from scratch while keeping in mind how to use the SHA-1 instructions of Arm64 crypto extension from sha1-arm.c in Jeffrey's repository.
If that is the case, avoid phrases like "based on" which are easily misread as implying it's a derived work in the copyright sense.
Regards, /Niels
On Thu, May 20, 2021 at 9:16 PM Niels Möller nisse@lysator.liu.se wrote:
Maamoun TK maamoun.tk@googlemail.com writes:
I've written the patch from scratch while keeping in mind how to use the SHA-1 instructions of Arm64 crypto extension from sha1-arm.c in Jeffrey's repository.
If that is the case, avoid phrases like "based on" which are easily misread as implying it's a derived work in the copyright sense.
I'll just mention it in the README file then.
regards, Mamone
I've mentioned it in the README file.
---
 arm64/README                   | 7 +++++++
 arm64/crypto/sha1-compress.asm | 6 ------
 2 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/arm64/README b/arm64/README
index d2745d57..206bb773 100644
--- a/arm64/README
+++ b/arm64/README
@@ -83,5 +83,12 @@
 particular care must be taken if the loaded data is then to be regarded as
 elements of e.g. a doubleword vector. Indicies may appear reversed on
 big-endian systems (because they are).

+Hardware-accelerated SHA Instructions
+
+The SHA optimized cores are implemented using SHA hashing instructions added
+to AArch64 in crypto extensions. The repository [3] illustrates using those
+instructions for optimizing SHA hashing functions.
+
 [1] https://github.com/ARM-software/abi-aa/releases/download/2020Q4/aapcs64.pdf
 [2] https://llvm.org/docs/BigEndianNEON.html
+[3] https://github.com/noloader/SHA-Intrinsics

diff --git a/arm64/crypto/sha1-compress.asm b/arm64/crypto/sha1-compress.asm
index bb3f1d35..f261c93d 100644
--- a/arm64/crypto/sha1-compress.asm
+++ b/arm64/crypto/sha1-compress.asm
@@ -3,12 +3,6 @@ C arm64/crypto/sha1-compress.asm
 ifelse(`
    Copyright (C) 2021 Mamone Tarsha

-   Based on sha1-arm.c - ARMv8 SHA extensions using C intrinsics of
-   repository https://github.com/noloader/SHA-Intrinsics
-   sha1-arm.c is written and placed in public domain by Jeffrey Walton,
-   based on code from ARM, and by Johannes Schneiders, Skip
-   Hovsmith and Barry O'Rourke for the mbedTLS project.
-
    This file is part of GNU Nettle.

    GNU Nettle is free software: you can redistribute it and/or
Maamoun TK maamoun.tk@googlemail.com writes:
Looks pretty good. A few comments and questions below.
This patch optimizes SHA1 compress function for arm64 architecture by taking advantage of SHA-1 instructions of Armv8 crypto extension. The SHA-1 instructions: SHA1C: SHA1 hash update (choose) SHA1H: SHA1 fixed rotate SHA1M: SHA1 hash update (majority) SHA1P: SHA1 hash update (parity) SHA1SU0: SHA1 schedule update 0 SHA1SU1: SHA1 schedule update 1
Can you add this brief summary of instructions as a comment in the asm file?
Benchmark on gcc117 instance of CFarm before applying the patch:

Algorithm     mode    Mbyte/s
sha1          update   214.16
openssl sha1  update   849.44
Benchmark on gcc117 instance of CFarm after applying the patch:

Algorithm     mode    Mbyte/s
sha1          update   795.57
openssl sha1  update   849.25
Great speedup! Any idea why openssl is still slightly faster?
+define(`TMP0', `v21') +define(`TMP1', `v22')
Not sure I understand how these are used, but it looks like the TMP variables are used in some way for the message expansion state? E.g., TMP0 is assigned in the code for rounds 0-3, and this value is used in the code for rounds 8-11. Other implementations don't need extra state for this, but just modify the 16 message words in-place.
It would be nice to either make the TMP registers more temporary (i.e., no round depends on the value in these registers from previous rounds) and keep needed state only on the MSG variables. Or rename them to give a better hint on how they're used.
+C void nettle_sha1_compress(uint32_t *state, const uint8_t *input)
+PROLOGUE(nettle_sha1_compress)
- C Initialize constants
- mov w2,#0x7999
- movk w2,#0x5A82,lsl #16
- dup CONST0.4s,w2
- mov w2,#0xEBA1
- movk w2,#0x6ED9,lsl #16
- dup CONST1.4s,w2
- mov w2,#0xBCDC
- movk w2,#0x8F1B,lsl #16
- dup CONST2.4s,w2
- mov w2,#0xC1D6
- movk w2,#0xCA62,lsl #16
- dup CONST3.4s,w2
Maybe it would be clearer or more efficient to load these from memory? Not sure if there's a nice and concise way to load the four 32-bit values into a 128-bit register, and then copy/duplicate them into the four const registers.
- C Load message
- ld1 {MSG0.16b,MSG1.16b,MSG2.16b,MSG3.16b},[INPUT]
- C Reverse for little endian
- rev32 MSG0.16b,MSG0.16b
- rev32 MSG1.16b,MSG1.16b
- rev32 MSG2.16b,MSG2.16b
- rev32 MSG3.16b,MSG3.16b
How does this work on big-endian? The ld1 with .16b is endian-neutral (according to the README), that means we always get the wrong order, and then we do unconditional byteswapping? Maybe add a comment. Not sure if it's worth the effort to make it work differently (ld1 .4w on big-endian)? It's going to be a pretty small fraction of the per-block processing.
Regards, /Niels
On Sun, May 23, 2021 at 10:52 AM Niels Möller nisse@lysator.liu.se wrote:
Maamoun TK maamoun.tk@googlemail.com writes:
This patch optimizes SHA1 compress function for arm64 architecture by taking advantage of SHA-1 instructions of Armv8 crypto extension. The SHA-1 instructions: SHA1C: SHA1 hash update (choose) SHA1H: SHA1 fixed rotate SHA1M: SHA1 hash update (majority) SHA1P: SHA1 hash update (parity) SHA1SU0: SHA1 schedule update 0 SHA1SU1: SHA1 schedule update 1
Can you add this brief summary of instructions as a comment in the asm file?
Done! I'll attach a patch at the end of the message that performs slightly better as well.
Algorithm     mode        Mbyte/s
sha1          update       800.80
openssl sha1  update       849.17
hmac-sha1     64 bytes     166.10
hmac-sha1     256 bytes    409.24
hmac-sha1     1024 bytes   636.98
hmac-sha1     4096 bytes   739.20
hmac-sha1     single msg   775.67
Benchmark on gcc117 instance of CFarm before applying the patch:
Algorithm     mode    Mbyte/s
sha1          update   214.16
openssl sha1  update   849.44
Benchmark on gcc117 instance of CFarm after applying the patch:

Algorithm     mode    Mbyte/s
sha1          update   795.57
openssl sha1  update   849.25
Great speedup! Any idea why openssl is still slightly faster?
Sure, the OpenSSL implementation uses a loop inside the SHA1 update function, which eliminates the constant initialization and state loading/storing for each block, while nettle does that for every block iteration.
+define(`TMP0', `v21') +define(`TMP1', `v22')
Not sure I understand how these are used, but it looks like the TMP variables are used in some way for the message expansion state? E.g., TMP0 assigned in the code for rounds 0-3, and this value used in the code for rounds 8-11. Other implementations don't need extra state for this, but just modifies the 16 message words in-place.
Modifying the message words in-place will change the value used by the 'sha1su0' and 'sha1su1' instructions. According to the ARM® A64 Instruction Set Architecture:

SHA1SU0 <Vd>.4S, <Vn>.4S, <Vm>.4S
    <Vd> Is the name of the SIMD&FP source and destination register
    ...

SHA1SU1 <Vd>.4S, <Vn>.4S
    <Vd> Is the name of the SIMD&FP source and destination register
    ...

So using a TMP variable is necessary here. I can't think of any replacement; let me know how the other implementations handle this case.
It would be nice to either make the TMP registers more temporary (i.e.,
no round depends on the value in these registers from previous rounds) and keep needed state only on the MSG variables. Or rename them to give a better hint on how they're used.
Done! Yields a slight performance increase, btw.
+C void nettle_sha1_compress(uint32_t *state, const uint8_t *input)
+PROLOGUE(nettle_sha1_compress)
- C Initialize constants
- mov w2,#0x7999
- movk w2,#0x5A82,lsl #16
- dup CONST0.4s,w2
- mov w2,#0xEBA1
- movk w2,#0x6ED9,lsl #16
- dup CONST1.4s,w2
- mov w2,#0xBCDC
- movk w2,#0x8F1B,lsl #16
- dup CONST2.4s,w2
- mov w2,#0xC1D6
- movk w2,#0xCA62,lsl #16
- dup CONST3.4s,w2
Maybe it would be clearer or more efficient to load these from memory? Not sure if there's a nice and concise way to load the four 32-bit values into a 128-bit register, and then copy/duplicate them into the four const registers.
We can load all the constants (including duplicate values) from memory with one instruction. The issue is how to get the data address properly for every supported ABI! So far I've seen solutions with multiple paths for different ABIs, which I don't really like; the easiest solution is to define the data in the .text section to make sure the address is near enough to be loaded with certain instruction. Do you want to do that?
- C Load message
- ld1 {MSG0.16b,MSG1.16b,MSG2.16b,MSG3.16b},[INPUT]
- C Reverse for little endian
- rev32 MSG0.16b,MSG0.16b
- rev32 MSG1.16b,MSG1.16b
- rev32 MSG2.16b,MSG2.16b
- rev32 MSG3.16b,MSG3.16b
How does this work on big-endian? The ld1 with .16b is endian-neutral (according to the README), that means we always get the wrong order, and then we do unconditional byteswapping? Maybe add a comment. Not sure if it's worth the effort to make it work differently (ld1 .4w on big-endian)? It's going to be a pretty small fraction of the per-block processing.
We had an intensive discussion about that in the GCM patch. The short story: this patch should work well in both endianness modes. However, it's not the same way we used in the GCM patch to handle the endianness variation; to follow the GCM patch's way we can do:
C Load message
ld1 {MSG0.4s,MSG1.4s,MSG2.4s,MSG3.4s},[INPUT]

C Reverse for little endian
IF_LE(`
    rev32 MSG0.16b,MSG0.16b
    rev32 MSG1.16b,MSG1.16b
    rev32 MSG2.16b,MSG2.16b
    rev32 MSG3.16b,MSG3.16b
')
regards, Mamone
---
 arm64/crypto/sha1-compress.asm | 93 +++++++++++++++++++++++-------------------
 1 file changed, 50 insertions(+), 43 deletions(-)

diff --git a/arm64/crypto/sha1-compress.asm b/arm64/crypto/sha1-compress.asm
index f261c93d..9f7d9f37 100644
--- a/arm64/crypto/sha1-compress.asm
+++ b/arm64/crypto/sha1-compress.asm
@@ -30,6 +30,15 @@ ifelse(`
    not, see http://www.gnu.org/licenses/.
 ')

+C This implementation uses the SHA-1 instructions of Armv8 crypto
+C extension.
+C SHA1C: SHA1 hash update (choose)
+C SHA1H: SHA1 fixed rotate
+C SHA1M: SHA1 hash update (majority)
+C SHA1P: SHA1 hash update (parity)
+C SHA1SU0: SHA1 schedule update 0
+C SHA1SU1: SHA1 schedule update 1
+
 .file "sha1-compress.asm"
 .arch armv8-a+crypto

@@ -53,8 +62,7 @@ define(`ABCD_SAVED', `v17')
 define(`E0', `v18')
 define(`E0_SAVED', `v19')
 define(`E1', `v20')
-define(`TMP0', `v21')
-define(`TMP1', `v22')
+define(`TMP', `v21')

 C void nettle_sha1_compress(uint32_t *state, const uint8_t *input)

@@ -92,140 +100,139 @@ PROLOGUE(nettle_sha1_compress)
     rev32 MSG2.16b,MSG2.16b
     rev32 MSG3.16b,MSG3.16b

-    add TMP0.4s,MSG0.4s,CONST0.4s
-    add TMP1.4s,MSG1.4s,CONST0.4s
-
     C Rounds 0-3
+    add TMP.4s,MSG0.4s,CONST0.4s
     sha1h SFP(E1),SFP(ABCD)
-    sha1c QFP(ABCD),SFP(E0),TMP0.4s
-    add TMP0.4s,MSG2.4s,CONST0.4s
+    sha1c QFP(ABCD),SFP(E0),TMP.4s
     sha1su0 MSG0.4s,MSG1.4s,MSG2.4s

     C Rounds 4-7
+    add TMP.4s,MSG1.4s,CONST0.4s
     sha1h SFP(E0),SFP(ABCD)
-    sha1c QFP(ABCD),SFP(E1),TMP1.4s
-    add TMP1.4s,MSG3.4s,CONST0.4s
+    sha1c QFP(ABCD),SFP(E1),TMP.4s
     sha1su1 MSG0.4s,MSG3.4s
     sha1su0 MSG1.4s,MSG2.4s,MSG3.4s

     C Rounds 8-11
+    add TMP.4s,MSG2.4s,CONST0.4s
     sha1h SFP(E1),SFP(ABCD)
-    sha1c QFP(ABCD),SFP(E0),TMP0.4s
-    add TMP0.4s,MSG0.4s,CONST0.4s
+    sha1c QFP(ABCD),SFP(E0),TMP.4s
     sha1su1 MSG1.4s,MSG0.4s
     sha1su0 MSG2.4s,MSG3.4s,MSG0.4s

     C Rounds 12-15
+    add TMP.4s,MSG3.4s,CONST0.4s
     sha1h SFP(E0),SFP(ABCD)
-    sha1c QFP(ABCD),SFP(E1),TMP1.4s
-    add TMP1.4s,MSG1.4s,CONST1.4s
+    sha1c QFP(ABCD),SFP(E1),TMP.4s
     sha1su1 MSG2.4s,MSG1.4s
     sha1su0 MSG3.4s,MSG0.4s,MSG1.4s

     C Rounds 16-19
+    add TMP.4s,MSG0.4s,CONST0.4s
     sha1h SFP(E1),SFP(ABCD)
-    sha1c QFP(ABCD),SFP(E0),TMP0.4s
-    add TMP0.4s,MSG2.4s,CONST1.4s
+    sha1c QFP(ABCD),SFP(E0),TMP.4s
     sha1su1 MSG3.4s,MSG2.4s
     sha1su0 MSG0.4s,MSG1.4s,MSG2.4s

     C Rounds 20-23
+    add TMP.4s,MSG1.4s,CONST1.4s
     sha1h SFP(E0),SFP(ABCD)
-    sha1p QFP(ABCD),SFP(E1),TMP1.4s
-    add TMP1.4s,MSG3.4s,CONST1.4s
+    sha1p QFP(ABCD),SFP(E1),TMP.4s
     sha1su1 MSG0.4s,MSG3.4s
     sha1su0 MSG1.4s,MSG2.4s,MSG3.4s

     C Rounds 24-27
+    add TMP.4s,MSG2.4s,CONST1.4s
     sha1h SFP(E1),SFP(ABCD)
-    sha1p QFP(ABCD),SFP(E0),TMP0.4s
-    add TMP0.4s,MSG0.4s,CONST1.4s
+    sha1p QFP(ABCD),SFP(E0),TMP.4s
     sha1su1 MSG1.4s,MSG0.4s
     sha1su0 MSG2.4s,MSG3.4s,MSG0.4s

     C Rounds 28-31
+    add TMP.4s,MSG3.4s,CONST1.4s
     sha1h SFP(E0),SFP(ABCD)
-    sha1p QFP(ABCD),SFP(E1),TMP1.4s
-    add TMP1.4s,MSG1.4s,CONST1.4s
+    sha1p QFP(ABCD),SFP(E1),TMP.4s
     sha1su1 MSG2.4s,MSG1.4s
     sha1su0 MSG3.4s,MSG0.4s,MSG1.4s

     C Rounds 32-35
+    add TMP.4s,MSG0.4s,CONST1.4s
     sha1h SFP(E1),SFP(ABCD)
-    sha1p QFP(ABCD),SFP(E0),TMP0.4s
-    add TMP0.4s,MSG2.4s,CONST2.4s
+    sha1p QFP(ABCD),SFP(E0),TMP.4s
     sha1su1 MSG3.4s,MSG2.4s
     sha1su0 MSG0.4s,MSG1.4s,MSG2.4s

     C Rounds 36-39
+    add TMP.4s,MSG1.4s,CONST1.4s
     sha1h SFP(E0),SFP(ABCD)
-    sha1p QFP(ABCD),SFP(E1),TMP1.4s
-    add TMP1.4s,MSG3.4s,CONST2.4s
+    sha1p QFP(ABCD),SFP(E1),TMP.4s
     sha1su1 MSG0.4s,MSG3.4s
     sha1su0 MSG1.4s,MSG2.4s,MSG3.4s

     C Rounds 40-43
+    add TMP.4s,MSG2.4s,CONST2.4s
     sha1h SFP(E1),SFP(ABCD)
-    sha1m QFP(ABCD),SFP(E0),TMP0.4s
-    add TMP0.4s,MSG0.4s,CONST2.4s
+    sha1m QFP(ABCD),SFP(E0),TMP.4s
     sha1su1 MSG1.4s,MSG0.4s
     sha1su0 MSG2.4s,MSG3.4s,MSG0.4s

     C Rounds 44-47
+    add TMP.4s,MSG3.4s,CONST2.4s
     sha1h SFP(E0),SFP(ABCD)
-    sha1m QFP(ABCD),SFP(E1),TMP1.4s
-    add TMP1.4s,MSG1.4s,CONST2.4s
+    sha1m QFP(ABCD),SFP(E1),TMP.4s
     sha1su1 MSG2.4s,MSG1.4s
     sha1su0 MSG3.4s,MSG0.4s,MSG1.4s

     C Rounds 48-51
+    add TMP.4s,MSG0.4s,CONST2.4s
     sha1h SFP(E1),SFP(ABCD)
-    sha1m QFP(ABCD),SFP(E0),TMP0.4s
-    add TMP0.4s,MSG2.4s,CONST2.4s
+    sha1m QFP(ABCD),SFP(E0),TMP.4s
     sha1su1 MSG3.4s,MSG2.4s
     sha1su0 MSG0.4s,MSG1.4s,MSG2.4s

     C Rounds 52-55
+    add TMP.4s,MSG1.4s,CONST2.4s
     sha1h SFP(E0),SFP(ABCD)
-    sha1m QFP(ABCD),SFP(E1),TMP1.4s
-    add TMP1.4s,MSG3.4s,CONST3.4s
+    sha1m QFP(ABCD),SFP(E1),TMP.4s
     sha1su1 MSG0.4s,MSG3.4s
     sha1su0 MSG1.4s,MSG2.4s,MSG3.4s

     C Rounds 56-59
+    add TMP.4s,MSG2.4s,CONST2.4s
     sha1h SFP(E1),SFP(ABCD)
-    sha1m QFP(ABCD),SFP(E0),TMP0.4s
-    add TMP0.4s,MSG0.4s,CONST3.4s
+    sha1m QFP(ABCD),SFP(E0),TMP.4s
     sha1su1 MSG1.4s,MSG0.4s
     sha1su0 MSG2.4s,MSG3.4s,MSG0.4s

     C Rounds 60-63
+    add TMP.4s,MSG3.4s,CONST3.4s
     sha1h SFP(E0),SFP(ABCD)
-    sha1p QFP(ABCD),SFP(E1),TMP1.4s
-    add TMP1.4s,MSG1.4s,CONST3.4s
+    sha1p QFP(ABCD),SFP(E1),TMP.4s
     sha1su1 MSG2.4s,MSG1.4s
     sha1su0 MSG3.4s,MSG0.4s,MSG1.4s

     C Rounds 64-67
+    add TMP.4s,MSG0.4s,CONST3.4s
     sha1h SFP(E1),SFP(ABCD)
-    sha1p QFP(ABCD),SFP(E0),TMP0.4s
-    add TMP0.4s,MSG2.4s,CONST3.4s
+    sha1p QFP(ABCD),SFP(E0),TMP.4s
     sha1su1 MSG3.4s,MSG2.4s
     sha1su0 MSG0.4s,MSG1.4s,MSG2.4s

     C Rounds 68-71
+    add TMP.4s,MSG1.4s,CONST3.4s
     sha1h SFP(E0),SFP(ABCD)
-    sha1p QFP(ABCD),SFP(E1),TMP1.4s
-    add TMP1.4s,MSG3.4s,CONST3.4s
+    sha1p QFP(ABCD),SFP(E1),TMP.4s
     sha1su1 MSG0.4s,MSG3.4s

     C Rounds 72-75
+    add TMP.4s,MSG2.4s,CONST3.4s
     sha1h SFP(E1),SFP(ABCD)
-    sha1p QFP(ABCD),SFP(E0),TMP0.4s
+    sha1p QFP(ABCD),SFP(E0),TMP.4s

     C Rounds 76-79
+    add TMP.4s,MSG3.4s,CONST3.4s
     sha1h SFP(E0),SFP(ABCD)
-    sha1p QFP(ABCD),SFP(E1),TMP1.4s
+    sha1p QFP(ABCD),SFP(E1),TMP.4s

     C Combine state
     add E0.4s,E0.4s,E0_SAVED.4s
Maamoun TK maamoun.tk@googlemail.com writes:
Great speedup! Any idea why openssl is still slightly faster?
Sure, the OpenSSL implementation uses a loop inside the SHA1 update function, which eliminates the constant initialization and state loading/storing for each block, while nettle does that for every block iteration.
I see, that can make a difference if the actual compressing is fast enough.
Modifying the message words in-place will change the value used by the 'sha1su0' and 'sha1su1' instructions. According to the ARM® A64 Instruction Set Architecture:

SHA1SU0 <Vd>.4S, <Vn>.4S, <Vm>.4S
    <Vd> Is the name of the SIMD&FP source and destination register
    ...

SHA1SU1 <Vd>.4S, <Vn>.4S
    <Vd> Is the name of the SIMD&FP source and destination register
    ...

So using a TMP variable is necessary here. I can't think of any replacement; let me know how the other implementations handle this case.
I'm afraid I have no concrete suggestion, I would need to read up on the aarch64 instructions. Implementations that do only a single round at a time (e.g., the C implementation) uses a 16-word circular buffer for the message expansion state, and updates one of the words per round. If I read the latest patch correctly, you also don't keep any state besides the MSGx registers?
It would be nice to either make the TMP registers more temporary (i.e.,
no round depends on the value in these registers from previous rounds) and keep needed state only on the MSG variables. Or rename them to give a better hint on how they're used.
Done! Yield a slight performance increase btw.
Nice.
We can load all the constants (including duplicate values) from memory with one instruction. The issue is how to get the data address properly for every supported abi!
the easiest solution is to define the data in the .text section to make sure the address is near enough to be loaded with certain instruction. Do you want to do that?
Using .text would probably work, even if it's in some sense more correct to put the constants in rodata segment. But let's leave as is for now.
We have an intensive discussion about that in the GCM patch. The short story, this patch should work well for both endianness modes.
Sounds good.
I've pushed the combined patches to a branch arm64-sha1. Would you like to update the fat build setup, before merging to master?
Regards, /Niels
On Tue, Jun 1, 2021 at 8:02 PM Niels Möller nisse@lysator.liu.se wrote:
Maamoun TK maamoun.tk@googlemail.com writes: If I read the latest patch correctly, you also don't keep any state besides the MSGx registers?
Right, everything is done within the context of each round in the latest patch; nothing is kept beyond it.
the easiest solution is to define the data in the .text section to make sure the address is near enough to be loaded with certain instruction. Do you want to do that?
Using .text would probably work, even if it's in some sense more correct to put the constants in rodata segment. But let's leave as is for now.
I agree, it's acceptable to keep it as is for this case. I'm a little concerned about handling the constant initialization in more complicated cases; we'll discuss it when we get there.
I've pushed the combined patches to a branch arm64-sha1. Would you like to update the fat build setup, before merging to master?
Sure, I just need some time as I have some stuff to sort out before doing the fat build for this patch.
regards, Mamone
nettle-bugs@lists.lysator.liu.se