On Sun, May 23, 2021 at 10:52 AM Niels Möller nisse@lysator.liu.se wrote:
Maamoun TK maamoun.tk@googlemail.com writes:
This patch optimizes the SHA-1 compress function for the arm64 architecture by taking advantage of the SHA-1 instructions of the Armv8 crypto extension. The SHA-1 instructions:
SHA1C: SHA1 hash update (choose)
SHA1H: SHA1 fixed rotate
SHA1M: SHA1 hash update (majority)
SHA1P: SHA1 hash update (parity)
SHA1SU0: SHA1 schedule update 0
SHA1SU1: SHA1 schedule update 1
Can you add this brief summary of instructions as a comment in the asm file?
Done! I'll attach a patch at the end of the message that performs slightly better as well.
Algorithm         mode            Mbyte/s
sha1              update          800.80
openssl sha1      update          849.17
hmac-sha1         64 bytes        166.10
hmac-sha1         256 bytes       409.24
hmac-sha1         1024 bytes      636.98
hmac-sha1         4096 bytes      739.20
hmac-sha1         single msg      775.67
Benchmark on gcc117 instance of CFarm before applying the patch:
Algorithm         mode            Mbyte/s
sha1              update          214.16
openssl sha1      update          849.44
Benchmark on gcc117 instance of CFarm after applying the patch:

Algorithm         mode            Mbyte/s
sha1              update          795.57
openssl sha1      update          849.25
Great speedup! Any idea why openssl is still slightly faster?
Sure. The OpenSSL implementation uses a loop inside the SHA1 update function, which eliminates the constant initialization and the state loading/storing for each block, while nettle does that for every block iteration.
+define(`TMP0', `v21') +define(`TMP1', `v22')
Not sure I understand how these are used, but it looks like the TMP variables are used in some way for the message expansion state? E.g., TMP0 is assigned in the code for rounds 0-3, and this value is used in the code for rounds 8-11. Other implementations don't need extra state for this, but just modify the 16 message words in-place.
Modifying the message words in-place will change the value used by the 'sha1su0' and 'sha1su1' instructions. According to the ARM® A64 Instruction Set Architecture:

SHA1SU0 <Vd>.4S, <Vn>.4S, <Vm>.4S
    <Vd> is the name of the SIMD&FP source and destination register

SHA1SU1 <Vd>.4S, <Vn>.4S
    <Vd> is the name of the SIMD&FP source and destination register
So using a TMP variable is necessary here. I can't think of any replacement; let me know how the other implementations handle this case.
It would be nice to either make the TMP registers more temporary (i.e.,
no round depends on the value in these registers from previous rounds), keeping needed state only in the MSG variables, or rename them to give a better hint on how they're used.
Done! Yields a slight performance increase, btw.
+C void nettle_sha1_compress(uint32_t *state, const uint8_t *input)
+PROLOGUE(nettle_sha1_compress)
- C Initialize constants
- mov w2,#0x7999
- movk w2,#0x5A82,lsl #16
- dup CONST0.4s,w2
- mov w2,#0xEBA1
- movk w2,#0x6ED9,lsl #16
- dup CONST1.4s,w2
- mov w2,#0xBCDC
- movk w2,#0x8F1B,lsl #16
- dup CONST2.4s,w2
- mov w2,#0xC1D6
- movk w2,#0xCA62,lsl #16
- dup CONST3.4s,w2
Maybe it would be clearer or more efficient to load these from memory? Not sure if there's a nice and concise way to load the four 32-bit values into a 128-bit register, and then copy/duplicate them into the four const registers.
We can load all the constants (including duplicate values) from memory with one instruction. The issue is how to get the data address properly for every supported ABI! So far I've seen solutions with multiple paths for different ABIs, which I don't really like; the easiest solution is to define the data in the .text section to make sure the address is near enough to be loaded with one instruction. Do you want me to do that?
- C Load message
- ld1 {MSG0.16b,MSG1.16b,MSG2.16b,MSG3.16b},[INPUT]
- C Reverse for little endian
- rev32 MSG0.16b,MSG0.16b
- rev32 MSG1.16b,MSG1.16b
- rev32 MSG2.16b,MSG2.16b
- rev32 MSG3.16b,MSG3.16b
How does this work on big-endian? The ld1 with .16b is endian-neutral (according to the README), which means we always get the wrong order there, and then we do unconditional byteswapping? Maybe add a comment. Not sure if it's worth the effort to make it work differently (ld1 .4s on big-endian)? It's going to be a pretty small fraction of the per-block processing.
We had an intensive discussion about that in the GCM patch. The short story: this patch should work well for both endianness modes. However, it's not the same way the GCM patch handles the endianness variation; to follow the GCM patch's way we can do:
	C Load message
	ld1	{MSG0.4s,MSG1.4s,MSG2.4s,MSG3.4s},[INPUT]

	C Reverse for little endian
IF_LE(`
	rev32	MSG0.16b,MSG0.16b
	rev32	MSG1.16b,MSG1.16b
	rev32	MSG2.16b,MSG2.16b
	rev32	MSG3.16b,MSG3.16b
')
regards, Mamone
---
 arm64/crypto/sha1-compress.asm | 93 +++++++++++++++++++++++-------------------
 1 file changed, 50 insertions(+), 43 deletions(-)

diff --git a/arm64/crypto/sha1-compress.asm b/arm64/crypto/sha1-compress.asm
index f261c93d..9f7d9f37 100644
--- a/arm64/crypto/sha1-compress.asm
+++ b/arm64/crypto/sha1-compress.asm
@@ -30,6 +30,15 @@ ifelse(`
    not, see http://www.gnu.org/licenses/.
')

+C This implementation uses the SHA-1 instructions of Armv8 crypto
+C extension.
+C SHA1C: SHA1 hash update (choose)
+C SHA1H: SHA1 fixed rotate
+C SHA1M: SHA1 hash update (majority)
+C SHA1P: SHA1 hash update (parity)
+C SHA1SU0: SHA1 schedule update 0
+C SHA1SU1: SHA1 schedule update 1
+
 .file "sha1-compress.asm"
 .arch armv8-a+crypto

@@ -53,8 +62,7 @@ define(`ABCD_SAVED', `v17')
 define(`E0', `v18')
 define(`E0_SAVED', `v19')
 define(`E1', `v20')
-define(`TMP0', `v21')
-define(`TMP1', `v22')
+define(`TMP', `v21')

 C void nettle_sha1_compress(uint32_t *state, const uint8_t *input)

@@ -92,140 +100,139 @@ PROLOGUE(nettle_sha1_compress)
 	rev32	MSG2.16b,MSG2.16b
 	rev32	MSG3.16b,MSG3.16b

-	add	TMP0.4s,MSG0.4s,CONST0.4s
-	add	TMP1.4s,MSG1.4s,CONST0.4s
-
 	C Rounds 0-3
+	add	TMP.4s,MSG0.4s,CONST0.4s
 	sha1h	SFP(E1),SFP(ABCD)
-	sha1c	QFP(ABCD),SFP(E0),TMP0.4s
-	add	TMP0.4s,MSG2.4s,CONST0.4s
+	sha1c	QFP(ABCD),SFP(E0),TMP.4s
 	sha1su0	MSG0.4s,MSG1.4s,MSG2.4s

 	C Rounds 4-7
+	add	TMP.4s,MSG1.4s,CONST0.4s
 	sha1h	SFP(E0),SFP(ABCD)
-	sha1c	QFP(ABCD),SFP(E1),TMP1.4s
-	add	TMP1.4s,MSG3.4s,CONST0.4s
+	sha1c	QFP(ABCD),SFP(E1),TMP.4s
 	sha1su1	MSG0.4s,MSG3.4s
 	sha1su0	MSG1.4s,MSG2.4s,MSG3.4s

 	C Rounds 8-11
+	add	TMP.4s,MSG2.4s,CONST0.4s
 	sha1h	SFP(E1),SFP(ABCD)
-	sha1c	QFP(ABCD),SFP(E0),TMP0.4s
-	add	TMP0.4s,MSG0.4s,CONST0.4s
+	sha1c	QFP(ABCD),SFP(E0),TMP.4s
 	sha1su1	MSG1.4s,MSG0.4s
 	sha1su0	MSG2.4s,MSG3.4s,MSG0.4s

 	C Rounds 12-15
+	add	TMP.4s,MSG3.4s,CONST0.4s
 	sha1h	SFP(E0),SFP(ABCD)
-	sha1c	QFP(ABCD),SFP(E1),TMP1.4s
-	add	TMP1.4s,MSG1.4s,CONST1.4s
+	sha1c	QFP(ABCD),SFP(E1),TMP.4s
 	sha1su1	MSG2.4s,MSG1.4s
 	sha1su0	MSG3.4s,MSG0.4s,MSG1.4s

 	C Rounds 16-19
+	add	TMP.4s,MSG0.4s,CONST0.4s
 	sha1h	SFP(E1),SFP(ABCD)
-	sha1c	QFP(ABCD),SFP(E0),TMP0.4s
-	add	TMP0.4s,MSG2.4s,CONST1.4s
+	sha1c	QFP(ABCD),SFP(E0),TMP.4s
 	sha1su1	MSG3.4s,MSG2.4s
 	sha1su0	MSG0.4s,MSG1.4s,MSG2.4s

 	C Rounds 20-23
+	add	TMP.4s,MSG1.4s,CONST1.4s
 	sha1h	SFP(E0),SFP(ABCD)
-	sha1p	QFP(ABCD),SFP(E1),TMP1.4s
-	add	TMP1.4s,MSG3.4s,CONST1.4s
+	sha1p	QFP(ABCD),SFP(E1),TMP.4s
 	sha1su1	MSG0.4s,MSG3.4s
 	sha1su0	MSG1.4s,MSG2.4s,MSG3.4s

 	C Rounds 24-27
+	add	TMP.4s,MSG2.4s,CONST1.4s
 	sha1h	SFP(E1),SFP(ABCD)
-	sha1p	QFP(ABCD),SFP(E0),TMP0.4s
-	add	TMP0.4s,MSG0.4s,CONST1.4s
+	sha1p	QFP(ABCD),SFP(E0),TMP.4s
 	sha1su1	MSG1.4s,MSG0.4s
 	sha1su0	MSG2.4s,MSG3.4s,MSG0.4s

 	C Rounds 28-31
+	add	TMP.4s,MSG3.4s,CONST1.4s
 	sha1h	SFP(E0),SFP(ABCD)
-	sha1p	QFP(ABCD),SFP(E1),TMP1.4s
-	add	TMP1.4s,MSG1.4s,CONST1.4s
+	sha1p	QFP(ABCD),SFP(E1),TMP.4s
 	sha1su1	MSG2.4s,MSG1.4s
 	sha1su0	MSG3.4s,MSG0.4s,MSG1.4s

 	C Rounds 32-35
+	add	TMP.4s,MSG0.4s,CONST1.4s
 	sha1h	SFP(E1),SFP(ABCD)
-	sha1p	QFP(ABCD),SFP(E0),TMP0.4s
-	add	TMP0.4s,MSG2.4s,CONST2.4s
+	sha1p	QFP(ABCD),SFP(E0),TMP.4s
 	sha1su1	MSG3.4s,MSG2.4s
 	sha1su0	MSG0.4s,MSG1.4s,MSG2.4s

 	C Rounds 36-39
+	add	TMP.4s,MSG1.4s,CONST1.4s
 	sha1h	SFP(E0),SFP(ABCD)
-	sha1p	QFP(ABCD),SFP(E1),TMP1.4s
-	add	TMP1.4s,MSG3.4s,CONST2.4s
+	sha1p	QFP(ABCD),SFP(E1),TMP.4s
 	sha1su1	MSG0.4s,MSG3.4s
 	sha1su0	MSG1.4s,MSG2.4s,MSG3.4s

 	C Rounds 40-43
+	add	TMP.4s,MSG2.4s,CONST2.4s
 	sha1h	SFP(E1),SFP(ABCD)
-	sha1m	QFP(ABCD),SFP(E0),TMP0.4s
-	add	TMP0.4s,MSG0.4s,CONST2.4s
+	sha1m	QFP(ABCD),SFP(E0),TMP.4s
 	sha1su1	MSG1.4s,MSG0.4s
 	sha1su0	MSG2.4s,MSG3.4s,MSG0.4s

 	C Rounds 44-47
+	add	TMP.4s,MSG3.4s,CONST2.4s
 	sha1h	SFP(E0),SFP(ABCD)
-	sha1m	QFP(ABCD),SFP(E1),TMP1.4s
-	add	TMP1.4s,MSG1.4s,CONST2.4s
+	sha1m	QFP(ABCD),SFP(E1),TMP.4s
 	sha1su1	MSG2.4s,MSG1.4s
 	sha1su0	MSG3.4s,MSG0.4s,MSG1.4s

 	C Rounds 48-51
+	add	TMP.4s,MSG0.4s,CONST2.4s
 	sha1h	SFP(E1),SFP(ABCD)
-	sha1m	QFP(ABCD),SFP(E0),TMP0.4s
-	add	TMP0.4s,MSG2.4s,CONST2.4s
+	sha1m	QFP(ABCD),SFP(E0),TMP.4s
 	sha1su1	MSG3.4s,MSG2.4s
 	sha1su0	MSG0.4s,MSG1.4s,MSG2.4s

 	C Rounds 52-55
+	add	TMP.4s,MSG1.4s,CONST2.4s
 	sha1h	SFP(E0),SFP(ABCD)
-	sha1m	QFP(ABCD),SFP(E1),TMP1.4s
-	add	TMP1.4s,MSG3.4s,CONST3.4s
+	sha1m	QFP(ABCD),SFP(E1),TMP.4s
 	sha1su1	MSG0.4s,MSG3.4s
 	sha1su0	MSG1.4s,MSG2.4s,MSG3.4s

 	C Rounds 56-59
+	add	TMP.4s,MSG2.4s,CONST2.4s
 	sha1h	SFP(E1),SFP(ABCD)
-	sha1m	QFP(ABCD),SFP(E0),TMP0.4s
-	add	TMP0.4s,MSG0.4s,CONST3.4s
+	sha1m	QFP(ABCD),SFP(E0),TMP.4s
 	sha1su1	MSG1.4s,MSG0.4s
 	sha1su0	MSG2.4s,MSG3.4s,MSG0.4s

 	C Rounds 60-63
+	add	TMP.4s,MSG3.4s,CONST3.4s
 	sha1h	SFP(E0),SFP(ABCD)
-	sha1p	QFP(ABCD),SFP(E1),TMP1.4s
-	add	TMP1.4s,MSG1.4s,CONST3.4s
+	sha1p	QFP(ABCD),SFP(E1),TMP.4s
 	sha1su1	MSG2.4s,MSG1.4s
 	sha1su0	MSG3.4s,MSG0.4s,MSG1.4s

 	C Rounds 64-67
+	add	TMP.4s,MSG0.4s,CONST3.4s
 	sha1h	SFP(E1),SFP(ABCD)
-	sha1p	QFP(ABCD),SFP(E0),TMP0.4s
-	add	TMP0.4s,MSG2.4s,CONST3.4s
+	sha1p	QFP(ABCD),SFP(E0),TMP.4s
 	sha1su1	MSG3.4s,MSG2.4s
 	sha1su0	MSG0.4s,MSG1.4s,MSG2.4s

 	C Rounds 68-71
+	add	TMP.4s,MSG1.4s,CONST3.4s
 	sha1h	SFP(E0),SFP(ABCD)
-	sha1p	QFP(ABCD),SFP(E1),TMP1.4s
-	add	TMP1.4s,MSG3.4s,CONST3.4s
+	sha1p	QFP(ABCD),SFP(E1),TMP.4s
 	sha1su1	MSG0.4s,MSG3.4s

 	C Rounds 72-75
+	add	TMP.4s,MSG2.4s,CONST3.4s
 	sha1h	SFP(E1),SFP(ABCD)
-	sha1p	QFP(ABCD),SFP(E0),TMP0.4s
+	sha1p	QFP(ABCD),SFP(E0),TMP.4s

 	C Rounds 76-79
+	add	TMP.4s,MSG3.4s,CONST3.4s
 	sha1h	SFP(E0),SFP(ABCD)
-	sha1p	QFP(ABCD),SFP(E1),TMP1.4s
+	sha1p	QFP(ABCD),SFP(E1),TMP.4s

 	C Combine state
 	add	E0.4s,E0.4s,E0_SAVED.4s