Maamoun TK maamoun.tk@googlemail.com writes:
Looks pretty good. A few comments and questions below.
This patch optimizes SHA1 compress function for arm64 architecture by taking advantage of SHA-1 instructions of Armv8 crypto extension. The SHA-1 instructions:
SHA1C: SHA1 hash update (choose)
SHA1H: SHA1 fixed rotate
SHA1M: SHA1 hash update (majority)
SHA1P: SHA1 hash update (parity)
SHA1SU0: SHA1 schedule update 0
SHA1SU1: SHA1 schedule update 1
Can you add this brief summary of instructions as a comment in the asm file?
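For reference, a minimal scalar sketch (my own helper names, not code from the patch) of the SHA-1 round functions that SHA1C, SHA1P and SHA1M evaluate four rounds of at a time, and the fixed rotate done by SHA1H:

```c
#include <stdint.h>

/* Round 0-19 function, computed by SHA1C ("choose"). */
static uint32_t f_choose(uint32_t b, uint32_t c, uint32_t d)
{
  return (b & c) | (~b & d);
}
/* Round 20-39 and 60-79 function, computed by SHA1P ("parity"). */
static uint32_t f_parity(uint32_t b, uint32_t c, uint32_t d)
{
  return b ^ c ^ d;
}
/* Round 40-59 function, computed by SHA1M ("majority"). */
static uint32_t f_majority(uint32_t b, uint32_t c, uint32_t d)
{
  return (b & c) | (b & d) | (c & d);
}
/* SHA1H: the fixed left-rotate by 30 applied to the 'a' state word. */
static uint32_t fixed_rotate(uint32_t a)
{
  return (a << 30) | (a >> 2);
}
```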
Benchmark on gcc117 instance of CFarm before applying the patch:
Algorithm         mode       Mbyte/s
sha1              update     214.16
openssl sha1      update     849.44

Benchmark on gcc117 instance of CFarm after applying the patch:
Algorithm         mode       Mbyte/s
sha1              update     795.57
openssl sha1      update     849.25
Great speedup! Any idea why openssl is still slightly faster?
+define(`TMP0', `v21')
+define(`TMP1', `v22')
Not sure I understand how these are used, but it looks like the TMP variables hold some of the message expansion state? E.g., TMP0 is assigned in the code for rounds 0-3, and this value is used in the code for rounds 8-11. Other implementations don't need extra state for this, but just modify the 16 message words in-place.
It would be nice to either make the TMP registers more temporary (i.e., no round depends on the value in these registers from previous rounds) and keep needed state only on the MSG variables. Or rename them to give a better hint on how they're used.
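To illustrate what "modifies the 16 message words in-place" means, here's a scalar sketch (my own code, not from the patch) where the schedule keeps only a 16-word window and overwrites each slot as it is recomputed, using mod-16 indexing so W[t-16] lands in the same slot as W[t]:

```c
#include <stdint.h>

static uint32_t rol1(uint32_t x)
{
  return (x << 1) | (x >> 31);
}

/* W[t] = rol1(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16]), stored back into
   w[t mod 16]; no state beyond the 16 message words is needed. */
static uint32_t expand(uint32_t w[16], unsigned t)
{
  w[t & 15] = rol1(w[(t - 3) & 15] ^ w[(t - 8) & 15]
                   ^ w[(t - 14) & 15] ^ w[t & 15]);
  return w[t & 15];
}
```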
+C void nettle_sha1_compress(uint32_t *state, const uint8_t *input)
+PROLOGUE(nettle_sha1_compress)
+ C Initialize constants
+ mov w2,#0x7999
+ movk w2,#0x5A82,lsl #16
+ dup CONST0.4s,w2
+ mov w2,#0xEBA1
+ movk w2,#0x6ED9,lsl #16
+ dup CONST1.4s,w2
+ mov w2,#0xBCDC
+ movk w2,#0x8F1B,lsl #16
+ dup CONST2.4s,w2
+ mov w2,#0xC1D6
+ movk w2,#0xCA62,lsl #16
+ dup CONST3.4s,w2
Maybe it would be clearer or more efficient to load these from memory? Not sure if there's a nice and concise way to load the four 32-bit values into a 128-bit register, and then copy/duplicate them into the four const registers.
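For concreteness, the table could look something like this (table name is my invention); I believe a single ld4r, which loads one element per register from consecutive memory and replicates it to all lanes, could then fill all four const registers from it, though I haven't benchmarked that against the mov/movk/dup sequence:

```c
#include <stdint.h>

/* The four SHA-1 round constants, matching the immediates built above. */
static const uint32_t sha1_k[4] = {
  0x5A827999, /* rounds  0-19 */
  0x6ED9EBA1, /* rounds 20-39 */
  0x8F1BBCDC, /* rounds 40-59 */
  0xCA62C1D6, /* rounds 60-79 */
};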
+ C Load message
+ ld1 {MSG0.16b,MSG1.16b,MSG2.16b,MSG3.16b},[INPUT]
+ C Reverse for little endian
+ rev32 MSG0.16b,MSG0.16b
+ rev32 MSG1.16b,MSG1.16b
+ rev32 MSG2.16b,MSG2.16b
+ rev32 MSG3.16b,MSG3.16b
How does this work on big-endian? The ld1 with .16b is endian-neutral (according to the README), so on big-endian we always get the wrong order, and then we do unconditional byteswapping? Maybe add a comment. Not sure if it's worth the effort to make it work differently (ld1 .4s on big-endian)? It's going to be a pretty small fraction of the per-block processing.
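The target semantics, as I understand them, are just big-endian word loads regardless of host byte order; a portable scalar equivalent (my own sketch) of what the byte load plus rev32 achieves on little-endian would be:

```c
#include <stdint.h>

/* Read a SHA-1 message word: big-endian, independent of host order.
   On little-endian aarch64, ld1 .16b followed by rev32 gives the same
   result for four words at a time. */
static uint32_t read_be32(const uint8_t *p)
{
  return ((uint32_t) p[0] << 24) | ((uint32_t) p[1] << 16)
       | ((uint32_t) p[2] << 8)  |  (uint32_t) p[3];
}
```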
Regards, /Niels