Maamoun TK maamoun.tk@googlemail.com writes:
Looks pretty good. A few comments and questions below.
This patch optimizes SHA1 compress function for arm64 architecture by taking advantage of SHA-1 instructions of Armv8 crypto extension. The SHA-1 instructions:
SHA1C: SHA1 hash update (choose)
SHA1H: SHA1 fixed rotate
SHA1M: SHA1 hash update (majority)
SHA1P: SHA1 hash update (parity)
SHA1SU0: SHA1 schedule update 0
SHA1SU1: SHA1 schedule update 1
Can you add this brief summary of instructions as a comment in the asm file?
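For reference, a minimal scalar sketch (my own helper names, not code from the patch) of the SHA-1 round functions that SHA1C, SHA1P and SHA1M evaluate four rounds of at a time, and the fixed rotate done by SHA1H:

```c
#include <stdint.h>

/* Round 0-19 function, computed by SHA1C ("choose"). */
static uint32_t f_choose(uint32_t b, uint32_t c, uint32_t d)
{
  return (b & c) | (~b & d);
}
/* Round 20-39 and 60-79 function, computed by SHA1P ("parity"). */
static uint32_t f_parity(uint32_t b, uint32_t c, uint32_t d)
{
  return b ^ c ^ d;
}
/* Round 40-59 function, computed by SHA1M ("majority"). */
static uint32_t f_majority(uint32_t b, uint32_t c, uint32_t d)
{
  return (b & c) | (b & d) | (c & d);
}
/* SHA1H: the fixed left-rotate by 30 applied to the 'a' state word. */
static uint32_t fixed_rotate(uint32_t a)
{
  return (a << 30) | (a >> 2);
}
```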
Benchmark on gcc117 instance of CFarm before applying the patch:
Algorithm         mode       Mbyte/s
sha1              update     214.16
openssl sha1      update     849.44

Benchmark on gcc117 instance of CFarm after applying the patch:
Algorithm         mode       Mbyte/s
sha1              update     795.57
openssl sha1      update     849.25
Great speedup! Any idea why openssl is still slightly faster?
+define(`TMP0', `v21')
+define(`TMP1', `v22')
Not sure I understand how these are used, but it looks like the TMP variables hold some of the message expansion state? E.g., TMP0 is assigned in the code for rounds 0-3, and this value is used in the code for rounds 8-11. Other implementations don't need extra state for this, but just modify the 16 message words in-place.
It would be nice to either make the TMP registers more temporary (i.e., no round depends on the value in these registers from previous rounds) and keep needed state only on the MSG variables. Or rename them to give a better hint on how they're used.
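To illustrate what "modifies the 16 message words in-place" means, here's a scalar sketch (my own code, not from the patch) where the schedule keeps only a 16-word window and overwrites each slot as it is recomputed, using mod-16 indexing so W[t-16] lands in the same slot as W[t]:

```c
#include <stdint.h>

static uint32_t rol1(uint32_t x)
{
  return (x << 1) | (x >> 31);
}

/* W[t] = rol1(W[t-3] ^ W[t-8] ^ W[t-14] ^ W[t-16]), stored back into
   w[t mod 16]; no state beyond the 16 message words is needed. */
static uint32_t expand(uint32_t w[16], unsigned t)
{
  w[t & 15] = rol1(w[(t - 3) & 15] ^ w[(t - 8) & 15]
                   ^ w[(t - 14) & 15] ^ w[t & 15]);
  return w[t & 15];
}
```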
+C void nettle_sha1_compress(uint32_t *state, const uint8_t *input)
+PROLOGUE(nettle_sha1_compress)
+ C Initialize constants
+ mov w2,#0x7999
+ movk w2,#0x5A82,lsl #16
+ dup CONST0.4s,w2
+ mov w2,#0xEBA1
+ movk w2,#0x6ED9,lsl #16
+ dup CONST1.4s,w2
+ mov w2,#0xBCDC
+ movk w2,#0x8F1B,lsl #16
+ dup CONST2.4s,w2
+ mov w2,#0xC1D6
+ movk w2,#0xCA62,lsl #16
+ dup CONST3.4s,w2
Maybe it would be clearer or more efficient to load these from memory? Not sure if there's a nice and concise way to load the four 32-bit values into a 128-bit register, and then copy/duplicate them into the four const registers.
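For concreteness, the table could look something like this (table name is my invention); I believe a single ld4r, which loads one element per register from consecutive memory and replicates it to all lanes, could then fill all four const registers from it, though I haven't benchmarked that against the mov/movk/dup sequence:

```c
#include <stdint.h>

/* The four SHA-1 round constants, matching the immediates built above. */
static const uint32_t sha1_k[4] = {
  0x5A827999, /* rounds  0-19 */
  0x6ED9EBA1, /* rounds 20-39 */
  0x8F1BBCDC, /* rounds 40-59 */
  0xCA62C1D6, /* rounds 60-79 */
};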
+ C Load message
+ ld1 {MSG0.16b,MSG1.16b,MSG2.16b,MSG3.16b},[INPUT]
+ C Reverse for little endian
+ rev32 MSG0.16b,MSG0.16b
+ rev32 MSG1.16b,MSG1.16b
+ rev32 MSG2.16b,MSG2.16b
+ rev32 MSG3.16b,MSG3.16b
How does this work on big-endian? The ld1 with .16b is endian-neutral (according to the README), so on big-endian we always get the wrong order, and then we do unconditional byteswapping? Maybe add a comment. Not sure if it's worth the effort to make it work differently (ld1 .4s on big-endian)? It's going to be a pretty small fraction of the per-block processing.
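The target semantics, as I understand them, are just big-endian word loads regardless of host byte order; a portable scalar equivalent (my own sketch) of what the byte load plus rev32 achieves on little-endian would be:

```c
#include <stdint.h>

/* Read a SHA-1 message word: big-endian, independent of host order.
   On little-endian aarch64, ld1 .16b followed by rev32 gives the same
   result for four words at a time. */
static uint32_t read_be32(const uint8_t *p)
{
  return ((uint32_t) p[0] << 24) | ((uint32_t) p[1] << 16)
       | ((uint32_t) p[2] << 8)  |  (uint32_t) p[3];
}
```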
Regards, /Niels