Re: [S390x] Optimize SHA1 with fat build support

20 Sep 2021


On Sun, Aug 29, 2021 at 5:52 PM Maamoun TK maamoun.tk@googlemail.com wrote:
...
Applying the hardware-accelerated SHA3 instruction to optimize the
sha3_permute function for the s390x arch has an insignificant impact on
performance, so I'm wondering what we can do to take full advantage of those
instructions. Optimizing sha3_absorb seems a good way to go, since the
s390x-specific accelerator covers both the permutation of the state bytes
and the XOR operations. The downside of implementing this function is
handling the block size variants for each mode. The s390x arch supports the
standard block sizes, so we can branch on each standard size in the
supported modes, but should we account for unexpected block sizes in the
implementation?
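To make the block-size question concrete, here is a minimal C sketch of the branching, not a proposed implementation. The helper name kimd_code_for_block_size is hypothetical. Function code 35 for SHA3-512 is the one used in the patch; codes 32 to 34 are what I understand the Principles of Operation to assign to the other SHA3 variants and should be double-checked. Any non-standard block size returns -1 so the caller can fall back to the generic C code.

```c
#include <stddef.h>

/* KIMD function codes for the SHA3 variants. 35 matches the patch;
   32..34 are assumed from the Principles of Operation. */
enum {
  KIMD_SHA3_224 = 32,
  KIMD_SHA3_256 = 33,
  KIMD_SHA3_384 = 34,
  KIMD_SHA3_512 = 35
};

/* Hypothetical helper: map a SHA3 block size (rate in bytes) to a KIMD
   function code, or return -1 for a size the accelerator does not cover. */
static int
kimd_code_for_block_size(size_t block_size)
{
  switch (block_size)
    {
    case 144: return KIMD_SHA3_224;
    case 136: return KIMD_SHA3_256;
    case 104: return KIMD_SHA3_384;
    case 72:  return KIMD_SHA3_512;
    default:  return -1;   /* unexpected size: use the generic path */
    }
}
```

With a mapping like this, an unexpected block size costs one extra branch and never reaches the accelerator.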
I got an almost 12% speedup by optimizing the sha3_permute() function using
the SHA hardware accelerator of s390x. Is it worth adding that assembly
implementation? I'll attach the patch at the end of this email.

On another topic, are you aware of any CFarm alternative that has an arm64
machine with SHA-256 and SHA3 support, so I can continue optimizing those
functions for the aarch64 architecture, as well as an x86_64 machine with
shani support, so I can complete the patch for the sha1_compress_n()
function and maximize the performance of the SHA1 compress function on
hardware-supported architectures?
C s390x/msa_x6/sha3-permute.asm
ifelse(`
   Copyright (C) 2021 Mamone Tarsha
   This file is part of GNU Nettle.
GNU Nettle is free software: you can redistribute it and/or
   modify it under the terms of either:
* the GNU Lesser General Public License as published by the Free
       Software Foundation; either version 3 of the License, or (at your
       option) any later version.
or
* the GNU General Public License as published by the Free
       Software Foundation; either version 2 of the License, or (at your
       option) any later version.
or both in parallel, as here.
GNU Nettle is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
   General Public License for more details.
You should have received copies of the GNU General Public License and
   the GNU Lesser General Public License along with this program.  If
   not, see http://www.gnu.org/licenses/.
')
C KIMD (COMPUTE INTERMEDIATE MESSAGE DIGEST) is specified in
C "z/Architecture Principles of Operation SA22-7832-12" as follows:
C A function specified by the function code in general register 0 is
C performed.
C General register 1 contains the logical address of the leftmost byte of
C the parameter block in storage.
C The second operand is processed as specified by the function code using
C an initial chaining value in
C the parameter block, and the result replaces the chaining value.
C This implementation uses the KIMD-SHA3-512 function.
C The parameter block used for the KIMD-SHA3-512 function has the following
C format:
C *----------------------------------------------*
C |               ICV (200 bytes)                |
C *----------------------------------------------*
C SHA function code
define(`SHA3_512_FUNCTION_CODE', `35')
C Size of block
define(`SHA3_512_BLOCK_SIZE', `72')
C Size of state
define(`SHA3_STATE_SIZE', `200')
.file "sha3-permute.asm"
.text
C void
C sha3_permute(struct sha3_ctx *ctx)
PROLOGUE(nettle_sha3_permute)
    lghi           %r0,SHA3_512_FUNCTION_CODE    C FUNCTION_CODE
    ALLOC_STACK(%r1,SHA3_STATE_SIZE+SHA3_512_BLOCK_SIZE)
.irp idx, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24
    mvcin          \idx*8(8,%r1),\idx*8+7(%r2)
.endr
    la             %r4,SHA3_STATE_SIZE (%r1)
    xc             0(SHA3_512_BLOCK_SIZE,%r4),0(%r4)
    lghi           %r5,SHA3_512_BLOCK_SIZE
1:  .long   0xb93e0004                           C kimd %r0,%r4. perform KIMD-SHA operation on data
    brc            1,1b
.irp idx, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24
    mvcin          \idx*8(8,%r2),\idx*8+7(%r1)
.endr
    FREE_STACK(SHA3_STATE_SIZE+SHA3_512_BLOCK_SIZE)
    br             RA
EPILOGUE(nettle_sha3_permute)
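
For readers who don't have the s390x instruction set at hand, the data movement around the KIMD call can be modelled in plain C. This is only an illustrative sketch under my reading of the patch: the names reverse_lanes and xor_block are hypothetical and the permutation itself is not modelled. Each mvcin copies one 8-byte lane with its bytes reversed, and because the block handed to KIMD is all zeros, the XOR step leaves the state untouched, so only the permutation takes effect.

```c
#include <stddef.h>
#include <stdint.h>

#define SHA3_STATE_LANES 25   /* 25 lanes of 8 bytes = 200-byte state */

/* Copy the state while reversing the bytes inside each 8-byte lane,
   modelling the 25 mvcin instructions on either side of KIMD. */
static void
reverse_lanes(uint8_t *dst, const uint8_t *src)
{
  for (size_t lane = 0; lane < SHA3_STATE_LANES; lane++)
    for (size_t i = 0; i < 8; i++)
      dst[lane * 8 + i] = src[lane * 8 + 7 - i];
}

/* XOR a block into the leading bytes of the state, modelling the absorb
   step that precedes the permutation. A zero block is a no-op here. */
static void
xor_block(uint8_t *state, const uint8_t *block, size_t block_size)
{
  for (size_t i = 0; i < block_size; i++)
    state[i] ^= block[i];
}
```

Applying reverse_lanes twice restores the original layout, which is why the same .irp loop appears before and after the KIMD instruction with source and destination swapped.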
regards,
Mamone
