On Sun, Aug 29, 2021 at 5:52 PM Maamoun TK maamoun.tk@googlemail.com wrote:
Applying the hardware-accelerated SHA3 instructions to optimize the sha3_permute function on s390x has an insignificant impact on performance, so I'm wondering what we can do to take full advantage of those instructions. Optimizing sha3_absorb seems like a good way to go, since using the s390x-specific accelerator implies permuting the state bytes and XOR operations. The downside of implementing this function is handling the block size variants for each mode. The s390x architecture supports the standard block sizes, so we can branch on each standard size in the supported modes, but should we also consider unexpected block sizes in the implementation?
I got an almost 12% speedup by optimizing the sha3_permute() function using the SHA hardware accelerator of s390x. Is it worth adding that assembly implementation? I'll attach the patch at the end of this email.
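To make the block-size question concrete, here is a minimal C sketch of the dispatch a sha3_absorb implementation might need. The helper name sha3_kimd_function_code is hypothetical and not part of Nettle. Only the pairing of function code 35 with a 72-byte block comes from the patch in this email; the other three codes are my reading of the KIMD function-code table and should be checked against the Principles of Operation. An unexpected block size returns -1, so the caller could fall back to the generic C code.

```c
#include <stddef.h>

/* KIMD function codes for the SHA-3 variants. Code 35 is taken from the
   patch in this email; 32..34 are assumptions to be verified against
   "z/Architecture Principles of Operation". */
enum
{
  KIMD_SHA3_224 = 32,
  KIMD_SHA3_256 = 33,
  KIMD_SHA3_384 = 34,
  KIMD_SHA3_512 = 35
};

/* Hypothetical helper: map a SHA-3 block size (rate, in bytes) to the
   matching KIMD function code, or return -1 for a non-standard size so
   that the caller can use the portable implementation instead. */
int
sha3_kimd_function_code (size_t block_size)
{
  switch (block_size)
    {
    case 144: return KIMD_SHA3_224;  /* SHA3-224 rate */
    case 136: return KIMD_SHA3_256;  /* SHA3-256 rate */
    case 104: return KIMD_SHA3_384;  /* SHA3-384 rate */
    case 72:  return KIMD_SHA3_512;  /* SHA3-512 rate */
    default:  return -1;             /* unexpected block size */
    }
}
```

With a helper like this, the assembly would only need one branch per standard size, and any other size would never reach the accelerated path.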
On another topic, are you aware of any CFarm alternative that has an arm64 machine with SHA-256 and SHA3 support, so that I can continue optimizing those functions for the aarch64 architecture, as well as an x86_64 machine with SHA-NI support, so that I can complete the patch for the sha1_compress_n() function and maximize the performance of the SHA1 compress function on hardware-supported architectures?
C s390x/msa_x6/sha3-permute.asm
ifelse(`
   Copyright (C) 2021 Mamone Tarsha

   This file is part of GNU Nettle.
GNU Nettle is free software: you can redistribute it and/or modify it under the terms of either:
* the GNU Lesser General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version.
or
* the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
or both in parallel, as here.
GNU Nettle is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received copies of the GNU General Public License and the GNU Lesser General Public License along with this program. If not, see http://www.gnu.org/licenses/. ')
C KIMD (COMPUTE INTERMEDIATE MESSAGE DIGEST) is specified in
C "z/Architecture Principles of Operation SA22-7832-12" as follows:
C A function specified by the function code in general register 0 is performed.
C General register 1 contains the logical address of the leftmost byte of the parameter block in storage.
C The second operand is processed as specified by the function code using an initial chaining value in
C the parameter block, and the result replaces the chaining value.
C This implementation uses KIMD-SHA3-512 function.
C The parameter block used for the KIMD-SHA3-512 function has the following format:
C *----------------------------------------------*
C |                ICV (200 bytes)               |
C *----------------------------------------------*
C SHA function code
define(`SHA3_512_FUNCTION_CODE', `35')
C Size of block
define(`SHA3_512_BLOCK_SIZE', `72')
C Size of state
define(`SHA3_STATE_SIZE', `200')
.file "sha3-permute.asm"
.text
C void
C sha3_permute(struct sha3_ctx *ctx)
PROLOGUE(nettle_sha3_permute)
    lghi           %r0,SHA3_512_FUNCTION_CODE      C FUNCTION_CODE
    ALLOC_STACK(%r1,SHA3_STATE_SIZE+SHA3_512_BLOCK_SIZE)
.irp idx, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24
    mvcin          \idx*8(8,%r1),\idx*8+7(%r2)
.endr
    la             %r4,SHA3_STATE_SIZE (%r1)
    xc             0(SHA3_512_BLOCK_SIZE,%r4),0(%r4)
    lghi           %r5,SHA3_512_BLOCK_SIZE
1:  .long   0xb93e0004      C kimd %r0,%r4. perform KIMD-SHA operation on data
    brc            1,1b
.irp idx, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24
    mvcin          \idx*8(8,%r2),\idx*8+7(%r1)
.endr
    FREE_STACK(SHA3_STATE_SIZE+SHA3_512_BLOCK_SIZE)
    br             RA
EPILOGUE(nettle_sha3_permute)
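For readers less familiar with s390x, here is a rough C model of the data movement in the routine above. It is only a sketch: the function names are hypothetical, and the KIMD permutation itself is not modelled. reverse_state mirrors the two .irp/mvcin loops, which copy each of the 25 state lanes with its 8 bytes reversed. absorb_block shows why the zeroed 72-byte block is harmless, assuming KIMD XORs the block into the chaining value before permuting: XORing zeros leaves the state unchanged, so only the permutation takes effect.

```c
#include <stddef.h>
#include <stdint.h>

#define SHA3_STATE_BYTES 200
#define SHA3_LANES 25

/* Models one MVCIN of length 8: copy a lane with its bytes reversed.
   dst and src must not overlap. */
void
reverse_lane (uint8_t *dst, const uint8_t *src)
{
  for (int i = 0; i < 8; i++)
    dst[i] = src[7 - i];
}

/* Models the .irp loops: reverse the bytes of each of the 25 lanes
   while copying the 200-byte state to a separate buffer. */
void
reverse_state (uint8_t *dst, const uint8_t *src)
{
  for (int i = 0; i < SHA3_LANES; i++)
    reverse_lane (dst + 8 * i, src + 8 * i);
}

/* Models the XOR step of absorbing one block into the state. With an
   all-zero block, as prepared by the XC instruction in the patch, the
   state is left unchanged. */
void
absorb_block (uint8_t *state, const uint8_t *block, size_t length)
{
  for (size_t i = 0; i < length; i++)
    state[i] ^= block[i];
}
```

Applying reverse_state twice restores the original byte order, which is why the routine can reverse on the way into the stack buffer and again on the way back into the context.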
regards, Mamone