I made a merge request in the main repository that optimizes SHA1 for the s390x architecture, with fat build support: !33 https://git.lysator.liu.se/nettle/nettle/-/merge_requests/33.
regards, Mamone
Maamoun TK maamoun.tk@googlemail.com writes:
Regarding the discussion on https://git.lysator.liu.se/nettle/nettle/-/merge_requests/33#note_10005: It seems the sha1 instructions on s390x are fast enough that the overhead of loading constants, and loading and storing the state, all per block, is a significant cost.
I think it makes sense to change the internal convention for _sha1_compress so that it can do multiple blocks. There are currently 5 assembly implementations that would need updating: arm/v6, arm64/crypto, x86, x86_64 and x86_64/sha_ni. And the C implementation, of course.
If it turns out to be too large a change to do them all at once, one could introduce some new _sha1_compress_n function or the like, and use when available. Actually, we probably need to do that anyway, since for historical reasons, _nettle_sha1_compress is a public function, and needs to be kept (as just a simple C wrapper) for backwards compatibility. Changing it incrementally should be doable but a bit hairy.
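For concreteness, a rough C sketch of the convention I have in mind (all names here are hypothetical, and a toy round function stands in for the real SHA1 compression, just to keep the example self-contained):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 64

/* Toy stand-in for the real single-block SHA1 round, only so the
   sketch compiles and can be exercised. */
static void
toy_compress_one (uint32_t *state, const uint8_t *block)
{
  for (size_t i = 0; i < BLOCK_SIZE; i++)
    state[i % 5] += block[i];
}

/* Hypothetical multi-block convention: process BLOCKS consecutive
   64-byte blocks and return a pointer just past the consumed data.
   An optimized implementation would keep the state in registers for
   the whole loop, instead of reloading and storing it per block. */
static const uint8_t *
compress_n (uint32_t *state, size_t blocks, const uint8_t *input)
{
  for (; blocks > 0; blocks--, input += BLOCK_SIZE)
    toy_compress_one (state, input);
  return input;
}

/* The historical public single-block entry point survives as a thin
   wrapper, preserving the old ABI. */
static void
compress_compat (uint32_t *state, const uint8_t *block)
{
  compress_n (state, 1, block);
}
```

The point is that the multi-block function hoists the per-block overhead out of the loop, while the old public function keeps working unchanged as a one-block call.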
There are some other similar compression functions with assembly implementation, for md5, sha256 and sha512. But there's no need to change them all at the same time, or at all.
Regarding the MD_UPDATE macro, that one is defined in the public header file macros.h (which in retrospect was a mistake). So it's probably best to leave it unchanged. New macros for the new convention should be put into some internal header, e.g., md-internal.h.
Regards, /Niels
On Tue, Aug 10, 2021 at 11:55 PM Niels Möller nisse@lysator.liu.se wrote:
I've implemented initial support for a sha1_compress_n function in this branch: https://git.lysator.liu.se/mamonet/nettle/-/tree/sha1-compress-n

The function works and performs as expected; I also adapted the sha1_compress of s390x and arm64 to the new compress function. Predictably, SHA1 update now performs on par with the OpenSSL function on the arm64 architecture.

Benchmark of executing examples/nettle-benchmark on arm64:

Algorithm         mode      Mbyte/s
sha1              update     849.82
openssl sha1      update     849.73

Benchmark of executing examples/nettle-benchmark on s390x:

Algorithm         mode      Mbyte/s
sha1              update    1791.25

With the new compress function, the s390x performance is now double that of the single-block function optimized with the built-in SHA1 accelerator. The x86, x86_64, and arm implementations still need to be adapted to the new compress function, and the patch may have room for further improvement in naming convention and documentation.
regards, Mamone
On Thu, Aug 12, 2021 at 4:26 PM Maamoun TK maamoun.tk@googlemail.com wrote:
Yet, there are implementations of x86, x86_64, and arm architectures to adapt with the new compress function
I've adapted the basic x86_64 implementation to the sha1_compress_n function in the same branch. Unfortunately, my x86_64 CPU doesn't support the SHA extension, so I'm trying to figure out a simple way to test the hardware-accelerated implementation.
regards, Mamone
What is x86/sha1-compress.nlms? How can I implement the nettle_compress_n function for that particular file type?
regards, Mamone
Maamoun TK maamoun.tk@googlemail.com writes:
What is x86/sha1-compress.nlms? How can I implement the nettle_compress_n function for that particular file type?
That's an input file for an obscure "loop mixer" tool; IIRC, it was written mainly by David Harvey for use with GMP loops. The tool tries permuting the instructions of an assembly loop, taking dependencies into account, benchmarks each variant, and tries to find the fastest instruction sequence. It seems I tried this tool on the x86 sha1_compress back in 2009, on an AMD K7, and it gave a 17% speedup at the time, according to the commit message for 1e757582ac7f8465b213d9761e17c33bd21ca686.
So you can just ignore this file. And you may want to look at the more readable version of x86/sha1_compress.asm, just before that commit.
Regards, /Niels
On Thu, Aug 19, 2021 at 8:48 AM Niels Möller nisse@lysator.liu.se wrote:
Thanks. I left the nlms files as they are and modified x86/sha1_compress.asm to work with the sha1_compress_n function. I've kept the function parameters on the stack, since the instructions can operate on memory operands and the x86 calling convention passes parameters on the stack. I'm not sure whether those parameters are read-only or may be modified; TBH, I haven't touched 32-bit x86 code in 8 years. What I did was reserve fields on the stack for the two parameters and update both values in the new locations, keeping the original values unmodified.
regards, Mamone
I added support for the sha1_compress_n function on the arm architecture in the same branch: https://git.lysator.liu.se/mamonet/nettle/-/tree/sha1-compress-n
regards, Mamone
Applying the hardware-accelerated SHA3 instruction to optimize the sha3_permute function for the s390x arch has an insignificant impact on performance, so I'm wondering what we can do to take full advantage of those instructions. Optimizing sha3_absorb seems a good way to go, since the s390x-specific accelerator handles both the permutation of the state bytes and the XOR operations. The downside of implementing this function is handling the block size variants for each mode. The s390x arch supports the standard block sizes, so we can branch on each standard size in the supported modes, but should we also handle unexpected block sizes in the implementation?
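To make the branching idea concrete, here is a small sketch (the function name is made up) of recognizing the standard rates, with any other block size falling back to the generic C path:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative dispatcher: a hardware-specific absorb would be used
   only for the standard SHA3 rates (block sizes in bytes), and a
   generic fallback would handle any nonstandard rate. */
static int
is_standard_sha3_rate (size_t block_size)
{
  switch (block_size)
    {
    case 144: /* SHA3-224 */
    case 136: /* SHA3-256, SHAKE256 */
    case 104: /* SHA3-384 */
    case 72:  /* SHA3-512 */
    case 168: /* SHAKE128 */
      return 1;
    default:
      return 0;
    }
}
```

With a check like this up front, the unexpected-block-size question reduces to keeping the generic path around as the default branch.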
regards, Mamone
On Sun, Aug 29, 2021 at 5:52 PM Maamoun TK maamoun.tk@googlemail.com wrote:
I got almost a 12% speedup from optimizing the sha3_permute() function using the SHA hardware accelerator of s390x; is it worth adding that assembly implementation? I'll attach the patch at the end of this email.
On another topic, are you aware of any CFarm alternative that has an arm64 machine with SHA-256 and SHA3 support, to continue optimizing those functions for the aarch64 architecture, and an x86_64 machine with SHA-NI support, to complete the patch for the sha1_compress_n() function and maximize the performance of the SHA1 compress function on hardware-supported architectures?
C s390x/msa_x6/sha3-permute.asm
ifelse(`
   Copyright (C) 2021 Mamone Tarsha

   This file is part of GNU Nettle.

   GNU Nettle is free software: you can redistribute it and/or
   modify it under the terms of either:

     * the GNU Lesser General Public License as published by the Free
       Software Foundation; either version 3 of the License, or (at your
       option) any later version.

   or

     * the GNU General Public License as published by the Free
       Software Foundation; either version 2 of the License, or (at your
       option) any later version.

   or both in parallel, as here.

   GNU Nettle is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
   General Public License for more details.

   You should have received copies of the GNU General Public License and
   the GNU Lesser General Public License along with this program.  If
   not, see http://www.gnu.org/licenses/.
')
C KIMD (COMPUTE INTERMEDIATE MESSAGE DIGEST) is specified in
C "z/Architecture Principles of Operation SA22-7832-12" as follows:
C A function specified by the function code in general register 0 is performed.
C General register 1 contains the logical address of the leftmost byte of the parameter block in storage.
C The second operand is processed as specified by the function code using an initial chaining value in
C the parameter block, and the result replaces the chaining value.

C This implementation uses the KIMD-SHA3-512 function.
C The parameter block used for the KIMD-SHA3-512 function has the following format:
C *----------------------------------------------*
C |               ICV (200 bytes)                |
C *----------------------------------------------*

C SHA function code
define(`SHA3_512_FUNCTION_CODE', `35')
C Size of block
define(`SHA3_512_BLOCK_SIZE', `72')
C Size of state
define(`SHA3_STATE_SIZE', `200')
.file "sha3-permute.asm"
.text
C void
C sha3_permute(struct sha3_ctx *ctx)
PROLOGUE(nettle_sha3_permute)
    lghi           %r0,SHA3_512_FUNCTION_CODE     C FUNCTION_CODE
    ALLOC_STACK(%r1,SHA3_STATE_SIZE+SHA3_512_BLOCK_SIZE)
.irp idx, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24
    mvcin          \idx*8(8,%r1),\idx*8+7(%r2)
.endr
    la             %r4,SHA3_STATE_SIZE(%r1)
    xc             0(SHA3_512_BLOCK_SIZE,%r4),0(%r4)
    lghi           %r5,SHA3_512_BLOCK_SIZE
1:  .long          0xb93e0004                     C kimd %r0,%r4. perform KIMD-SHA operation on data
    brc            1,1b
.irp idx, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24
    mvcin          \idx*8(8,%r2),\idx*8+7(%r1)
.endr
    FREE_STACK(SHA3_STATE_SIZE+SHA3_512_BLOCK_SIZE)
    br             RA
EPILOGUE(nettle_sha3_permute)
regards, Mamone
Maamoun TK maamoun.tk@googlemail.com writes:
I got almost 12% speedup of optimizing the sha3_permute() function using the SHA hardware accelerator of s390x, is it worth adding that assembly implementation?
For such a small assembly function, I think it's worth the effort (more questionable if it was worth adding the special instructions for it...).
If you have the time, you could also try out doing it with vector registers, like on x86_64 and arm/neon. Some difficulties in the x86_64 implementation were (i) xmm register shortage, (ii) moving 64-bit pieces between the 128-bit xmm registers, and (iii) rotating the 64-bit pieces of an xmm register by different shift counts.
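Difficulty (iii) comes from Keccak's rho step: each of the 25 lanes is rotated left by its own fixed offset, and most SIMD instruction sets have no single instruction that rotates each lane by a different count. In scalar C the operation itself is trivial, which is what makes the SIMD version awkward; a sketch (the c == 0 guard avoids the undefined behavior of a 64-bit shift by 64):

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 64-bit lane left by c bits, 0 <= c < 64. */
static inline uint64_t
rotl64 (uint64_t x, unsigned c)
{
  return c == 0 ? x : (x << c) | (x >> (64 - c));
}
```

In vector code, rotating two packed lanes by two different counts typically takes a pair of shifts and an OR per distinct count, which is where the register pressure and shuffling come from.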
Regards, /Niels