I made a merge request in the main repository that optimizes SHA1 for the s390x architecture, with fat build support: !33 https://git.lysator.liu.se/nettle/nettle/-/merge_requests/33.
regards, Mamone
Maamoun TK maamoun.tk@googlemail.com writes:
Regarding the discussion on https://git.lysator.liu.se/nettle/nettle/-/merge_requests/33#note_10005: It seems the sha1 instructions on s390x are fast enough that the overhead of loading constants, and loading and storing the state, all per block, is a significant cost.
I think it makes sense to change the internal convention for _sha1_compress so that it can do multiple blocks. There are currently 5 assembly implementations that would need updating: arm/v6, arm64/crypto, x86, x86_64 and x86_64/sha_ni. And the C implementation, of course.
If it turns out to be too large a change to do them all at once, one could introduce some new _sha1_compress_n function or the like, and use when available. Actually, we probably need to do that anyway, since for historical reasons, _nettle_sha1_compress is a public function, and needs to be kept (as just a simple C wrapper) for backwards compatibility. Changing it incrementally should be doable but a bit hairy.
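For concreteness, a rough C sketch of the convention I have in mind (all names here are hypothetical, and a toy round function stands in for the real SHA1 compression, just to keep the example self-contained):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 64

/* Toy stand-in for the real single-block SHA1 round, only so the
   sketch compiles and can be exercised. */
static void
toy_compress_one (uint32_t *state, const uint8_t *block)
{
  for (size_t i = 0; i < BLOCK_SIZE; i++)
    state[i % 5] += block[i];
}

/* Hypothetical multi-block convention: process BLOCKS consecutive
   64-byte blocks and return a pointer just past the consumed data.
   An optimized implementation would keep the state in registers for
   the whole loop, instead of reloading and storing it per block. */
static const uint8_t *
compress_n (uint32_t *state, size_t blocks, const uint8_t *input)
{
  for (; blocks > 0; blocks--, input += BLOCK_SIZE)
    toy_compress_one (state, input);
  return input;
}

/* The historical public single-block entry point survives as a thin
   wrapper, preserving the old ABI. */
static void
compress_compat (uint32_t *state, const uint8_t *block)
{
  compress_n (state, 1, block);
}
```

The point is that the multi-block function hoists the per-block overhead out of the loop, while the old public function keeps working unchanged as a one-block call.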
There are some other similar compression functions with assembly implementation, for md5, sha256 and sha512. But there's no need to change them all at the same time, or at all.
Regarding the MD_UPDATE macro, that one is defined in the public header file macros.h (which in retrospect was a mistake). So it's probably best to leave it unchanged. New macros for the new convention should be put into some internal header, e.g., md-internal.h.
Regards, /Niels
On Tue, Aug 10, 2021 at 11:55 PM Niels Möller nisse@lysator.liu.se wrote:
I've implemented initial support for a sha1_compress_n function in this branch: https://git.lysator.liu.se/mamonet/nettle/-/tree/sha1-compress-n

The function works and performs as expected; I also adapted the sha1_compress of s390x and arm64 to the new compress function. Predictably, SHA1 update now performs on par with the OpenSSL function on the arm64 architecture.

Benchmark of executing examples/nettle-benchmark on arm64:

Algorithm         mode      Mbyte/s
sha1              update     849.82
openssl sha1      update     849.73

Benchmark of executing examples/nettle-benchmark on s390x:

Algorithm         mode      Mbyte/s
sha1              update    1791.25

With the new compress function, the s390x performance is now double that of the single-block function optimized with the built-in SHA1 accelerator. The x86, x86_64, and arm implementations still need to be adapted to the new compress function, and the patch may have room for further improvement in naming convention and documentation.
regards, Mamone
On Thu, Aug 12, 2021 at 4:26 PM Maamoun TK maamoun.tk@googlemail.com wrote:
Yet, there are implementations of x86, x86_64, and arm architectures to adapt with the new compress function
I've adapted the basic x86_64 implementation to the sha1_compress_n function in the same branch. Unfortunately, my x86_64 CPU doesn't support the SHA extension, so I'm trying to figure out a simple way to test the hardware-accelerated implementation.
regards, Mamone
What is x86/sha1-compress.nlms? How can I implement the nettle_compress_n function for that particular file type?
regards, Mamone
Maamoun TK maamoun.tk@googlemail.com writes:
What is x86/sha1-compress.nlms? How can I implement the nettle_compress_n function for that particular file type?
That's an input file for an obscure "loop mixer" tool; IIRC, it was written mainly by David Harvey for use with GMP loops. The tool tries permuting the instructions of an assembly loop, taking dependencies into account, benchmarks each variant, and tries to find the fastest instruction sequence. It seems I tried this tool on the x86 sha1_compress back in 2009, on an AMD K7, and it gave a 17% speedup at the time, according to the commit message for 1e757582ac7f8465b213d9761e17c33bd21ca686.
So you can just ignore this file. And you may want to look at the more readable version of x86/sha1_compress.asm, just before that commit.
Regards, /Niels
On Thu, Aug 19, 2021 at 8:48 AM Niels Möller nisse@lysator.liu.se wrote:
Thanks. I left the nlms files as they are and modified x86/sha1_compress.asm to work with the sha1_compress_n function. I've kept the function parameters on the stack, since the instructions can operate on memory operands and the x86 calling convention passes parameters on the stack. I'm not sure whether those parameters are read-only or may be modified; TBH, I haven't touched 32-bit x86 code in 8 years. What I did was reserve fields on the stack for the two parameters and update both values in the new locations, keeping the original values unmodified.
regards, Mamone
I added support for the sha1_compress_n function on the arm architecture in the same branch: https://git.lysator.liu.se/mamonet/nettle/-/tree/sha1-compress-n
regards, Mamone
Applying the hardware-accelerated SHA3 instruction to optimize the sha3_permute function for the s390x arch has an insignificant impact on performance, so I'm wondering what we can do to take full advantage of those instructions. Optimizing sha3_absorb seems a good way to go, since the s390x-specific accelerator handles both the permutation of the state bytes and the XOR operations. The downside of implementing this function is handling the block size variants for each mode. The s390x arch supports the standard block sizes, so we can branch on each standard size in the supported modes, but should we also handle unexpected block sizes in the implementation?
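To make the branching idea concrete, here is a small sketch (the function name is made up) of recognizing the standard rates, with any other block size falling back to the generic C path:

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative dispatcher: a hardware-specific absorb would be used
   only for the standard SHA3 rates (block sizes in bytes), and a
   generic fallback would handle any nonstandard rate. */
static int
is_standard_sha3_rate (size_t block_size)
{
  switch (block_size)
    {
    case 144: /* SHA3-224 */
    case 136: /* SHA3-256, SHAKE256 */
    case 104: /* SHA3-384 */
    case 72:  /* SHA3-512 */
    case 168: /* SHAKE128 */
      return 1;
    default:
      return 0;
    }
}
```

With a check like this up front, the unexpected-block-size question reduces to keeping the generic path around as the default branch.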
regards, Mamone
On Sun, Aug 29, 2021 at 5:52 PM Maamoun TK maamoun.tk@googlemail.com wrote:
I got almost a 12% speedup from optimizing the sha3_permute() function using the SHA hardware accelerator of s390x; is it worth adding that assembly implementation? I'll attach the patch at the end of this email.
On another topic, are you aware of any CFarm alternative that has an arm64 machine with SHA-256 and SHA3 support, to continue optimizing those functions for the aarch64 architecture, and an x86_64 machine with SHA-NI support, to complete the patch for the sha1_compress_n() function and maximize the performance of the SHA1 compress function on hardware-supported architectures?
C s390x/msa_x6/sha3-permute.asm
ifelse(`
   Copyright (C) 2021 Mamone Tarsha

   This file is part of GNU Nettle.

   GNU Nettle is free software: you can redistribute it and/or
   modify it under the terms of either:

     * the GNU Lesser General Public License as published by the Free
       Software Foundation; either version 3 of the License, or (at your
       option) any later version.

   or

     * the GNU General Public License as published by the Free
       Software Foundation; either version 2 of the License, or (at your
       option) any later version.

   or both in parallel, as here.

   GNU Nettle is distributed in the hope that it will be useful,
   but WITHOUT ANY WARRANTY; without even the implied warranty of
   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
   General Public License for more details.

   You should have received copies of the GNU General Public License and
   the GNU Lesser General Public License along with this program.  If
   not, see http://www.gnu.org/licenses/.
')
C KIMD (COMPUTE INTERMEDIATE MESSAGE DIGEST) is specified in
C "z/Architecture Principles of Operation SA22-7832-12" as follows:
C A function specified by the function code in general register 0 is performed.
C General register 1 contains the logical address of the leftmost byte of the parameter block in storage.
C The second operand is processed as specified by the function code using an initial chaining value in
C the parameter block, and the result replaces the chaining value.

C This implementation uses the KIMD-SHA3-512 function.
C The parameter block used for the KIMD-SHA3-512 function has the following format:
C *----------------------------------------------*
C |               ICV (200 bytes)                |
C *----------------------------------------------*

C SHA function code
define(`SHA3_512_FUNCTION_CODE', `35')
C Size of block
define(`SHA3_512_BLOCK_SIZE', `72')
C Size of state
define(`SHA3_STATE_SIZE', `200')
.file "sha3-permute.asm"
.text
C void
C sha3_permute(struct sha3_ctx *ctx)
PROLOGUE(nettle_sha3_permute)
    lghi           %r0,SHA3_512_FUNCTION_CODE     C FUNCTION_CODE
    ALLOC_STACK(%r1,SHA3_STATE_SIZE+SHA3_512_BLOCK_SIZE)
.irp idx, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24
    mvcin          \idx*8(8,%r1),\idx*8+7(%r2)
.endr
    la             %r4,SHA3_STATE_SIZE(%r1)
    xc             0(SHA3_512_BLOCK_SIZE,%r4),0(%r4)
    lghi           %r5,SHA3_512_BLOCK_SIZE
1:  .long          0xb93e0004                     C kimd %r0,%r4. perform KIMD-SHA operation on data
    brc            1,1b
.irp idx, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24
    mvcin          \idx*8(8,%r2),\idx*8+7(%r1)
.endr
    FREE_STACK(SHA3_STATE_SIZE+SHA3_512_BLOCK_SIZE)
    br             RA
EPILOGUE(nettle_sha3_permute)
regards, Mamone
Maamoun TK maamoun.tk@googlemail.com writes:
I got almost 12% speedup of optimizing the sha3_permute() function using the SHA hardware accelerator of s390x, is it worth adding that assembly implementation?
For such a small assembly function, I think it's worth the effort (more questionable if it was worth adding the special instructions for it...).
If you have the time, you could also try out doing it with vector registers, like on x86_64 and arm/neon. Some difficulties in the x86_64 implementation were (i) xmm register shortage, (ii) moving 64-bit pieces between the 128-bit xmm registers, and (iii) rotating the 64-bit pieces of an xmm register by different shift counts.
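Difficulty (iii) comes from Keccak's rho step: each of the 25 lanes is rotated left by its own fixed offset, and most SIMD instruction sets have no single instruction that rotates each lane by a different count. In scalar C the operation itself is trivial, which is what makes the SIMD version awkward; a sketch (the c == 0 guard avoids the undefined behavior of a 64-bit shift by 64):

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 64-bit lane left by c bits, 0 <= c < 64. */
static inline uint64_t
rotl64 (uint64_t x, unsigned c)
{
  return c == 0 ? x : (x << c) | (x >> (64 - c));
}
```

In vector code, rotating two packed lanes by two different counts typically takes a pair of shifts and an OR per distinct count, which is where the register pressure and shuffling come from.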
Regards, /Niels