On Tue, Aug 10, 2021 at 11:55 PM Niels Möller nisse@lysator.liu.se wrote:
Maamoun TK maamoun.tk@googlemail.com writes:
I made a merge request in the main repository that optimizes SHA1 for s390x architecture with fat build support !33 https://git.lysator.liu.se/nettle/nettle/-/merge_requests/33.
Regarding the discussion on https://git.lysator.liu.se/nettle/nettle/-/merge_requests/33#note_10005: It seems the sha1 instructions on s390x are fast enough that the overhead of loading constants, and loading and storing the state, all per block, is a significant cost.
I think it makes sense to change the internal convention for _sha1_compress so that it can do multiple blocks. There are currently 5 assembly implementations that would need updating: arm/v6, arm64/crypto, x86, x86_64 and x86_64/sha_ni. And the C implementation, of course.
If it turns out to be too large a change to do them all at once, one could introduce some new _sha1_compress_n function or the like, and use it when available. Actually, we probably need to do that anyway, since for historical reasons, _nettle_sha1_compress is a public function, and needs to be kept (as just a simple C wrapper) for backwards compatibility. Changing it incrementally should be doable but a bit hairy.
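For illustration, the multi-block convention and the backwards-compatible wrapper discussed above might look roughly like the following plain C sketch. The names sketch_sha1_compress_n and sketch_sha1_compress, the argument order, and the choice to return a pointer past the consumed input are all assumptions made for this example, not the interface in the merge request or in Nettle. The round function is the standard SHA-1 compression, written for readability rather than speed.

```c
#include <stddef.h>
#include <stdint.h>

#define SHA1_BLOCK_SIZE 64
#define ROTL32(n, x) (((x) << (n)) | ((x) >> (32 - (n))))

/* Hypothetical multi-block convention: process `blocks` consecutive
   64-byte blocks and return a pointer just past the consumed input. */
const uint8_t *
sketch_sha1_compress_n(uint32_t *state, size_t blocks, const uint8_t *input)
{
  /* The state is loaded once and stored once, no matter how many
     blocks are processed. */
  uint32_t a = state[0], b = state[1], c = state[2];
  uint32_t d = state[3], e = state[4];

  for (; blocks > 0; blocks--, input += SHA1_BLOCK_SIZE)
    {
      uint32_t w[80];
      uint32_t A = a, B = b, C = c, D = d, E = e;
      unsigned i;

      /* Read the block as 16 big-endian words, then expand to 80. */
      for (i = 0; i < 16; i++)
        w[i] = ((uint32_t) input[4*i] << 24)
          | ((uint32_t) input[4*i + 1] << 16)
          | ((uint32_t) input[4*i + 2] << 8)
          | (uint32_t) input[4*i + 3];
      for (; i < 80; i++)
        w[i] = ROTL32(1, w[i-3] ^ w[i-8] ^ w[i-14] ^ w[i-16]);

      for (i = 0; i < 80; i++)
        {
          uint32_t f, k, t;
          if (i < 20)
            { f = (B & C) | (~B & D); k = 0x5A827999; }
          else if (i < 40)
            { f = B ^ C ^ D; k = 0x6ED9EBA1; }
          else if (i < 60)
            { f = (B & C) | (B & D) | (C & D); k = 0x8F1BBCDC; }
          else
            { f = B ^ C ^ D; k = 0xCA62C1D6; }

          t = ROTL32(5, A) + f + E + k + w[i];
          E = D; D = C; C = ROTL32(30, B); B = A; A = t;
        }
      a += A; b += B; c += C; d += D; e += E;
    }

  state[0] = a; state[1] = b; state[2] = c; state[3] = d; state[4] = e;
  return input;
}

/* Single-block entry point with the old calling convention, kept as
   a thin wrapper for backwards compatibility. */
void
sketch_sha1_compress(uint32_t *state, const uint8_t *input)
{
  sketch_sha1_compress_n(state, 1, input);
}
```

In this sketch a caller holding several whole blocks makes one call instead of one per block, which is where an assembly implementation could avoid reloading constants and state.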
There are some other similar compression functions with assembly implementation, for md5, sha256 and sha512. But there's no need to change them all at the same time, or at all.
Regarding the MD_UPDATE macro, that one is defined in the public header file macros.h (which in retrospect was a mistake). So it's probably best to leave it unchanged. New macros for the new convention should be put into some internal header, e.g., md-internal.h.
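As a rough sketch of what update logic under the new convention could do (written here as a function rather than a macro), the code below first completes any partially filled block, then hands all remaining whole blocks to the compress function in a single call, and finally buffers the tail. Everything in it is hypothetical: struct sketch_ctx, sketch_update, and toy_compress_n are invented for this example, and toy_compress_n is a stand-in mixing function, not SHA1 and not anything from md-internal.h.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 64

/* Hypothetical hash context, invented for this sketch. */
struct sketch_ctx
{
  uint32_t state[5];
  uint8_t block[BLOCK_SIZE];  /* partially filled block */
  unsigned index;             /* number of bytes buffered in block */
  uint64_t count;             /* number of blocks compressed so far */
};

/* Stand-in for a multi-block compress function. The mixing is a toy,
   NOT SHA1; only the calling convention matters here. */
const uint8_t *
toy_compress_n(uint32_t *state, size_t blocks, const uint8_t *input)
{
  for (; blocks > 0; blocks--, input += BLOCK_SIZE)
    for (unsigned i = 0; i < BLOCK_SIZE; i++)
      state[i % 5] = state[i % 5] * 31 + input[i];
  return input;
}

void
sketch_update(struct sketch_ctx *ctx, size_t length, const uint8_t *data)
{
  size_t blocks;

  if (length == 0)
    return;

  if (ctx->index > 0)
    {
      /* Try to complete the buffered partial block first. */
      size_t left = BLOCK_SIZE - ctx->index;
      if (length < left)
        {
          memcpy(ctx->block + ctx->index, data, length);
          ctx->index += (unsigned) length;
          return;
        }
      memcpy(ctx->block + ctx->index, data, left);
      toy_compress_n(ctx->state, 1, ctx->block);
      ctx->count++;
      data += left;
      length -= left;
    }

  /* One call covers all remaining whole blocks. */
  blocks = length / BLOCK_SIZE;
  data = toy_compress_n(ctx->state, blocks, data);
  ctx->count += blocks;

  /* Buffer whatever is left (fewer than BLOCK_SIZE bytes). */
  length %= BLOCK_SIZE;
  memcpy(ctx->block, data, length);
  ctx->index = (unsigned) length;
}
```

Feeding the same bytes in one call or in arbitrary smaller pieces should leave the context in the same state; only the number of compress calls differs.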
I've added initial support for a sha1_compress_n function in this branch: https://git.lysator.liu.se/mamonet/nettle/-/tree/sha1-compress-n
The function works and performs as expected. I also adapted the s390x and arm64 implementations of sha1_compress to the new compress function. As predicted, SHA1 update now performs on par with the OpenSSL function on the arm64 architecture.

Benchmark from running examples/nettle-benchmark on arm64:

Algorithm       mode      Mbyte/s
sha1            update     849.82
openssl sha1    update     849.73

Benchmark from running examples/nettle-benchmark on s390x:

Algorithm       mode      Mbyte/s
sha1            update    1791.25

On s390x, the new compress function runs at twice the speed of the single-block optimized function that uses the built-in SHA1 accelerator. The x86, x86_64, and arm implementations still need to be adapted to the new compress function, and the patch could be improved further in terms of naming conventions and documentation.
regards, Mamone