Applying the hardware-accelerated SHA3 instructions to optimize the sha3_permute function on s390x has an insignificant impact on performance, so I'm wondering what we can do to take full advantage of those instructions. Optimizing sha3_absorb seems a good way to go, since the s390x-specific accelerator covers both the XOR of the input into the state bytes and the permutation. The downside of implementing this function is handling the block size variants for each mode. The s390x architecture supports the standard block sizes, so we can branch on each standard size in the supported modes, but should we also handle unexpected block sizes in the implementation?
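To make the question concrete, here is a rough C sketch of one possible shape for the dispatch: the standard rates are recognized with a switch, and any other block size falls back to a portable XOR-and-permute loop. All names prefixed with sketch_ are hypothetical and are not Nettle's API, the have_hw flag is an assumed stand-in for a runtime feature check, and the placeholder permutation is not Keccak; it only lets the sketch compile on its own.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_SHA3_STATE_LANES 25 /* 1600-bit state as 64-bit lanes */

/* Hypothetical callback type standing in for sha3_permute. */
typedef void sketch_permute_fn(uint64_t state[SKETCH_SHA3_STATE_LANES]);

/* Placeholder permutation, NOT Keccak-f[1600]: it only counts calls. */
static unsigned sketch_permute_calls;
static void
sketch_placeholder_permute(uint64_t state[SKETCH_SHA3_STATE_LANES])
{
  (void) state;
  sketch_permute_calls++;
}

/* Rates (block sizes in bytes) of the standard FIPS 202 modes. */
static int
sketch_rate_is_standard(size_t block_size)
{
  switch (block_size)
    {
    case 144: /* SHA3-224 */
    case 136: /* SHA3-256, SHAKE256 */
    case 104: /* SHA3-384 */
    case 72:  /* SHA3-512 */
    case 168: /* SHAKE128 */
      return 1;
    default:
      return 0;
    }
}

/* Portable path: XOR each block into the state as little-endian
   64-bit lanes, then permute. */
static void
sketch_absorb_generic(uint64_t *state, size_t block_size, size_t n,
                      const uint8_t *data, sketch_permute_fn *permute)
{
  assert(block_size % 8 == 0
         && block_size <= 8 * SKETCH_SHA3_STATE_LANES);
  for (; n > 0; n--, data += block_size)
    {
      size_t i;
      unsigned j;
      for (i = 0; i < block_size / 8; i++)
        {
          uint64_t lane = 0;
          for (j = 0; j < 8; j++)
            lane |= (uint64_t) data[8 * i + j] << (8 * j);
          state[i] ^= lane;
        }
      permute(state);
    }
}

/* Returns 1 when the block size is eligible for the accelerated path,
   0 when only the generic fallback applies. */
static int
sketch_sha3_absorb(uint64_t *state, size_t block_size, size_t n,
                   const uint8_t *data, sketch_permute_fn *permute,
                   int have_hw)
{
  int accelerated = have_hw && sketch_rate_is_standard(block_size);
  /* An s390x build would issue the accelerated absorb here when
     `accelerated` is set; this portable sketch always runs the
     generic loop so that it stays runnable. */
  sketch_absorb_generic(state, block_size, n, data, permute);
  return accelerated;
}
```

With this shape, an unexpected block size costs only the default branch of the switch, so keeping the fallback is cheap if we decide to support it.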
regards, Mamone
On Sun, Aug 29, 2021 at 5:39 PM Maamoun TK maamoun.tk@googlemail.com wrote:
I added support for the sha1_compress_n function on arm architecture in the same branch https://git.lysator.liu.se/mamonet/nettle/-/tree/sha1-compress-n
regards, Mamone
On Sat, Aug 21, 2021 at 5:22 AM Maamoun TK maamoun.tk@googlemail.com wrote:
On Thu, Aug 19, 2021 at 8:48 AM Niels Möller nisse@lysator.liu.se wrote:
Maamoun TK maamoun.tk@googlemail.com writes:
What is x86/sha1-compress.nlms? How can I implement the nettle_compress_n function for that particular type?
That's an input file for an obscure "loop mixer" tool. IIRC, it was written mainly by David Harvey for use with GMP loops. This tool tries permuting the instructions of an assembly loop, taking dependencies into account, benchmarks each variant, and tries to find the fastest instruction sequence. It seems I tried this tool on x86 sha1_compress back in 2009, on an AMD K7, and it gave a 17% speedup at the time, according to the commit message for 1e757582ac7f8465b213d9761e17c33bd21ca686.
So you can just ignore this file. And you may want to look at the more readable version of x86/sha1_compress.asm, just before that commit.
Thanks, I left the nlms files as they are and modified x86/sha1_compress.asm to work with the sha1_compress_n function. I've kept the function parameters on the stack, since the instructions can operate on memory operands and the x86 calling convention passes parameters on the stack. I'm not sure whether those parameters are read-only or can be modified; TBH I haven't touched x86 32-bit code for 8 years. What I did was reserve slots on the stack for two of the parameters and adjust the values in those new locations, so the original values stay unmodified.
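In C terms, the approach is roughly the following sketch. The names and the (state, n, blocks) signature are assumptions for illustration and may not match the branch; the single-block function is a plain FIPS 180-4 SHA-1 compression standing in for the assembly body. The block count and input pointer are copied into locals, and only the copies are advanced.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_ROTL32(n, x) (((x) << (n)) | ((x) >> (32 - (n))))

/* Plain C single-block SHA-1 compression, standing in for the
   assembly loop body. */
static void
sketch_sha1_compress(uint32_t state[5], const uint8_t block[64])
{
  uint32_t w[80];
  uint32_t a, b, c, d, e;
  unsigned i;

  /* Load the block as 16 big-endian words, then expand to 80. */
  for (i = 0; i < 16; i++)
    w[i] = ((uint32_t) block[4 * i] << 24)
      | ((uint32_t) block[4 * i + 1] << 16)
      | ((uint32_t) block[4 * i + 2] << 8)
      | (uint32_t) block[4 * i + 3];
  for (i = 16; i < 80; i++)
    w[i] = SKETCH_ROTL32(1, w[i - 3] ^ w[i - 8] ^ w[i - 14] ^ w[i - 16]);

  a = state[0]; b = state[1]; c = state[2]; d = state[3]; e = state[4];
  for (i = 0; i < 80; i++)
    {
      uint32_t f, k, t;
      if (i < 20)      { f = (b & c) | (~b & d);          k = 0x5a827999; }
      else if (i < 40) { f = b ^ c ^ d;                   k = 0x6ed9eba1; }
      else if (i < 60) { f = (b & c) | (b & d) | (c & d); k = 0x8f1bbcdc; }
      else             { f = b ^ c ^ d;                   k = 0xca62c1d6; }
      t = SKETCH_ROTL32(5, a) + f + e + k + w[i];
      e = d; d = c; c = SKETCH_ROTL32(30, b); b = a; a = t;
    }
  state[0] += a; state[1] += b; state[2] += c;
  state[3] += d; state[4] += e;
}

/* Assumed signature: compress n consecutive 64-byte blocks. */
static void
sketch_sha1_compress_n(uint32_t *state, size_t n, const uint8_t *blocks)
{
  /* Working copies, mirroring the two extra stack slots: the
     incoming n and blocks are never written to. */
  size_t remaining = n;
  const uint8_t *p = blocks;

  assert(n == 0 || blocks != NULL);
  for (; remaining > 0; remaining--, p += 64)
    sketch_sha1_compress(state, p);
}
```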
regards, Mamone