Maamoun TK <maamoun.tk@googlemail.com> writes:

> I got almost 12% speedup of optimizing the sha3_permute() function using the SHA hardware accelerator of s390x, is it worth adding that assembly implementation?
For such a small assembly function, I think it's worth the effort (whether it was worth adding the special instructions for it is more questionable...).
If you have the time, you could also try out doing it with vector registers, like on x86_64 and arm/neon. Some difficulties in the x86_64 implementation were (i) xmm register shortage, (ii) moving 64-bit pieces between the 128-bit xmm registers, and (iii) rotating the 64-bit pieces of an xmm register by different shift counts.
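To make difficulty (iii) concrete, here is a minimal portable C sketch. It is not the actual x86_64 code; the vec128 type and the rotl64x2 helper are hypothetical names that only model a 128-bit register holding two 64-bit lanes. The point is that the SHA3 rotation step wants a different count for each lane (1 and 44 are two of the rotation offsets), while the plain SSE2 64-bit shifts apply one count to both lanes, so a vector implementation needs separate shifts plus some way of combining the results.

```c
#include <stdint.h>

/* Hypothetical model of a 128-bit vector register: two 64-bit lanes. */
typedef struct { uint64_t lane[2]; } vec128;

/* Rotate a 64-bit value left by n bits (n is reduced mod 64).
   The n == 0 case is handled separately to avoid a shift by 64,
   which is undefined behavior in C. */
static uint64_t
rotl64 (uint64_t x, unsigned n)
{
  n &= 63;
  return n ? (x << n) | (x >> (64 - n)) : x;
}

/* Rotate each lane by its own count. In scalar C this is trivial;
   with vector shifts that take a single count for both lanes, each
   lane needs its own shift pair, and the two results must then be
   merged back into one register. */
static vec128
rotl64x2 (vec128 v, unsigned n0, unsigned n1)
{
  vec128 r;
  r.lane[0] = rotl64 (v.lane[0], n0);
  r.lane[1] = rotl64 (v.lane[1], n1);
  return r;
}
```

For example, rotl64x2 applied to the lanes 0x8000000000000001 and 0x0123456789abcdef with counts 1 and 44 gives 0x0000000000000003 and 0xbcdef0123456789a.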
Regards,
/Niels