Maamoun TK <maamoun.tk@googlemail.com> writes:
The patches give a 41.88% speedup on arm64, a 142.95% speedup on powerpc64, and a 382.65% speedup on s390x.
OpenSSL is still ahead in terms of performance, since it uses 4-way interleaving or maybe more. Going beyond two-way interleaving has nothing to do with parallelism, since the execution units are already saturated by 2-way interleaving on all three architectures. The improvement comes from halving the number of times the reduction procedure is executed: with 4-way interleaving, the products of the state parts and the key can be combined before the reduction phase. Let me know if you are interested in doing that in nettle!
Interesting. I haven't paid much attention to the poly1305 implementation since it was added back in 2013. The C implementation doesn't try to use wider multiplication than 32x32 --> 64, which is poor for 64-bit platforms. Maybe we could use unsigned __int128 if we can write a configure test to check if it is available and likely to be efficient?
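To make the unsigned __int128 idea concrete, here is a rough sketch of a 64x64 --> 128 multiply. It is not taken from nettle: mul_64x64 and HAVE_UINT128 are hypothetical names, and the preprocessor check on __SIZEOF_INT128__ (which GCC and Clang define when the type exists) only stands in for a real configure test. It says nothing about whether the type is efficient on a given platform.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical feature macro; a real build would set this from configure. */
#if defined(__SIZEOF_INT128__)
# define HAVE_UINT128 1
#else
# define HAVE_UINT128 0
#endif

/* 64x64 --> 128 multiply. Returns the low 64 bits of a * b and
   stores the high 64 bits in *hi. */
static uint64_t
mul_64x64 (uint64_t a, uint64_t b, uint64_t *hi)
{
  assert (hi != NULL);
#if HAVE_UINT128
  unsigned __int128 p = (unsigned __int128) a * b;
  *hi = (uint64_t) (p >> 64);
  return (uint64_t) p;
#else
  /* Portable fallback built from four 32x32 --> 64 products. */
  uint64_t al = a & 0xffffffff, ah = a >> 32;
  uint64_t bl = b & 0xffffffff, bh = b >> 32;
  uint64_t ll = al * bl, lh = al * bh, hl = ah * bl, hh = ah * bh;
  /* Middle column: at most 3 * (2^32 - 1), so it cannot overflow. */
  uint64_t mid = (ll >> 32) + (lh & 0xffffffff) + (hl & 0xffffffff);
  *hi = hh + (lh >> 32) + (hl >> 32) + (mid >> 32);
  return (mid << 32) | (ll & 0xffffffff);
#endif
}
```

The fallback branch is roughly the work a 32-bit-only C implementation has to do for every wide product, which is the cost being discussed above.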
For most efficient interleaving, I take it one should precompute some powers of the key, similar to how it's done in the recent gcm code?
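As a toy illustration of that idea (not real Poly1305, and not how the gcm code is written): with r^2, r^3 and r^4 precomputed, four blocks can be folded into the state with a single reduction. The sketch below uses a small stand-in modulus, TOY_P = 2^29 - 3, chosen only so that the unreduced sum fits in 64 bits; the real modulus is 2^130 - 5. All names are hypothetical, and block values are assumed to be less than TOY_P.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy modulus, an assumption for illustration only. */
#define TOY_P ((UINT64_C(1) << 29) - 3)

/* One block at a time: h = (h + m[i]) * r mod P, one reduction per block. */
static uint64_t
toy_poly_serial (uint64_t r, const uint64_t *m, size_t n)
{
  uint64_t h = 0;
  size_t i;
  for (i = 0; i < n; i++)
    h = ((h + m[i]) * r) % TOY_P;
  return h;
}

/* Four blocks at a time: the four products are summed first and
   reduced once. Requires r < TOY_P and every m[i] < TOY_P. */
static uint64_t
toy_poly_4way (uint64_t r, const uint64_t *m, size_t n)
{
  uint64_t r2, r3, r4, h = 0;
  size_t i;

  assert (r < TOY_P);
  r2 = (r * r) % TOY_P;   /* precomputed powers of the key */
  r3 = (r2 * r) % TOY_P;
  r4 = (r3 * r) % TOY_P;

  for (i = 0; i + 4 <= n; i += 4)
    /* Each product is below 2^59, so the sum stays below 2^61. */
    h = ((h + m[i]) * r4 + m[i+1] * r3 + m[i+2] * r2 + m[i+3] * r) % TOY_P;

  /* Leftover blocks are handled serially. */
  for (; i < n; i++)
    h = ((h + m[i]) * r) % TOY_P;
  return h;
}
```

Both functions compute the same value, since expanding four serial steps gives (h + m0) r^4 + m1 r^3 + m2 r^2 + m3 r. The 4-way version just performs fewer reductions.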
It would be nice if the arm64 patch could be tested in big-endian mode, since I don't have access to any big-endian variant for testing.
Merged this one too on a branch for ci testing.
Regards, /Niels