I created merge requests that improve Poly1305 on the arm64, powerpc64, and s390x architectures by using two-way interleaving.
https://git.lysator.liu.se/nettle/nettle/-/merge_requests/38
https://git.lysator.liu.se/nettle/nettle/-/merge_requests/39
https://git.lysator.liu.se/nettle/nettle/-/merge_requests/41
The patches give a speedup of 41.88% on arm64, 142.95% on powerpc64, and 382.65% on s390x.
OpenSSL is still ahead in terms of performance since it uses 4-way interleaving, or maybe more. Going beyond two ways has nothing to do with parallelism, since the execution units are already saturated by 2-way interleaving on all three architectures. The performance improvement comes from the reduction procedure being executed half as many times with 4-way interleaving, since the products of the state parts and the key powers can be summed before the reduction phase. Let me know if you are interested in doing that in nettle!
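To make the reduction-count argument concrete, here is a minimal Python sketch of the polynomial evaluation at the core of Poly1305. It is only a model: it uses Python big integers and `%` in place of the limb-level reduction, it leaves out key clamping, block padding, and the final addition of s, and the function names are hypothetical, not taken from nettle or OpenSSL.

```python
P = (1 << 130) - 5  # the Poly1305 prime 2^130 - 5


def eval_serial(blocks, r, h=0):
    """Plain Horner evaluation: one multiply and one reduction per block."""
    reductions = 0
    for m in blocks:
        h = ((h + m) * r) % P
        reductions += 1
    return h, reductions


def eval_interleaved(blocks, r, ways, h=0):
    """Process `ways` blocks per step with a single reduction per step.

    Each step computes
        (h + m_1) * r^ways + m_2 * r^(ways-1) + ... + m_ways * r
    and reduces the sum of the products once.
    Assumes len(blocks) is a multiple of `ways`.
    """
    assert len(blocks) % ways == 0
    # Precomputed key powers: r^ways, r^(ways-1), ..., r^1
    powers = [pow(r, ways - j, P) for j in range(ways)]
    reductions = 0
    for i in range(0, len(blocks), ways):
        group = blocks[i:i + ways]
        acc = (h + group[0]) * powers[0]
        for j in range(1, ways):
            acc += group[j] * powers[j]  # combine products, no reduction yet
        h = acc % P                      # single reduction for the whole group
        reductions += 1
    return h, reductions
```

For 8 blocks, the serial version reduces 8 times, the 2-way version 4 times, and the 4-way version 2 times, while all three produce the same value modulo P.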
It would be nice if the arm64 patch could be tested in big-endian mode, since I don't have access to any big-endian variant for testing.
regards, Mamone