On Wed, Jan 19, 2022 at 10:06 PM Niels Möller <nisse@lysator.liu.se> wrote:
Maamoun TK <maamoun.tk@googlemail.com> writes:
The patches give a 41.88% speedup on arm64, 142.95% on powerpc64, and 382.65% on s390x.
OpenSSL is still ahead in performance since it uses 4-way interleaving, or maybe more! Going beyond 2-way interleaving does not buy more parallelism, because the execution units are already saturated by 2-way interleaving on all three architectures. The performance gain comes from the fact that 4-way interleaving cuts the number of times the reduction procedure runs in half: the products of the state parts and the corresponding key powers can be summed before the reduction phase. Let me know if you are interested in doing that in nettle!
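A minimal sketch of why 4-way interleaving halves the reductions compared with 2-way: with precomputed powers r^2, r^3, r^4, four blocks fold into (h + m1)r^4 + m2 r^3 + m3 r^2 + m4 r, and the four products can be summed before a single reduction. To keep it short, this uses the toy prime 2^61 - 1 in place of 2^130 - 5 and GCC/Clang's unsigned __int128; the function names (reduce61, eval_horner, eval_4way) are hypothetical and this is not nettle code.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy prime 2^61 - 1 standing in for 2^130 - 5 (assumption for brevity). */
#define P61 ((UINT64_C(1) << 61) - 1)

typedef unsigned __int128 u128;   /* GCC/Clang extension */

/* One "reduction phase": reduce x < 2^125 modulo 2^61 - 1. */
static uint64_t reduce61(u128 x)
{
  /* 2^61 = 1 (mod p), so fold the high bits onto the low bits twice. */
  x = (x & P61) + (x >> 61);          /* now x < 2^61 + 2^64 */
  x = (x & P61) + (x >> 61);          /* now x < 2^61 + 16   */
  uint64_t r = (uint64_t) x;
  return r >= P61 ? r - P61 : r;
}

/* Reference: Horner's rule, one reduction per block (hypothetical helper).
   Message blocks m[i] and key r are assumed to be below P61. */
static uint64_t eval_horner(const uint64_t *m, size_t n, uint64_t r,
                            unsigned *reductions)
{
  uint64_t h = 0;
  for (size_t i = 0; i < n; i++) {
    h = reduce61((u128) (h + m[i]) * r);
    ++*reductions;
  }
  return h;
}

/* 4-way interleaving (hypothetical helper): four products summed
   unreduced, one reduction per four blocks. Prologue reductions are
   not counted. */
static uint64_t eval_4way(const uint64_t *m, size_t n, uint64_t r,
                          unsigned *reductions)
{
  /* Prologue: precompute the key powers once. */
  uint64_t r2 = reduce61((u128) r * r);
  uint64_t r3 = reduce61((u128) r2 * r);
  uint64_t r4 = reduce61((u128) r3 * r);
  uint64_t h = 0;
  size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    /* (h + m0) r^4 + m1 r^3 + m2 r^2 + m3 r: sum stays below 2^125 */
    u128 acc = (u128) (h + m[i]) * r4 + (u128) m[i + 1] * r3
             + (u128) m[i + 2] * r2 + (u128) m[i + 3] * r;
    h = reduce61(acc);
    ++*reductions;
  }
  for (; i < n; i++) {                /* tail blocks: plain Horner */
    h = reduce61((u128) (h + m[i]) * r);
    ++*reductions;
  }
  return h;
}
```

With eight blocks, eval_horner performs 8 reductions and eval_4way performs 2 (a 2-way loop would perform 4), while both return the same value.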
Interesting. I haven't paid much attention to the poly1305 implementation since it was added back in 2013. The C implementation doesn't use any multiplication wider than 32x32 --> 64, which is a poor fit for 64-bit platforms. Maybe we could use unsigned __int128, if we can write a configure test to check whether it is available and likely to be efficient?
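One way the C code could pick the wide multiply at compile time: a minimal sketch, assuming GCC or Clang, which define __SIZEOF_INT128__ when unsigned __int128 is available. That macro check is only a cheaper stand-in for the configure test suggested above; it says nothing about whether the multiply is efficient on a given target. mul_64x64 and mul_64x64_portable are hypothetical names.

```c
#include <stdint.h>

/* Portable 64x64 -> 128-bit product from four 32x32 -> 64 multiplies
   (hypothetical helper). */
static void mul_64x64_portable(uint64_t a, uint64_t b,
                               uint64_t *hi, uint64_t *lo)
{
  uint64_t a0 = (uint32_t) a, a1 = a >> 32;
  uint64_t b0 = (uint32_t) b, b1 = b >> 32;
  uint64_t p00 = a0 * b0, p01 = a0 * b1, p10 = a1 * b0, p11 = a1 * b1;
  /* Middle column: high half of p00 plus low halves of the cross terms. */
  uint64_t mid = (p00 >> 32) + (uint32_t) p01 + (uint32_t) p10;
  *lo = (mid << 32) | (uint32_t) p00;
  *hi = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
}

/* Full product, using unsigned __int128 when the compiler provides it
   (hypothetical helper). */
static void mul_64x64(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
#if defined(__SIZEOF_INT128__)
  unsigned __int128 p = (unsigned __int128) a * b;
  *hi = (uint64_t) (p >> 64);
  *lo = (uint64_t) p;
#else
  mul_64x64_portable(a, b, hi, lo);
#endif
}
```

For example, mul_64x64(UINT64_MAX, UINT64_MAX, &hi, &lo) yields hi = 0xFFFFFFFFFFFFFFFE and lo = 1 on either path.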
Wider multiplication would improve performance with 64-bit general-purpose registers, but for SIMD implementations, as with the current one, radix 2^26 fits well.
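For reference, radix 2^26 represents each 130-bit Poly1305 value as five 26-bit limbs, so a limb product is at most 52 bits and a full row of five products (including the x5 wrap-around factor) still fits comfortably in a 64-bit lane. A minimal sketch of loading one 16-byte block into that form; load_block_26 and le32 are hypothetical helpers, and reading the bytes explicitly as little-endian keeps the split correct on big-endian hosts too.

```c
#include <stdint.h>

/* Read 4 bytes as a little-endian word, independent of host byte order. */
static uint32_t le32(const uint8_t *p)
{
  return (uint32_t) p[0] | (uint32_t) p[1] << 8
       | (uint32_t) p[2] << 16 | (uint32_t) p[3] << 24;
}

/* Split a 16-byte block (plus the optional 2^128 pad bit) into five
   26-bit limbs (hypothetical helper). */
static void load_block_26(uint32_t h[5], const uint8_t m[16], int pad)
{
  const uint32_t mask = 0x3ffffff;              /* 2^26 - 1 */
  uint32_t t0 = le32(m), t1 = le32(m + 4);
  uint32_t t2 = le32(m + 8), t3 = le32(m + 12);
  h[0] = t0 & mask;                             /* bits   0..25  */
  h[1] = ((t0 >> 26) | (t1 << 6)) & mask;       /* bits  26..51  */
  h[2] = ((t1 >> 20) | (t2 << 12)) & mask;      /* bits  52..77  */
  h[3] = ((t2 >> 14) | (t3 << 18)) & mask;      /* bits  78..103 */
  h[4] = (t3 >> 8) | (pad ? UINT32_C(1) << 24 : 0); /* bits 104..129 */
}
```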
For most efficient interleaving, I take it one should precompute some powers of the key, similar to how it's done in the recent gcm code?
Since the block-iteration loop is moved inside the assembly implementation, computing one extra power of the key (r^2) in the function prologue should be OK.
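A sketch of that prologue step in radix 2^26: r^2 is computed once by a 5x5 schoolbook limb multiply modulo 2^130 - 5, where terms that land at limb i + 5 wrap to limb i multiplied by 5 (because 2^130 = 5 mod p), followed by a carry pass. mul_mod_p_26 is a hypothetical helper in plain C, not code from the patches, and it assumes input limbs below 2^27.

```c
#include <stdint.h>

/* h = a * b mod 2^130 - 5, five 26-bit limbs each (hypothetical helper).
   Assumes input limbs < 2^27; output limbs are only partially reduced. */
static void mul_mod_p_26(uint32_t h[5], const uint32_t a[5], const uint32_t b[5])
{
  const uint64_t mask = 0x3ffffff;
  uint64_t a0 = a[0], a1 = a[1], a2 = a[2], a3 = a[3], a4 = a[4];
  uint64_t b0 = b[0], b1 = b[1], b2 = b[2], b3 = b[3], b4 = b[4];
  /* Pre-multiplied limbs for the wrapped terms (2^130 = 5 mod p). */
  uint64_t s1 = 5 * b1, s2 = 5 * b2, s3 = 5 * b3, s4 = 5 * b4;

  uint64_t d0 = a0*b0 + a1*s4 + a2*s3 + a3*s2 + a4*s1;
  uint64_t d1 = a0*b1 + a1*b0 + a2*s4 + a3*s3 + a4*s2;
  uint64_t d2 = a0*b2 + a1*b1 + a2*b0 + a3*s4 + a4*s3;
  uint64_t d3 = a0*b3 + a1*b2 + a2*b1 + a3*b0 + a4*s4;
  uint64_t d4 = a0*b4 + a1*b3 + a2*b2 + a3*b1 + a4*b0;

  /* Carry pass back down to roughly 26 bits per limb. */
  d1 += d0 >> 26; d0 &= mask;
  d2 += d1 >> 26; d1 &= mask;
  d3 += d2 >> 26; d2 &= mask;
  d4 += d3 >> 26; d3 &= mask;
  d0 += (d4 >> 26) * 5; d4 &= mask;
  d1 += d0 >> 26; d0 &= mask;

  h[0] = (uint32_t) d0; h[1] = (uint32_t) d1; h[2] = (uint32_t) d2;
  h[3] = (uint32_t) d3; h[4] = (uint32_t) d4;
}
```

In the prologue this would be called once as mul_mod_p_26(r2, r, r), and the loop then multiplies the two interleaved states by r2 and r.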
I forgot to mention that the reduction phase in the arm64 and s390x implementations follows the tips in the Reduction section of https://cryptojedi.org/papers/neoncrypto-20120320.pdf, while on powerpc64 the serial carry chain h0 -> h1 -> h2 -> h3 -> h4 -> h0 -> h1 still achieves slightly higher performance than the two independent carry paths.
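For comparison, a sketch of the two carry strategies on five 26-bit limbs held in 64-bit words, modulo 2^130 - 5. The two-path ordering (h0 -> h1 with h3 -> h4, then h1 -> h2 with h4 -> h0, then h2 -> h3 with h0 -> h1, then h3 -> h4) is modeled on my reading of the Reduction section of the paper linked above; carry_serial and carry_two_paths are hypothetical names, and this is plain C rather than the SIMD code from the patches.

```c
#include <stdint.h>

#define MASK26 UINT64_C(0x3ffffff)

/* Serial chain h0->h1->h2->h3->h4->h0->h1: every step waits on the
   previous one (hypothetical helper). */
static void carry_serial(uint64_t h[5])
{
  h[1] += h[0] >> 26; h[0] &= MASK26;
  h[2] += h[1] >> 26; h[1] &= MASK26;
  h[3] += h[2] >> 26; h[2] &= MASK26;
  h[4] += h[3] >> 26; h[3] &= MASK26;
  h[0] += (h[4] >> 26) * 5; h[4] &= MASK26;   /* 2^130 = 5 mod p */
  h[1] += h[0] >> 26; h[0] &= MASK26;
}

/* Two interleaved carry paths: the two carries in each round are
   independent, so they can issue in parallel (hypothetical helper). */
static void carry_two_paths(uint64_t h[5])
{
  uint64_t ca, cb;
  ca = h[0] >> 26; h[0] &= MASK26; cb = h[3] >> 26; h[3] &= MASK26;
  h[1] += ca; h[4] += cb;                      /* h0->h1, h3->h4 */
  ca = h[1] >> 26; h[1] &= MASK26; cb = h[4] >> 26; h[4] &= MASK26;
  h[2] += ca; h[0] += cb * 5;                  /* h1->h2, h4->h0 */
  ca = h[2] >> 26; h[2] &= MASK26; cb = h[0] >> 26; h[0] &= MASK26;
  h[3] += ca; h[1] += cb;                      /* h2->h3, h0->h1 */
  ca = h[3] >> 26; h[3] &= MASK26;
  h[4] += ca;                                  /* h3->h4 */
}
```

The serial chain has six dependent steps, while the two-path version has four rounds; which one wins depends on the target's latencies, as the powerpc64 result above shows.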
regards, Mamone
It would be nice if the arm64 patch could be tested in big-endian mode, since I don't have access to any big-endian system for testing.
Merged this one too, on a branch for CI testing.
Regards, /Niels
-- Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677. Internet email is subject to wholesale government surveillance.