Re: [Arm64, PowerPC64, S390x] Optimize Poly1305

27 Jan 2022


      nisse@lysator.liu.se (Niels Möller) writes:
...
...
Radix 64: 2.75 GByte/s, i.e., faster than current x86_64 asm version.
And I've now tried the same method for the x86_64 implementation. See
attached file + needed patch to asm.m4. This gives 2.9 GByte/s.
I'm not entirely sure the cycle numbers are accurate, since the clock
frequency is not fixed. I think the machine runs benchmarks at 2.1 GHz, and
then this corresponds to 11.5 cycles per block, 0.7 cycles per byte, 4
instructions per cycle, 0.5 multiply instructions per cycle.
This laptop has an AMD Zen 2 processor, which should be capable of
issuing four instructions per cycle and completing one multiply
instruction per cycle (according to
https://gmplib.org/~tege/x86-timing.pdf).
This seems to indicate that on this hardware, speed is not limited by
multiplier throughput; instead, the bottleneck is instruction
decoding/issuing, with max four instructions per cycle.
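
The arithmetic behind these figures can be sketched as follows. This is
only a back-of-the-envelope check, assuming GByte means 10**9 bytes, a
16-byte Poly1305 block, and the estimated (not fixed) 2.1 GHz clock; all
variable names are made up for illustration.

```python
# Figures quoted in the message; GByte is assumed to mean 10**9 bytes.
CLOCK_HZ = 2.1e9      # estimated benchmark clock, not a fixed frequency
THROUGHPUT = 2.9e9    # bytes per second measured for the new x86_64 code
BLOCK_SIZE = 16       # Poly1305 block size in bytes


def cycles_per_byte(clock_hz, bytes_per_second):
    """Cycles spent per processed byte at the given clock and throughput."""
    return clock_hz / bytes_per_second


cpb = cycles_per_byte(CLOCK_HZ, THROUGHPUT)
cycles_per_block = cpb * BLOCK_SIZE

# Rates quoted in the message, compared with the stated Zen 2 limits
# (four instructions issued and one multiply completed per cycle).
ipc, mul_per_cycle = 4.0, 0.5
issue_limit, mul_limit = 4.0, 1.0
issue_utilization = ipc / issue_limit          # fraction of issue width used
mul_utilization = mul_per_cycle / mul_limit    # fraction of multiplier used

print(f"{cpb:.2f} cycles/byte, {cycles_per_block:.1f} cycles/block")
print(f"issue: {issue_utilization:.0%}, multiplier: {mul_utilization:.0%}")
```

With these inputs the issue width comes out fully used while the
multiplier is only half used, which is the basis for the conclusion that
decoding/issuing, not multiplication, is the limit.
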
Benchmarked also on my other nearby x86_64 machine (Intel Broadwell
processor). It's faster there too (from 1.4 GByte/s to 1.75 GByte/s). I'd
expect it to be generally faster, and have pushed it to the master-updates
branch.
I haven't looked that carefully at what the old code was doing, but I
think the final folding for each block used a multiply instruction that
then depends on the previous ones for that block, increasing the
per-block latency. With the new code, all multiplies done for a block are
independent of each other.
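
To illustrate what "independent multiplies" can mean for a radix 2^64
representation, here is a minimal Python model. It is not the attached
assembly and makes no claim about how that code is organized; the names
(block_step, s1, d0..d2) are hypothetical. It assumes the usual Poly1305
key clamping, which makes the high key limb r1 a multiple of 4, so that
s1 = 5 * (r1 / 4) can be precomputed once per key and 2^130 = 5 (mod p)
can be applied without an extra multiply on the critical path.

```python
P = (1 << 130) - 5
MASK64 = (1 << 64) - 1


def clamp(r):
    """Standard Poly1305 key clamping; leaves the high limb divisible by 4."""
    return r & 0x0ffffffc0ffffffc0ffffffc0fffffff


def block_step(h, r, m):
    """Return a value congruent to (h + m) * r modulo 2^130 - 5."""
    r0, r1 = r & MASK64, r >> 64
    s1 = 5 * (r1 >> 2)          # per-key constant, not computed per block
    h += m
    h0, h1, h2 = h & MASK64, (h >> 64) & MASK64, h >> 128
    # Six products; every operand is a limb of h or a key constant, so no
    # product needs the result of another product from the same block.
    d0 = h0 * r0 + h1 * s1
    d1 = h0 * r1 + h1 * r0 + h2 * s1
    d2 = h2 * r0
    t = d0 + (d1 << 64) + (d2 << 128)
    # Fold the bits above 2^130 using 2^130 = 5 (mod p), with shift and add
    # in place of a dependent multiply. The result is only partially reduced.
    lo, hi = t & ((1 << 130) - 1), t >> 130
    return lo + hi + (hi << 2)


r = clamp(int.from_bytes(bytes(range(16)), "little"))
m = int.from_bytes(b"Cryptographic Fo", "little") + (1 << 128)
h = 0
for _ in range(3):
    expected = ((h + m) * r) % P
    h = block_step(h, r, m)
    assert h % P == expected
```

The loop at the end checks the limb arithmetic against plain modular
multiplication for a few steps with an arbitrary key and block.
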
Regards,
/Niels
-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.
