Re: [Arm64, PowerPC64, S390x] Optimize Poly1305

24 Jan 2022


      Maamoun TK maamoun.tk@googlemail.com writes:
...
> I made a performance test of this patch on the available architectures I
> have access to.
>
> Arm64 (gcc117 gfarm):
>
> Radix 26: 0.65 GByte/s
> Radix 26 (2-way interleaved): 0.92 GByte/s
> Radix 32: 0.55 GByte/s
> Radix 64: 0.58 GByte/s
>
> POWER9:
>
> Radix 26: 0.47 GByte/s
> Radix 26 (2-way interleaved): 1.15 GByte/s
> Radix 32: 0.52 GByte/s
> Radix 64: 0.58 GByte/s
>
> Z15:
>
> Radix 26: 0.65 GByte/s
> Radix 26 (2-way interleaved): 3.17 GByte/s
> Radix 32: 0.82 GByte/s
> Radix 64: 1.22 GByte/s

Interesting. I'm a bit surprised the radix-64 doesn't perform better, in
particular on arm64. (But I'm not yet familiar with arm64 multiply
instructions).
Numbers for 2-way interleaving are impressive, I'd like to understand
how that works. It might be useful to derive the corresponding multiply
throughput, i.e., the number of multiply operations (and with which
multiply instruction) completed per cycle, as well as total cycles per
block.
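
For reference, one common way to interleave a polynomial MAC two ways
is to rewrite two Horner steps as h' = (h + m0) * r^2 + m1 * r, so the
two multiplies no longer depend on each other. Whether the patch does
exactly this is an assumption, since its code is not quoted here. The
sketch below checks the identity with a toy modulus 2^31 - 1 (chosen
only so that products fit in uint64_t; Poly1305 itself works modulo
2^130 - 5), and all names are hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

/* Toy modulus, NOT the Poly1305 prime: small enough that the product
   of two reduced values fits in 64 bits. */
#define TOY_P 2147483647ULL

static uint64_t
toy_mul (uint64_t a, uint64_t b)
{
  return (a % TOY_P) * (b % TOY_P) % TOY_P;
}

/* Plain Horner evaluation: every multiply waits for the previous one. */
uint64_t
toy_horner (const uint32_t *m, size_t n, uint64_t r)
{
  uint64_t h = 0;
  for (size_t i = 0; i < n; i++)
    h = toy_mul (h + m[i], r);
  return h;
}

/* Two blocks per iteration (n assumed even). The two multiplies in the
   loop body are independent, so they can overlap or share SIMD lanes. */
uint64_t
toy_horner_2way (const uint32_t *m, size_t n, uint64_t r)
{
  uint64_t r2 = toy_mul (r, r);
  uint64_t h = 0;
  for (size_t i = 0; i < n; i += 2)
    {
      uint64_t a = toy_mul (h + m[i], r2);
      uint64_t b = toy_mul (m[i + 1], r);
      h = (a + b) % TOY_P;
    }
  return h;
}
```

Both functions return the same value for any even number of blocks;
only the dependency chain per block differs.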
It looks like the folding done per-block in the radix-64 code costs at
least 5 or so cycles per block (since these operations are all
dependent, and we also have the multiply by 5 in there, probably adding
a few cycles more). Maybe at least the multiply can be postponed.
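
To make the dependency chain concrete, here is a minimal sketch of a
per-block fold in a radix-64 representation. It assumes the state is
three 64-bit limbs h[0], h[1], h[2] with h[2] holding bits 128 and up,
and that h[2] is small enough for the multiply by 5 not to overflow.
The function name and layout are hypothetical and not taken from the
patch.

```c
#include <stdint.h>

/* Partially reduce h modulo 2^130 - 5: bits at 2^130 and above are
   folded back into the low limb, using 2^130 = 5 (mod 2^130 - 5).
   Each step consumes the result of the previous one. */
void
demo_fold130 (uint64_t h[3])
{
  uint64_t c = (h[2] >> 2) * 5;   /* shift, then multiply by 5 */
  h[2] &= 3;                      /* keep bits 128 and 129 */

  h[0] += c;                      /* add to the low limb ... */
  c = h[0] < c;                   /* ... and recover the carry */
  h[1] += c;
  c = h[1] < c;
  h[2] += c;                      /* result may still need a final reduction */
}
```

The shift, the multiply by 5, and the three carry-dependent additions
form a single serial chain, which is the kind of per-block cost
discussed above.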
...
> I tried to compile the new code with -m32 flag on x86_64 but I got
> "poly1305-internal.c:46:18: error: ‘__int128’ is not supported on this
> target".

That's expected, in two ways: I don't expect radix-64 to give any
performance gain over radix-32 on any 32-bit archs. And I think __int128
is supported only on archs where it fits in two registers. If we start
using __int128 we need a configure test for it, and then it actually
makes things simpler, at least for this use case, if it stays
unsupported on 32-bit archs where it shouldn't be used.
So to compile with -m32, the radix-64 code must be #if:ed out.
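
A minimal sketch of such a guard, assuming a GCC or Clang compiler:
both predefine __SIZEOF_INT128__ when __int128 is available. The
DEMO_* names are hypothetical stand-ins for whatever a real configure
test would define; they are not existing Nettle symbols.

```c
#include <stdint.h>

/* Stand-in for a configure result; a real build would get this from
   config.h rather than from a compiler-specific macro. */
#if defined(__SIZEOF_INT128__)
# define DEMO_HAVE_INT128 1
#else
# define DEMO_HAVE_INT128 0
#endif

#if DEMO_HAVE_INT128
/* Only here is the 128-bit type ever mentioned. */
typedef unsigned __int128 demo_u128;
# define DEMO_POLY1305_RADIX 64
#else
/* 32-bit targets (e.g. -m32) never see __int128 at all. */
# define DEMO_POLY1305_RADIX 32
#endif

/* Report which variant was selected at compile time. */
int
demo_poly1305_radix (void)
{
  return DEMO_POLY1305_RADIX;
}
```

With this shape, a -m32 build selects the radix-32 path without the
compiler ever parsing a use of __int128.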
...
> Also, I've disassembled the update function of Radix 64 and none of the
> architectures has made use of SIMD support (including x86_64 that hasn't
> used XMM registers which is standard for this arch, I don't know if gcc
> supports such behavior for C compiling but I'm aware that MSVC takes
> advantage of that standardization for further optimization on compiled C
> code).

The radix-64 code really wants multiply instruction(s) for 64x64 -->
128, either as a single instruction or as a pair of mulhigh/mullow
instructions, and I think that's not so common in SIMD instruction sets
(but powerpc64 vmsumudm looks potentially useful?). It also needs some
not too complicated way to do a 128-bit add with proper carry
propagation in the middle.
Arm32 neon does have 32x32 --> 64, which looks like a good fit for the
radix-32 variant.
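
As an illustration of the two primitives mentioned above, here is a
portable scalar sketch: a 64x64 --> 128 multiply assembled from four
32x32 --> 64 partial products (the operation a widening 32-bit multiply
provides directly), and a 128-bit add that propagates the carry from
the low half into the high half. Function names are hypothetical, and
this is plain C rather than any particular SIMD instruction set.

```c
#include <stdint.h>

/* Full 64x64 -> 128 product from four 32x32 -> 64 partial products. */
void
demo_umul128 (uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo)
{
  uint64_t a0 = (uint32_t) a, a1 = a >> 32;
  uint64_t b0 = (uint32_t) b, b1 = b >> 32;

  uint64_t p00 = a0 * b0;
  uint64_t p01 = a0 * b1;
  uint64_t p10 = a1 * b0;
  uint64_t p11 = a1 * b1;

  /* Sum of three values below 2^32 each, so it cannot overflow. */
  uint64_t mid = (p00 >> 32) + (uint32_t) p01 + (uint32_t) p10;

  *lo = (mid << 32) | (uint32_t) p00;
  *hi = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
}

/* (hi:lo) += (bhi:blo), with the carry out of the low half added
   into the high half. */
void
demo_add128 (uint64_t *hi, uint64_t *lo, uint64_t bhi, uint64_t blo)
{
  *lo += blo;
  *hi += bhi + (*lo < blo);
}
```

The radix-64 variant needs both of these per limb product, while a
radix-32 variant only needs the individual 32x32 --> 64 products and
64-bit accumulation.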
Regards,
/Niels
-- 
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.
