Maamoun TK <maamoun.tk@googlemail.com> writes:
I ran a performance test of this patch on the architectures I have access to.
Arm64 (gcc117 gfarm):
- Radix 26: 0.65 GByte/s
- Radix 26 (2-way interleaved): 0.92 GByte/s
- Radix 32: 0.55 GByte/s
- Radix 64: 0.58 GByte/s
POWER9:
- Radix 26: 0.47 GByte/s
- Radix 26 (2-way interleaved): 1.15 GByte/s
- Radix 32: 0.52 GByte/s
- Radix 64: 0.58 GByte/s
Z15:
- Radix 26: 0.65 GByte/s
- Radix 26 (2-way interleaved): 3.17 GByte/s
- Radix 32: 0.82 GByte/s
- Radix 64: 1.22 GByte/s
Interesting. I'm a bit surprised the radix-64 doesn't perform better, in particular on arm64. (But I'm not yet familiar with arm64 multiply instructions).
Numbers for 2-way interleaving are impressive; I'd like to understand how that works. It might be useful to derive the corresponding multiply throughput, i.e., the number of multiply operations (and with which multiply instruction) completed per cycle, as well as total cycles per block.
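As a rough sketch of that derivation: a Poly1305 block is 16 bytes, so cycles per block follow from the measured throughput once the clock rate is known. The clock rate is not given in the numbers above, so it is a parameter here, as is the number of multiplies per block. The function names are made up for illustration, and GByte is assumed to mean 10^9 bytes.

```c
#define POLY1305_BLOCK_SIZE 16

/* Cycles spent per 16-byte block, given a measured throughput in
   GByte/s (assumed to be 10^9 bytes/s) and the clock rate in GHz.
   The clock rate must be supplied by whoever ran the benchmark. */
static double
cycles_per_block (double gbyte_per_s, double clock_ghz)
{
  return POLY1305_BLOCK_SIZE * clock_ghz / gbyte_per_s;
}

/* Multiply instructions completed per cycle, given how many
   multiplies the inner loop issues per block (an input that depends
   on the radix and on the compiler output) and the cycles per block. */
static double
multiplies_per_cycle (double muls_per_block, double cycles)
{
  return muls_per_block / cycles;
}
```

For example, with a purely hypothetical 3.0 GHz clock, cycles_per_block (0.58, 3.0) comes out at a bit under 83 cycles per block.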
It looks like the folding done per-block in the radix-64 code costs at least 5 or so cycles per block (since these operations are all dependent, and we also have the multiply by 5 in there, probably adding a few cycles more). Maybe at least the multiply can be postponed.
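To make the dependency chain concrete, here is a generic sketch of such a fold for a three-limb radix-64 state h = h[0] + 2^64 h[1] + 2^128 h[2], using 2^130 = 5 (mod 2^130 - 5). It is not the code from the patch; the function name and the assumption that h[2] is small are illustrative only.

```c
#include <stdint.h>

/* Partially reduce h modulo p = 2^130 - 5. Bits at position 130 and
   above are multiplied by 5 and added back at the bottom. Assumes
   h[2] is small enough that 5 * (h[2] >> 2) fits in 64 bits.
   Every step depends on the result of the previous one. */
static void
fold_radix64 (uint64_t h[3])
{
  uint64_t c = h[2] >> 2;       /* part of h at 2^130 and above */
  uint64_t t = 5 * c;           /* the multiply by 5 */
  uint64_t carry;

  h[2] &= 3;
  h[0] += t;
  carry = h[0] < t;             /* carry out of the low limb */
  h[1] += carry;
  carry = h[1] < carry;         /* carry out of the middle limb */
  h[2] += carry;
}
```

The result is congruent to the input but not fully reduced; for instance, the value 2^131 - 1 folds to 2^130 + 4.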
I tried to compile the new code with -m32 flag on x86_64 but I got "poly1305-internal.c:46:18: error: ‘__int128’ is not supported on this target".
That's expected, in two ways: I don't expect radix-64 to give any performance gain over radix-32 on any 32-bit archs. And I think __int128 is supported only on archs where it fits in two registers. If we start using __int128 we need a configure test for it, and then it actually makes things simpler, at least for this use case, if it stays unsupported on 32-bit archs where it shouldn't be used.
So to compile with -m32, the radix-64 code must be #if:ed out.
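A minimal sketch of such a guard, assuming the predefined macro __SIZEOF_INT128__ (which GCC and Clang define on targets where __int128 is available) stands in for the result of a configure test. The names POLY1305_RADIX, limb_t, wide_t and mul_limbs are hypothetical and not taken from the patch.

```c
#include <stdint.h>

/* Select the radix at compile time. A real build would more likely
   use a macro set by configure; __SIZEOF_INT128__ is used here only
   as a stand-in for that test. */
#if defined(__SIZEOF_INT128__)
# define POLY1305_RADIX 64
typedef uint64_t limb_t;
typedef unsigned __int128 wide_t;
#else
# define POLY1305_RADIX 32
typedef uint32_t limb_t;
typedef uint64_t wide_t;
#endif

/* Full double-width product of two limbs. With -m32 on x86_64 this
   compiles as a 32x32 --> 64 multiply and never mentions __int128. */
static wide_t
mul_limbs (limb_t a, limb_t b)
{
  return (wide_t) a * b;
}
```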
Also, I've disassembled the update function of Radix 64, and none of the architectures made use of SIMD support. That includes x86_64, which didn't use the XMM registers even though they are standard for this arch. I don't know if gcc supports such behavior when compiling C, but I'm aware that MSVC takes advantage of that standardization for further optimization of compiled C code.
The radix-64 code really wants multiply instruction(s) for 64x64 --> 128, either as a single instruction or as a pair of mulhigh/mullow instructions, and I think that's not so common in SIMD instruction sets (but powerpc64 vmsumudm looks potentially useful?). It also needs some not too complicated way to do a 128-bit add with proper carry propagation in the middle.
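For reference, the two primitives can be written in portable C without __int128 as below. This is only a sketch of the operations themselves, showing what a mulhigh/mullow pair and a carry-propagating 128-bit add have to compute; the type and function names are made up, and nothing here is SIMD code.

```c
#include <stdint.h>

/* A 128-bit value as two 64-bit halves (hypothetical helper type). */
typedef struct { uint64_t lo, hi; } u128;

/* Low 64 bits of a 64x64 product ("mullow"). */
static uint64_t
mul_lo (uint64_t a, uint64_t b)
{
  return a * b;
}

/* High 64 bits of a 64x64 product ("mulhigh"), built from four
   32x32 --> 64 partial products. */
static uint64_t
mul_hi (uint64_t a, uint64_t b)
{
  uint64_t a0 = a & 0xffffffff, a1 = a >> 32;
  uint64_t b0 = b & 0xffffffff, b1 = b >> 32;
  uint64_t p00 = a0 * b0, p01 = a0 * b1;
  uint64_t p10 = a1 * b0, p11 = a1 * b1;
  /* Sum of the contributions at bit 32; cannot overflow 64 bits. */
  uint64_t mid = (p00 >> 32) + (p01 & 0xffffffff) + (p10 & 0xffffffff);

  return p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
}

/* 128-bit add, propagating the carry from the low half into the
   high half; a carry out of the high half is discarded. */
static u128
add128 (u128 x, u128 y)
{
  u128 r;
  r.lo = x.lo + y.lo;
  r.hi = x.hi + y.hi + (r.lo < x.lo);
  return r;
}
```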
Arm32 neon does have 32x32 --> 64, which looks like a good fit for the radix-32 variant.
Regards, /Niels