On Sun, Jan 23, 2022 at 4:41 PM Maamoun TK maamoun.tk@googlemail.com wrote:
On Sun, Jan 23, 2022 at 9:10 PM Niels Möller nisse@lysator.liu.se wrote:
nisse@lysator.liu.se (Niels Möller) writes:
The current C implementation uses radix 26, and 25 multiplies (32x32 --> 64) per block. And quite a lot of shifts. A radix 32 variant analogous to the above would need 16 long multiplies and 4 short. I'd expect that to be faster on most machines, but I'd have to try that out.
I've tried this out, see attached file. It has an #if 0/1 to choose between radix 64 (depending on the non-standard __int128 type for accumulated products) and radix 32 (portable C).
This is the speed I get for C implementations of poly1305_update on my x86_64 laptop:
Radix 26: 1.2 GByte/s (old code)
Radix 32: 1.3 GByte/s
Radix 64: 2.2 GByte/s
It would be interesting with benchmarks on actual 32-bit hardware, 32-bit ARM likely being the most relevant arch.
For comparison, the current x86_64 asm version: 2.5 GByte/s.
I made a performance test of this patch on the available architectures I have access to.
Arm64 (gcc117 gfarm):
- Radix 26: 0.65 GByte/s
- Radix 26 (2-way interleaved): 0.92 GByte/s
- Radix 32: 0.55 GByte/s
- Radix 64: 0.58 GByte/s
POWER9:
- Radix 26: 0.47 GByte/s
- Radix 26 (2-way interleaved): 1.15 GByte/s
- Radix 32: 0.52 GByte/s
- Radix 64: 0.58 GByte/s
Z15:
- Radix 26: 0.65 GByte/s
- Radix 26 (2-way interleaved): 3.17 GByte/s
- Radix 32: 0.82 GByte/s
- Radix 64: 1.22 GByte/s
Apparently, the higher radix version has performance improvements on x86_64, powerpc, and s390x but this is not the case for arm64 arch where the performance has a slight hit there.
That might be an artifact of the specific ARM processor microarchitecture or the memory subsystem of the ARM system, not inherent to the Arm AArch64 architecture and ISA.
- David
I tried to compile the new code with -m32 flag on x86_64 but I got "poly1305-internal.c:46:18: error: ‘__int128’ is not supported on this target". Unfortunately, I don't have access to arm 32-bit too.
Also, I've disassembled the update function of Radix 64 and none of the architectures has made use of SIMD support (including x86_64 that hasn't used XMM registers which is standard for this arch, I don't know if gcc supports such behavior for C compiling but I'm aware that MSVC takes advantage of that standardization for further optimization on compiled C code).
I'm trying to implement the radix 64 using SIMD to see if we can get any performance boost, I'll post the result once I get done with it.
regards, Mamone
If I understood correctly, the suggestion to use radix 26 in djb's original paper was motivated by a high-speed implementation using floating point arithmetic (possibly in combination with SIMD), where the product of two 26-bit integers can be represented exactly in an IEEE double (but it gets a bit subtle if we want to accumulate several products), I haven't really looked into implementing poly1305 with either floating point or SIMD.
To improve test coverage, I've also extended poly1305 tests with tests on random inputs, with results compared to a reference implementation based on gmp/mini-gmp. I intend to merge those testing changes soon. See
https://gitlab.com/gnutls/nettle/-/commit/b48217c8058676c8cd2fd12cdeba457755... .
Unfortunately, the http interface of the main git repo at Lysator is inaccessible at the moment due to an expired certificate; should be fixed in a day or two.
Regards, /Niels
-- Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677. Internet email is subject to wholesale government surveillance.
nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs