nisse@lysator.liu.se (Niels Möller) writes:
The current C implementation uses radix 26, and 25 multiplies (32x32 --> 64) per block. And quite a lot of shifts. A radix 32 variant analogous to the above would need 16 long multiplies and 4 short. I'd expect that to be faster on most machines, but I'd have to try that out.
I've tried this out, see attached file. It has an #if 0/1 to choose between radix 64 (depending on the non-standard __int128 type for accumulated products) and radix 32 (portable C).
This is the speed I get for C implementations of poly1305_update on my x86_64 laptop:
* Radix 26: 1.2 GByte/s (old code)
* Radix 32: 1.3 GByte/s
* Radix 64: 2.2 GByte/s
It would be interesting with benchmarks on actual 32-bit hardware, 32-bit ARM likely being the most relevant arch.
For comparison, the current x86_64 asm version: 2.5 GByte/s.
If I understood correctly, the suggestion to use radix 26 in djb's original paper was motivated by a high-speed implementation using floating point arithmetic (possibly in combination with SIMD), where the product of two 26-bit integers can be represented exactly in an IEEE double (but it gets a bit subtle if we want to accumulate several products), I haven't really looked into implementing poly1305 with either floating point or SIMD.
To improve test coverage, I've also extended poly1305 tests with tests on random inputs, with results compared to a reference implementation based on gmp/mini-gmp. I intend to merge those testing changes soon. See https://gitlab.com/gnutls/nettle/-/commit/b48217c8058676c8cd2fd12cdeba457755....
Unfortunately, the http interface of the main git repo at Lysator is inaccessible at the moment due to an expired certificate; should be fixed in a day or two.
Regards, /Niels