nisse@lysator.liu.se (Niels Möller) writes:
This is the speed I get for C implementations of poly1305_update on my x86_64 laptop:
Radix 26: 1.2 GByte/s (old code)
Radix 32: 1.3 GByte/s
Radix 64: 2.2 GByte/s
It would be interesting with benchmarks on actual 32-bit hardware, 32-bit ARM likely being the most relevant arch.
For comparison, the current x86_64 asm version: 2.5 GByte/s.
I've tried reworking folding, to reduce latency. Idea is to let the most significant state word be close to a word, rather than limited to <= 4 as in the previous version. When multiplying by r, split one of the multiplies to take out the low 2 bits. For the radix 64 version, that term is
B^2 t_2 * r0
Split t_2 as 4*hi + lo, then this can be reduced to
B^2 lo * r0 + hi * 5*r0
(Using the same old B^2 = 5/4 (mod p) in a slightly different way).
The 5*r0 fits one word and can be precomputed, and then this multiplication goes in parallell with the other multiplies, and no multiply left in the final per-block folding. With this trick I get on the same machine
Radix 32: 1.65 GByte/s
Radix 64: 2.75 GByte/s, i.e., faster than current x86_64 asm version.
I haven't yet done a strict analysis of bounds on the state and temporaries, but I would expect that it works out with no possibility of overflow.
See attached file. To fit the precomputed 5*r0 in a nice way I had to rearrange the unions in struct poly1305_ctx a bit, I also attach the patch to do this. Size of the struct should be the same, so I think it can be done without any abi bump.
Regards, /Niels