Poly1305 based on radix 2^44 for S390x and PowerPC - nettle-bugs

23 Feb 2022


      POWER9 and Z14 have introduced ' vmsumudm' and 'vmslg' instructions
respectively for full 64-bit vector multiplication. This note demonstrates
implementing Poly1305 based on radix 2^44 for optimal architecture
utilization.
In radix B = 2^44 we have the state and key as follows
H = B^2 H_2 + B H_1 + H_0
R = B^2 R_2 + B R_1 + R_0
Where degrees of H_2 and R_2 are 42 and 36 respectively whereas other
coefficients are of degree 44
We have B^3 = 20 so we can pre-compute R_1' = 20 * R_1 and R_2' = 20 * R_2
Now to multiply H by R we can write
R H = B^2(R_2 H_0 + R_1 H_1 + R_0 H_2) + B(R_1 H_0 + R_0 H_1 + R_2' H_2)
                  ____________T2__________/
____________T1__________/
+ R_0 H_0 + R_2' H_1 + R_1' H_2
   ____________T0___________/
T_0 < 2^88 + 2^85 + 2^91 = 2^93
T_1 < 2^88 + 2^88 + 2^83 = 2^90
T_2 < 2^80 + 2^88 + 2^86 = 2^90
This is a good place to combine the interleaved multiplication state
products of next consecutive blocks (In this case we interleave four-block
multiplications by per-computing powers of key R) which rises up the degree
of product parts.
T_0 <  2^93 + 3*2^93 = 2^96
T_1 <  2^90 + 3*2^91 = 2^94
T_2 <  2^90 + 3*2^90 = 2^93
Now let's reduce this product to 2^130 divided to 2^44, 2^44, 2^42 for each
part.
T_0 ------------> T_1 ---------------> T_2 -------------------> T_0
---------------> T_1
        (96-44)             (94-44+1)             (93-42+1+3)
(55-44+1)
This chain keeps the carry addition for the last part (2^44+2^12) less than
2^44+2^22.
Note the carry from T_2 would be multiplied by 5 before adding it to T_0
While the sequential carry addition achieves decent performance speed on
PowerPC arch, an interleaved carry handling would get more speed up for
s390x so let's figure an interleaved variant of carry addition.
*----------------------------------------------------------------------------------------------*
|            Phase 1           |             Phase 2             |
  Phase 3           |
|------------------------------|---------------------------------|-------------------------------|
|   T_1 ------------> T_2  |  T_2 -----------------> T_0 |  T_0
--------------> T_1 |
|            (94-44)            |          (93-42+1+3)          |
(55-44+1)           |
|------------------------------|---------------------------------|-------------------------------|
|   T_0 -------------> T_1 |  T_1 --------------> T_2    |
                    |
|            (96-44)            |          (52-44+1)
|                                     |
*-----------------------------------------------------------------------------------------------*
This variant implies 3 sequential phases rather than four. Moreover, the
long carry path from T_2 -> T_0 would've executed in parallel with another
carry path.
Patches of implementation based on radix 2^44 for both architectures are
submitted to nettle with fat build support.
https://git.lysator.liu.se/nettle/nettle/-/merge_requests/39
https://git.lysator.liu.se/nettle/nettle/-/merge_requests/41
Benchmark the implementations on POWER9 and Z15
*---------------------------------------------------------------------------------------------*
|    Arch           |   Nettle patch (radix 2^44)    |   OpenSSL (radix
2^26)   |
|-------------------|--------------------------------------|-----------------------------------|
|   PowerPC    |    0.972 cycles/byte              |    1.453 cycles/byte
       |
|-------------------|--------------------------------------|-----------------------------------|
|   IBMz           |    0.840 cycles/byte              |    0.936
cycles/byte          |
*---------------------------------------------------------------------------------------------*
This note can also be applied on x86_64 arch with 'AVX512VL' and
'AVX512_IFMA' extension support by utilizing 'VPMADD52HUQ' and
'VPMADD52LUQ' instructions of full 52-bit multiplication.
regards,
Mamone