Hello,
Thank you for the suggestions and sorry for the shameless delay.
nisse@lysator.liu.se (Niels Möller) writes:
Also, optimized implementation of modular reduction is currently missing, which is beyond my expertise. I would appreciate any suggestions regarding that.
If we do Euclidean reduction, we should use the property that
2^448 = 2^224 + 1 (mod)
And we'd need to use this twice to reduce a 896-bit product to 448 bits. On 64-bit machines, we'll get some shifting since 224 isn't a multiple of 64.
Due to my ignorance, I probably don't get what you mean, but as far as I read the implementations of other curves in Nettle, some of them seem to use the property of generalized Mersenne numbers described in: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.46.2133 and it's also applicable to the Curve448 prime.
So a 896-bit product here:
a0 + a1 * 2^k + a2 * 2^2k + a3 * 2^3k (k = 224)
can be reduced to:
b0 + b1 * 2^k (mod p)
through modular additions of the smaller numbers:
b0 = a0 + a2 + a3 (mod p) b1 = a1 + a2 + a3 + a3 (mod p)
I tried to implement it (as attached) and got 20-30% speed-up with mini-gmp.
size modp reduce modq modinv mi_gcd mi_pow dup_jj ad_jja ad_hhh mul_g mul_a (us) 192 0.0066 0.0061 0.0770 28.90 0.000 0.00 0.698 0.811 1.014 44.2 188.5 224 0.0091 0.0091 0.1231 47.57 0.000 0.00 0.969 1.285 1.566 90.2 329.1 255 0.0081 0.0078 0.1550 18.70 0.000 0.00 0.767 1.014 1.082 88.7 278.4 256 0.1288 0.0115 0.1464 50.22 0.000 0.00 1.004 1.457 1.701 105.6 380.1 384 0.0170 0.0170 0.1260 106.51 0.000 0.00 1.791 2.240 3.113 305.9 1015.7 448 0.1892 0.1886 0.1886 171.57 0.000 0.00 3.630 4.914 4.940 842.1 1966.8 521 0.0163 0.0158 0.2821 217.44 0.000 0.00 3.555 4.507 6.107 788.8 2637.8
size modp reduce modq modinv mi_gcd mi_pow dup_jj ad_jja ad_hhh mul_g mul_a (us) 192 0.0080 0.0082 0.0748 30.61 0.000 0.00 0.663 0.810 1.007 43.9 189.9 224 0.0103 0.0108 0.1281 47.48 0.000 0.00 0.968 1.481 1.632 89.4 323.2 255 0.0095 0.0090 0.1545 18.71 0.000 0.00 0.828 1.081 1.085 90.8 284.9 256 0.1353 0.0123 0.1423 51.25 0.000 0.00 0.992 1.326 1.726 107.4 379.3 384 0.0174 0.0171 0.1253 108.82 0.000 0.00 1.760 2.233 2.991 302.9 1004.2 448 0.1382 0.1380 0.1865 147.87 0.000 0.00 3.167 4.161 4.211 738.6 1668.3 521 0.0186 0.0179 0.2750 216.30 0.000 0.00 3.438 4.249 5.756 743.5 2510.6
Do you think this is acceptable or need further optimization?
Regards,