nisse@lysator.liu.se (Niels Möller) writes:
I think there are three main pieces left to integrate.
1. Curve operations to support Curve448 (i.e., Diffie-Hellman operations). I have made some progress, on my curve448 branch,
2. SHAKE 128/256. I think I had some question on the interface design.
3. EdDSA 448.
Optimization of the mod p arithmetic isn't that important yet,
I see. I thought that the performance of curve operations should at least be comparable to P-521. However, even with the generic ecc_mod for mod p, those are already close. So let's look at the above items first.

I have rebased my patch implementing (1) on the curve448 branch:
https://gitlab.com/dueno/nettle/commits/wip/dueno/curve448-2
One thing I noticed is that the point addition formula for untwisted curves doesn't look correct: https://gitlab.com/dueno/nettle/commit/4e3a50f4a50d8d03536dc107d7b77c84462e3...
but I'll nevertheless try to explain how I think about it.
Thank you for the detailed explanation. I ran the benchmark for these three variants: (1) the original version using ecc_mod, (2) the two-step reduction as you suggest, and (3) my formula optimized with single 7-limb operations:
Variant (1):

size    modp  reduce    modq  modinv  mi_gcd  mi_pow  dup_jj  ad_jja  ad_hhh   mul_g   mul_a (us)
 448  0.0727  0.0720  0.0739   44.01   1.451   52.92   1.088   1.456   1.406   299.6   557.6
 521  0.0139  0.0151  0.1003   77.72   1.703  101.59   0.728   0.995   1.277   255.8   588.4

Variant (2):

size    modp  reduce    modq  modinv  mi_gcd  mi_pow  dup_jj  ad_jja  ad_hhh   mul_g   mul_a (us)
 448  0.0496  0.0497  0.0764   34.77   1.500   49.59   0.923   1.158   1.169   273.5   500.1
 521  0.0147  0.0144  0.1027   77.63   1.816   88.57   0.716   0.934   1.276   237.2   589.9

Variant (3):

size    modp  reduce    modq  modinv  mi_gcd  mi_pow  dup_jj  ad_jja  ad_hhh   mul_g   mul_a (us)
 448  0.0641  0.0644  0.0809   52.76   1.570   49.42   1.007   1.340   1.343   288.1   570.5
 521  0.0139  0.0141  0.0967   78.22   1.697   99.44   0.714   1.012   1.264   235.8   589.2

on Core i7-6600U CPU @ 2.60GHz.
My code could be wrong or inefficient, but (2) is actually the fastest. (3) is slower due to the final carry handling: the carry can accumulate to at most 3, and wrapping it around with cnd_add_n seems to be costly.
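
To make the carry handling concrete, here is a minimal sketch of folding reduction modulo p = 2^448 - 2^224 - 1, written with Python integers rather than limbs. It is not the code from the branch and not necessarily either of the formulas benchmarked above; the names fold, reduce_p448 and self_check are made up for illustration. It relies only on the identity 2^448 = 2^224 + 1 (mod p). For a product below 2^896, two folds leave a value below 4 * 2^448, so the leftover carry is at most 3, and one more fold wraps it around.

```python
import random

# p = 2^448 - 2^224 - 1, the Curve448 field prime.
P448 = 2**448 - 2**224 - 1
MASK448 = 2**448 - 1


def fold(x):
    """Fold the bits above 2^448 back in, using 2^448 = 2^224 + 1 (mod p)."""
    lo = x & MASK448
    hi = x >> 448
    return lo + hi + (hi << 224)


def reduce_p448(x):
    """Reduce 0 <= x < 2^896 modulo p.

    Returns (residue, carry), where carry is what remains above bit 448
    after the first two folds.
    """
    r = fold(x)        # r < 2^673
    r = fold(r)        # r < 4 * 2^448
    carry = r >> 448   # at most 3
    r = fold(r)        # wrap the small carry around; now r < 2p
    if r >= P448:      # plain branch: fine for a sketch, not constant time
        r -= P448
    return r, carry


def self_check(trials=1000, seed=0):
    """Compare against Python's own % operator on random products."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = rng.randrange(P448 * P448)
        r, carry = reduce_p448(x)
        assert r == x % P448
        assert carry <= 3
    return True
```

In limb-level C the third fold and the final subtraction have to be done without data-dependent branches, which is presumably where a cnd_add_n style wrap-around adds its cost.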
Regards,