nisse@lysator.liu.se (Niels Möller) writes:
Performance for the scalar multiplication primitives seems to be slower than secp384 and slightly faster than secp521, and looking at point addition, it's slower than secp521. I hope that will be improved quite a bit with an optimized mod operation for the curve448 prime.
I've tried out this mod function (for 64-bit):

static void
ecc_448_modp(const struct ecc_modulo *m, mp_limb_t *rp)
{
  /* Let B = 2^64, b = 2^32 = sqrt(B).
     p = B^7 - b B^3 - 1 ==> B^7 = b B^3 + 1

     {x_{13}, ..., x_0} =
       {x_6,...,x_0}
       + {x_{10},...,x_7}
       + 2 {x_{13},x_{12}, x_{11}} B^4
       + b {x_{10},...,x_7,x_{13},x_{12},x_{11}}
  */
  mp_limb_t c3, c4, c7;
  mp_limb_t *tp = rp + 7;

  c4 = mpn_add_n (rp, rp, rp + 7, 4);
  c7 = mpn_addmul_1 (rp + 4, rp + 11, 3, 2);
  c3 = mpn_addmul_1 (rp, rp + 11, 3, (mp_limb_t) 1 << 32);
  c7 += mpn_addmul_1 (rp + 3, rp + 7, 4, (mp_limb_t) 1 << 32);
  tp[0] = c7;
  tp[1] = tp[2] = 0;
  tp[3] = c3 + (c7 << 32);
  tp[4] = c4 + (c7 >> 32) + (tp[3] < c3);
  tp[5] = tp[6] = 0;
  c7 = mpn_add_n (rp, rp, tp, 7);
  c7 = cnd_add_n (c7, rp, m->B, 7);
  assert(c7 == 0);
}

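For anyone who wants to poke at the folding without building against GMP, here is a rough standalone sketch of the same steps. The names limb_t, dlimb_t, add_n, addmul_1 and curve448_modp_sketch are made up for illustration; the two helpers only imitate what mpn_add_n and mpn_addmul_1 return. It assumes 64-bit limbs, a compiler with unsigned __int128 (GCC or Clang), and that m->B holds B^7 mod p = b B^3 + 1.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t limb_t;            /* stand-in for a 64-bit mp_limb_t */
typedef unsigned __int128 dlimb_t;  /* GCC/Clang extension, assumed available */

/* rp[0..n) = ap[0..n) + bp[0..n); returns the carry out (0 or 1). */
static limb_t
add_n (limb_t *rp, const limb_t *ap, const limb_t *bp, int n)
{
  limb_t cy = 0;
  for (int i = 0; i < n; i++)
    {
      dlimb_t s = (dlimb_t) ap[i] + bp[i] + cy;
      rp[i] = (limb_t) s;
      cy = (limb_t) (s >> 64);
    }
  return cy;
}

/* rp[0..n) += ap[0..n) * v; returns the carry-out limb. */
static limb_t
addmul_1 (limb_t *rp, const limb_t *ap, int n, limb_t v)
{
  limb_t cy = 0;
  for (int i = 0; i < n; i++)
    {
      dlimb_t s = (dlimb_t) ap[i] * v + rp[i] + cy;
      rp[i] = (limb_t) s;
      cy = (limb_t) (s >> 64);
    }
  return cy;
}

/* Reduce the 14-limb value at rp in place; the result ends up in
   rp[0..7), congruent to the input mod p but not necessarily < p. */
static void
curve448_modp_sketch (limb_t *rp)
{
  limb_t c3, c4, c7;
  limb_t *tp = rp + 7;

  c4 = add_n (rp, rp, rp + 7, 4);                       /* + {x10..x7} */
  c7 = addmul_1 (rp + 4, rp + 11, 3, 2);                /* + 2 {x13..x11} B^4 */
  c3 = addmul_1 (rp, rp + 11, 3, (limb_t) 1 << 32);     /* + b {x13..x11} */
  c7 += addmul_1 (rp + 3, rp + 7, 4, (limb_t) 1 << 32); /* + b {x10..x7} B^3 */

  /* Collect the pending carries: c3 B^3 + c4 B^4 + c7 (b B^3 + 1). */
  tp[0] = c7;
  tp[1] = tp[2] = 0;
  tp[3] = c3 + (c7 << 32);
  tp[4] = c4 + (c7 >> 32) + (tp[3] < c3);
  tp[5] = tp[6] = 0;
  c7 = add_n (rp, rp, tp, 7);

  /* Branch-free stand-in for cnd_add_n (c7, rp, m->B, 7): c7 is 0 or 1,
     so this adds either 0 or b B^3 + 1. */
  limb_t fix[7] = { c7, 0, 0, c7 << 32, 0, 0, 0 };
  c7 = add_n (rp, rp, fix, 7);
  assert (c7 == 0);
}
```

Feeding it B^7 (only limb 7 set to 1) should give back b B^3 + 1, i.e. rp[0] == 1 and rp[3] == 1 << 32, which is just the identity from the comment.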
This gives a speedup of 85% over the general ecc_mod (on my machine) and about a 35% speedup for scalar multiplication (both mul_g and mul_a). So with this change, performance of mul_g and mul_a is roughly midway between secp384 and secp521.
I'm not sure that replacing the addmul_1 calls with shifts is worthwhile for the C code: we'd get more function calls and more passes over the data, although it should still pay off on machines with slow multiplication. For an assembly implementation, though, the addmul_1(..., 2) call should be adds only, in registers, and the addmul_1(..., 1<<32) calls should be shift and add, preferably in registers.
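To make the shift-and-add idea concrete, here is a minimal C sketch of a single-pass loop that computes the same thing as an addmul_1 by a power of two, without any multiplication. The name add_lsh_n is hypothetical (it is not an existing function in the tree), and the sketch assumes 64-bit limbs; k = 1 would cover the multiply by 2 and k = 32 the multiply by 1<<32.

```c
#include <assert.h>
#include <stdint.h>

/* rp[0..n) += ap[0..n) << k, for 0 < k < 64, using shifts and adds only.
   Returns the carry-out limb, i.e. the same value that an addmul_1 by
   (1 << k) would return. */
static uint64_t
add_lsh_n (uint64_t *rp, const uint64_t *ap, int n, unsigned k)
{
  uint64_t hi = 0;  /* bits shifted out of the previous limb */
  uint64_t cy = 0;  /* carry out of the previous addition */

  assert (k > 0 && k < 64);
  for (int i = 0; i < n; i++)
    {
      uint64_t t = (ap[i] << k) | hi;  /* limb i of ap * 2^k */
      hi = ap[i] >> (64 - k);

      uint64_t s = rp[i] + t;
      uint64_t c1 = s < t;             /* carry from rp[i] + t */
      s += cy;
      uint64_t c2 = s < cy;            /* carry from adding cy */
      rp[i] = s;
      cy = c1 | c2;                    /* the two can't both be set */
    }
  return hi + cy;
}
```

In the mod function this would stand in for calls like addmul_1 (rp, rp + 11, 3, (mp_limb_t) 1 << 32). It stays a single pass over the data, but in C the compiler still has to do the carry detection with compares, so whether it beats the multiply is something I'd only trust a benchmark on.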
I'm going to leave randomized testing running for a few hours.
Regards,
/Niels