Daiki Ueno <ueno@gnu.org> writes:
> Thank you for the suggestions and sorry for the shameless delay.
No problem, I'm also quite slow. I think there are three main pieces left to integrate.
1. Curve operations to support Curve448 (i.e., Diffie-Hellman operations). I have made some progress on my curve448 branch.
2. SHAKE 128/256. I think I had some questions on the interface design.
3. EdDSA 448.
Optimization of the mod p arithmetic isn't that important yet, but I'll nevertheless try to explain how I think about it.
> Due to my ignorance, I probably don't get what you mean, but as far as I read the implementations of other curves in Nettle, some of them seem to use the property of generalized Mersenne numbers described in: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.46.2133 and it's also applicable to the Curve448 prime.
The way I think about it, we have p = 2^448 - 2^224 - 1. This implies that

  2^448 = 2^224 + 1 (mod p)

(since a = b (mod p) means that a - b is a multiple of p, and in this case, the difference between the left- and right-hand sides is exactly p).
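This is easy to sanity-check with arbitrary-precision integers; a quick Python check (illustration only, the implementation itself is C):

```python
# The Curve448 prime and its wrap-around identities.
p = 2**448 - 2**224 - 1

# The difference between the two sides is exactly p, so they are congruent.
assert 2**448 - (2**224 + 1) == p
assert pow(2, 448, p) == 2**224 + 1
# Applying the identity once more gives the reduction for 2^(3k), k = 224:
# 2^672 = 2^224 * 2^448 = 2^224 * (2^224 + 1) = 2^448 + 2^224 = 2^225 + 1 (mod p)
assert pow(2, 672, p) == 2**225 + 1
```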
> So a 896-bit product here:
>
>   a0 + a1 * 2^k + a2 * 2^2k + a3 * 2^3k   (k = 224)
And in this notation, 2^2k = 2^k + 1 (mod p) and 2^3k = 2^2k + 2^k (mod p). If we use these relations to replace the high powers of two, we get precisely your formulas:
> can be reduced to:
>
>   b0 + b1 * 2^k (mod p)
>
> through modular additions of the smaller numbers:
>
>   b0 = a0 + a2 + a3 (mod p)
>   b1 = a1 + a2 + a3 + a3 (mod p)
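These formulas are easy to check numerically with arbitrary-precision integers; a quick sketch in Python (illustration only, Nettle itself is C):

```python
# Numerical check of the b0/b1 reduction formulas for p = 2^448 - 2^224 - 1.
import random

p = 2**448 - 2**224 - 1
k = 224
mask = 2**k - 1

def reduce_chunks(a):
    """Split a (assumed < 2^896) into 224-bit chunks and apply
    b0 = a0 + a2 + a3, b1 = a1 + a2 + a3 + a3."""
    a0 = a & mask
    a1 = (a >> k) & mask
    a2 = (a >> 2 * k) & mask
    a3 = a >> 3 * k
    return a0 + a2 + a3, a1 + a2 + a3 + a3

random.seed(0)
for _ in range(100):
    # A full 896-bit product of two 448-bit numbers.
    a = random.getrandbits(448) * random.getrandbits(448)
    b0, b1 = reduce_chunks(a)
    assert (b0 + (b1 << k)) % p == a % p
```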
I'm not sure what's the best way to organize this. I suggested doing it in two steps, first reduce the high three chunks to two and leave a0 unchanged:
  a1 + a2 2^k + a3 2^2k = (a1 + a3) + (a2 + a3) 2^k = b1 + b2 2^k (mod p)
where we can do the carry propagation and carry wrap-around to get b1 and b2 as 224-bit numbers.
And then repeat the same procedure once more:
  a0 + b1 2^k + b2 2^2k = (a0 + b2) + (b1 + b2) 2^k = c0 + c1 2^k (mod p)
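A sketch of the two-step procedure with arbitrary-precision integers (Python for illustration; the helper name fold3 is made up here, not Nettle's API). The carry out at position 2^2k is folded back in as carry * (2^k + 1), using 2^2k = 2^k + 1 (mod p):

```python
p = 2**448 - 2**224 - 1
k = 224
mask = 2**k - 1

def fold3(x0, x1, x2):
    """Reduce x0 + x1*2^k + x2*2^(2k) to c0 + c1*2^k, congruent mod p,
    with c0 and c1 brought down to 224 bits. Hypothetical helper name."""
    c0, c1 = x0 + x2, x1 + x2   # since x2*2^(2k) = x2*2^k + x2 (mod p)
    while (c0 >> k) or (c1 >> k):
        # Plain carry propagation from c0 into c1 ...
        c1 += c0 >> k
        c0 &= mask
        # ... and wrap-around of the carry out of c1:
        # carry*2^(2k) = carry*(2^k + 1) (mod p)
        carry = c1 >> k
        c1 &= mask
        c0 += carry
        c1 += carry
    return c0, c1

def mod_p_reduce(a):
    """Reduce an 896-bit product a to a 448-bit value congruent mod p."""
    a0, a1 = a & mask, (a >> k) & mask
    a2, a3 = (a >> 2 * k) & mask, a >> 3 * k
    # Step 1: reduce the high three chunks, leaving a0 untouched.
    b1, b2 = fold3(a1, a2, a3)
    # Step 2: the same procedure once more, now bringing in a0.
    c0, c1 = fold3(a0, b1, b2)
    return c0 + (c1 << k)
```

The result is 448 bits but not necessarily fully reduced below p, which is all the internal mod functions need.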
This is going to work out nicely on 32-bit platforms. For 64-bit, performance should be better the more shift operations can be avoided, and your variant might then be faster than the above.
On the other hand, for an assembly implementation keeping values in registers, the two-step way needs fewer registers, since loading the lowest limbs can be postponed until after the higher-order limbs have been eliminated and no longer need any registers.
Looking at your formulas,

  b0 = a0 + a2 + a3 (mod p)
  b1 = a1 + a2 + a3 + a3 (mod p)
I think it should be possible to implement this with only a single 7-limb shift: in the b0 formula, only a3 needs shifting, since a0 and a2 are already at the right position. And in the b1 formula, a1 and a3 are at the right position, but a2 needs a shift. I think I'd try the following:
1. Do the additions a0 += a2, a1 += a3, which I think can be done as a single 7-limb add, with plain carry propagation between the pieces. Carry out can be folded immediately, or saved for later.
2. Do the second a1 += a3 (needs some masking to extract the a3 bits, but no shift). Or maybe skip the masking, and handle the low 32 bits of a3 some other way.
3. Do an in-place 32-bit left shift of a2, a3.
4. Add the shifted terms, a0 += a3, a1 += a2. This gets even better if we make the a2, a3 shift above a rotate, storing the shifted-out high bits of a3 back at the position of the low bits of a2.
5. Do final carry handling.
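The steps above can be sketched at the integer level in Python (illustration only; the 224-bit rotate below stands in for the 32-bit shift of a2, a3 combined with limb-offset addressing, since 224 bits is 3.5 64-bit limbs, and the function name is made up):

```python
p = 2**448 - 2**224 - 1
M448 = 2**448 - 1

def mod_p_steps(t):
    """Reduce an 896-bit product t following the 5-step outline (sketch)."""
    low = t & M448    # a0 + a1*2^224, the low 7 limbs
    high = t >> 448   # a2 + a3*2^224, the high 7 limbs
    # Step 1: one 7-limb add of the high half onto the low half;
    # this does a0 += a2 and a1 += a3 in a single pass.
    acc = low + high
    # Step 2: the second a1 += a3; mask off the a2 bits of the high half.
    acc += (high >> 224) << 224
    # Steps 3+4: rotate the high half by 224 bits, giving a3 + a2*2^224,
    # and add it; this does a0 += a3 and a1 += a2.
    rot = ((high << 224) | (high >> 224)) & M448
    acc += rot
    # Step 5: final carry handling; fold bits at 2^448 and above back in,
    # using 2^448 = 2^224 + 1 (mod p).
    while acc >> 448:
        c = acc >> 448
        acc = (acc & M448) + c + (c << 224)
    return acc   # 448 bits, not necessarily fully reduced below p
```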
There's no need to try to find the optimal reduction strategy right away, though; other missing pieces are more important.
> I tried to implement it (as attached) and got a 20-30% speed-up with mini-gmp.
For good test coverage of these functions, I recommend building with the real GMP (to get access to its mpz_rrandomb function), and then letting

  while NETTLE_TEST_SEED=0 ./ecc-mod-test ; do : ; done

run overnight. And possibly also hacking ecc_mod_test.c to only test the curve of interest.
Regards,
/Niels