It's some time since I posted benchmark figures. The speed dropped a bit when I started to use the slow side-channel silent modinv, but I think it's reasonable. For comparison, I just added openssl's ecdsa functions to my benchmark program. These are the current figures for x86_64:
size modp redc modq add_jja dup_jj (us) 192 0.0248 0.0265 0.0450 0.6319 0.5151 224 0.0519 0.0467 0.0680 0.9136 0.7149 256 0.0834 0.0409 0.0802 0.8222 0.6617 384 0.0927 0.0000 0.0607 1.6663 1.2559 521 0.0347 0.0514 0.1289 1.5239 1.1670
name size sign / ms verify / ms rsa 1024 6.3180 102.7909 rsa 2048 0.9562 29.2339 dsa 1024 11.1780 5.7492 ecdsa 192 14.9983 4.5612 ecdsa 224 8.0952 2.6880 ecdsa 256 7.7640 2.5785 ecdsa 384 2.9131 0.9155 ecdsa 521 1.7974 0.6594 ecdsa (openssl) 224 3.4939 3.0582 ecdsa (openssl) 384 1.4637 1.2603 ecdsa (openssl) 521 0.6962 0.5981
If we extract common functions for openssl (it seemed to lack secp192r1 and secp256r1), we get
ecdsa 224 8.0952 2.6880 ecdsa (openssl) 224 3.4939 3.0582
ecdsa 384 2.9131 0.9155 ecdsa (openssl) 384 1.4637 1.2603
ecdsa 521 1.7974 0.6594 ecdsa (openssl) 521 0.6962 0.5981
So it looks like for signing, the current code beats openssl by a factor of two. While for verify, we're a little behind. And my code tries hard to be side-channel silent (even for verify, where it doesn't matter). I'm not sure if openssl tries to be side-channel silent.
Regards, /Niels
ons 2013-02-13 klockan 16:34 +0100 skrev Niels Möller:
While for verify, we're a little behind. And my code tries hard to be side-channel silent (even for verify, where it doesn't matter).
Why is that? Is it because you re-use code that is also used by signing? Maybe it makes sense to implement the time consuming functions in a side-channel leaky (but faster) way for use with verify? It will make the code somewhat bigger, but I'm not sure anyone cares.
Btw, it would be nice to compare with GnuTLS' ECDSA as well, it contains some nice optimizations.
/Simon
Simon Josefsson simon@josefsson.org writes:
Why is that? Is it because you re-use code that is also used by signing?
Exactly, for now I stick to using the same primitives.
Maybe it makes sense to implement the time consuming functions in a side-channel leaky (but faster) way for use with verify? It will make the code somewhat bigger, but I'm not sure anyone cares.
It would certainly make sense to use separate functions for verifying (or other computations on public values only), but I think I'd like to integrate the current code in Nettle first. And there are other possible optimizations too, so I think one should go for the lowest hanging fruit first.
Btw, it would be nice to compare with GnuTLS' ECDSA as well, it contains some nice optimizations.
Do you have an example on how to do that? Corresponding to the (quite ugly) openssl interface at https://www.openssl.org/docs/crypto/ecdsa.html, including an almost working example. I have to admit that I'm not very familiar with gnutls, so I'm probably not looking at the right places.
Regards, /Niels
On 02/13/2013 04:34 PM, Niels Möller wrote:
So it looks like for signing, the current code beats openssl by a factor of two. While for verify, we're a little behind.
How can it be like that? Signing requires: modulo n, 1 inversion, 2 multiplications, and 1 addition, as well as 1 multiplication with the EC generator point.
The verification requires: modulo n, 1 inversion, 2 multiplications, 1 multiplication with a random elliptic curve point, 1 multiplication with the EC generator point and an elliptic curve addition.
If nettle is faster on signing it means (if we ignore operations mod n), that EC multiplication with the curve generator point is 2x faster on nettle.
The fact that openssl is a bit faster on verification it would mean that nettle is really slower than openssl in multiplying a random point, and adding points on the curve. Could that be?
regards, Nikos
Nikos Mavrogiannopoulos n.mavrogiannopoulos@gmail.com writes:
If nettle is faster on signing it means (if we ignore operations mod n), that EC multiplication with the curve generator point is 2x faster on nettle.
Probably a bit more than a factor of two, since the silent modular inversion is likely much slower than openssl's.
The fact that openssl is a bit faster on verification it would mean that nettle is really slower than openssl in multiplying a random point, and adding points on the curve. Could that be?
The point multiplication with the random point takes much longer time than the point multiplication involving the generator. So if we have, say, 40% slowdown for the random point, and 200% speedup for the generator, the net result could still be 20% slowdown. (I made those figures up, but I think they are in the right ballpark).
I guess the penalty for side-channel silence is generally some 10-30%. Except for modinv, where it's even worse.
The reason we get pretty good performance for signing is that I use the Pippenger/comb algorithm for multiplication involving the generator. With current parameters, there's roughly 16 KByte of data for each curve, and multiplication takes n/6 dups and n/6 adds, where n is the bit size. While algorithms like wmNAF may use even fewer adds, but it still needs n dups. And for all the adds, one point is precomputed, so in Jacobian coordinates we have Z=1.
Regards, /Niels
nisse@lysator.liu.se (Niels Möller) writes:
The point multiplication with the random point takes much longer time than the point multiplication involving the generator.
I added some more functions to the benchmarking (at the cost of a bit too long lines):
size modp redc modq modinv dup_jj add_jja add_jjj mul_g mul_a (us) 192 0.0250 0.0261 0.0442 13.7320 0.4994 0.6127 0.6138 37.7651 147.9455 224 0.0546 0.0435 0.0665 22.8033 0.7316 0.9063 0.9075 75.1715 256.2235 256 0.0714 0.0391 0.0798 23.1066 0.6377 0.7910 0.7912 80.2991 254.6769 384 0.0834 0.0000 0.0614 47.6998 1.2428 1.6319 1.6311 241.0215 734.5506 521 0.0344 0.0533 0.1296 105.0090 1.1316 1.4764 1.4775 343.3341 915.2812
So mul_a appears to be about 3 times slower than mul_g. And modinv is awful, at 1/3 of the time of a mul_g, signing will spend 1/4 of the time in modinv. (I have some ideas on how to simplify modinv and make it a little less slow).
(BTW, the 0.0 for 384-bit redc just means that no redc function is implemented for that prime. It's more of an anomaly that modp is slower than modq; that's an opportunity for optimization).
Regards, /Niels
On 02/14/2013 09:16 AM, Niels Möller wrote:
The point multiplication with the random point takes much longer time than the point multiplication involving the generator.
I added some more functions to the benchmarking (at the cost of a bit too long lines):
size modp redc modq modinv dup_jj add_jja add_jjj mul_g mul_a (us) 192 0.0250 0.0261 0.0442 13.7320 0.4994 0.6127 0.6138 37.7651 147.9455 224 0.0546 0.0435 0.0665 22.8033 0.7316 0.9063 0.9075 75.1715 256.2235 256 0.0714 0.0391 0.0798 23.1066 0.6377 0.7910 0.7912 80.2991 254.6769 384 0.0834 0.0000 0.0614 47.6998 1.2428 1.6319 1.6311 241.0215 734.5506 521 0.0344 0.0533 0.1296 105.0090 1.1316 1.4764 1.4775 343.3341 915.2812
So mul_a appears to be about 3 times slower than mul_g. And modinv is awful, at 1/3 of the time of a mul_g, signing will spend 1/4 of the time in modinv. (I have some ideas on how to simplify modinv and make it a
little less slow).
That's pretty good table. It would be nice to have a comparison of modinv with mpz_invert as a baseline (you also don't need to use a timing resistant modinv during verification).
How hard could it be to add the wmNAF multiplication from ecc_mulmod.c in gnutls to this list for comparison? If it is much faster than mul_a, then it would be a good candidate for the multiplication needed in DH (which doesn't need to be side-channel resistant).
regards, Nikos
Nikos Mavrogiannopoulos n.mavrogiannopoulos@gmail.com writes:
That's pretty good table. It would be nice to have a comparison of modinv with mpz_invert as a baseline (you also don't need to use a timing resistant modinv during verification).
That would make sense. On the mpn level, the corresponding GMP function is mpn_gcdext.
How hard could it be to add the wmNAF multiplication from ecc_mulmod.c in gnutls to this list for comparison?
Do you have some example code to use this gnutls function? Then I'm afraid it might also be a bit tricky to get linking right if we want to have it all in the same benchmark executable.
If it is much faster than mul_a, then it would be a good candidate for the multiplication needed in DH (which doesn't need to be side-channel resistant).
There are two potential gains from using gnutls code or something similar: (i) Fewer point additions, due to more clever window hadling and/or exponent recoding. (ii) The add and dup primitives could be sped up a little if they're not required to be side-channel silent.
Regards, /Niels
On 02/17/2013 09:47 AM, Niels Möller wrote:
Nikos Mavrogiannopoulos n.mavrogiannopoulos@gmail.com writes:
That's pretty good table. It would be nice to have a comparison of modinv with mpz_invert as a baseline (you also don't need to use a timing resistant modinv during verification).
That would make sense. On the mpn level, the corresponding GMP function is mpn_gcdext.
How hard could it be to add the wmNAF multiplication from ecc_mulmod.c in gnutls to this list for comparison?
Do you have some example code to use this gnutls function? Then I'm afraid it might also be a bit tricky to get linking right if we want to have it all in the same benchmark executable.
This is not exported from gnutls (we only export the high level API). I meant copying it from there with the functions it depends on (they aren't be many) and adding it to the test you have.
regards, Nikos
nettle-bugs@lists.lysator.liu.se