I've started to read up on ECC again (and I've also spent some time reading up on ARM architecture, but that's for a different message).
As I wrote earlier, for a start I'll limit the support to the standard NIST curves. So the first question is naming: what are the most established names for these curves? It seems gnutls uses, e.g., "SECP256R1"; I guess we can just follow that (rather than, e.g., "P-256", as it's called in the DSA spec FIPS 186-3)?
My current plan is as follows:
1. The top-level functions for ECDSA and the like will use mpz_t to represent point coordinates, exponents, etc, just like the current DSA code.
2. Internally, I think I want to use GMP's lower-level mpn interface, for two reasons: (i) that should make it possible, with some effort, to use ecc functions without any allocation inside Nettle. (ii) For performance, I'd like to do specific mod p functions for the various curves, and this is done best on top of functions such as mpn_add_n and mpn_addmul_1. After all, the primes p used for the curves are carefully chosen to have special structure, to enable optimizations.
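To illustrate why the special structure of these primes matters, here's a toy sketch (my own illustration, not Nettle code) using the single-word Mersenne prime 2^61 - 1, where reduction is just folding the high bits back onto the low bits, with no division. The NIST primes admit the same kind of folding, spread over several limbs with mpn_add_n, mpn_addmul_1 and friends (e.g. secp192r1's p = 2^192 - 2^64 - 1). Requires a compiler with unsigned __int128.

```c
#include <stdint.h>

/* p = 2^61 - 1, so 2^61 == 1 (mod p): a wide product is reduced
   by folding its high bits onto the low bits instead of dividing. */
#define P61 ((UINT64_C(1) << 61) - 1)

/* Multiply mod p without any division instruction.
   Requires a, b < p. */
uint64_t mulmod_p61 (uint64_t a, uint64_t b)
{
  unsigned __int128 t = (unsigned __int128) a * b;
  uint64_t lo = (uint64_t) t & P61;    /* t mod 2^61 */
  uint64_t hi = (uint64_t) (t >> 61);  /* t div 2^61, == its value mod p */
  uint64_t r = lo + hi;
  /* two branch-free folds bring r into [0, p] */
  r = (r & P61) + (r >> 61);
  r = (r & P61) + (r >> 61);
  return r == P61 ? 0 : r;
}
```

The multi-limb versions for the real curves would do the analogous folds with mpn primitives, but the principle is the same: the prime's structure turns reduction into a few additions.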
3. Representation of a curve will for now be an opaque struct. The header will just declare
struct ecc_curve;
extern const struct ecc_curve nettle_secp256r1;
Tables for point multiplication will be compiled into the library. They will be somewhat machine dependent, depending on the GMP limb size, so they have to be generated at build time.
4. I think I'll use Jacobian coordinates. These have a penalty for addition of arbitrary points, but we can try to use multiplication algorithms where one of the points being added always is a constant, which can be represented in normalized form. Then there's no penalty over plain homogeneous coordinates (this is the operation called "madd" in gnutls sources).
I've been intrigued by the recent (2007) breakthrough of Edwards curves, but as far as I understand, that can't easily be applied to the standard NIST curves. See http://cr.yp.to/newelliptic/newelliptic.html.
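To make the coordinate choice concrete, here's a toy sketch of Jacobian doubling over a tiny field, checked against the affine formula. The curve parameters (a=2, b=3 over GF(97), with the point (0,10)) are purely illustrative, not from any standard curve; the point is that the Jacobian version needs no field inversion.

```c
#include <stdint.h>

/* Toy curve y^2 = x^3 + A*x + b over GF(P); a Jacobian point
   (X,Y,Z) represents the affine point (X/Z^2, Y/Z^3). */
#define P 97
#define A 2

uint32_t powmod (uint32_t b, uint32_t e)
{
  uint32_t r = 1;
  for (; e; e >>= 1, b = b * b % P)
    if (e & 1) r = r * b % P;
  return r;
}

uint32_t inv (uint32_t x) { return powmod (x, P - 2); }  /* Fermat */

/* Affine doubling: lambda = (3x^2 + a) / (2y), needs an inversion. */
void dup_affine (uint32_t x, uint32_t y, uint32_t *x3, uint32_t *y3)
{
  uint32_t l = (3 * x % P * x + A) % P * inv (2 * y % P) % P;
  *x3 = (l * l % P + 2 * (P - x)) % P;
  *y3 = (l * ((x + P - *x3) % P) % P + P - y) % P;
}

/* Jacobian doubling: S = 4XY^2, M = 3X^2 + aZ^4,
   X' = M^2 - 2S, Y' = M(S - X') - 8Y^4, Z' = 2YZ.
   Only field mul/add, no inversion. */
void dup_jacobian (uint32_t x, uint32_t y, uint32_t z,
                   uint32_t *x3, uint32_t *y3, uint32_t *z3)
{
  uint32_t yy = y * y % P;
  uint32_t s = 4 * x % P * yy % P;
  uint32_t zz = z * z % P;
  uint32_t m = (3 * x % P * x + A * zz % P * zz) % P;
  *x3 = (m * m % P + 2 * (P - s)) % P;
  *y3 = (m * ((s + P - *x3) % P) % P + P - 8 * yy % P * yy % P) % P;
  *z3 = 2 * y % P * z % P;
}
```

Converting back to affine costs one inversion at the very end (normalization), which is the whole point: one inversion per multiplication instead of one per group operation.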
5. For point multiplication with the fixed generator, I think I'll use the L=1 special case of Pippenger's algorithm, described in Daniel Bernstein's paper http://cr.yp.to/papers/pippenger.pdf. I think this is more or less the same as the "comb" method or "Lim-Lee"-method, but predating any bogus patents on the algorithm.
(I can try to summarize the algorithm later, but it's pretty simple once one gets past the notation).
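Here is a sketch of the comb idea, using modular multiplication as a stand-in for point addition and squaring as a stand-in for doubling (all names and parameters are mine, for illustration only). The N-bit exponent is split into K rows of D = N/K columns; a precomputed table covers all 2^K tooth patterns, so the main loop needs only D "doublings" instead of N.

```c
#include <stdint.h>

#define PMOD 1000003u  /* toy prime modulus, illustrative only */

uint64_t mulm (uint64_t a, uint64_t b) { return a * b % PMOD; }

#define N 20   /* exponent bits */
#define K 4    /* comb teeth (rows) */
#define D 5    /* columns, D = N/K */

/* table[m] = g^(sum over set bits i of m of 2^(i*D)),
   i.e. one entry per tooth pattern. */
void comb_precompute (uint64_t g, uint64_t table[1 << K])
{
  uint64_t pow2id[K];  /* pow2id[i] = g^(2^(i*D)) */
  int i, j, m;

  table[0] = 1;
  for (i = 0; i < K; i++)
    {
      pow2id[i] = g;
      for (j = 0; j < i * D; j++)
        pow2id[i] = mulm (pow2id[i], pow2id[i]);
    }
  for (m = 1; m < (1 << K); m++)
    {
      for (i = 0; !((m >> i) & 1); i++)  /* lowest set tooth */
        ;
      table[m] = mulm (table[m & (m - 1)], pow2id[i]);
    }
}

uint64_t comb_powm (const uint64_t table[1 << K], uint64_t e)
{
  uint64_t acc = 1;
  int i, j;

  for (j = D - 1; j >= 0; j--)
    {
      unsigned digit = 0;
      acc = mulm (acc, acc);         /* one shared "doubling" per column */
      for (i = 0; i < K; i++)        /* gather bit j of each row */
        digit |= ((e >> (i * D + j)) & 1) << i;
      acc = mulm (acc, table[digit]);
    }
  return acc;
}
```

With elliptic curve points instead of residues, the D iterations each cost one doubling and one (table) addition, versus N doublings for the plain binary method; larger K trades table size for fewer iterations.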
I think this algorithm is fairly easy to make silent with regard to timing and cache side-channels. One nice feature is that even with pretty large tables, not all of the table index bits are data-dependent (extracted from the exponent). Then it should be practical to read all possible table entries and use masking to copy the selected one; see http://gmplib.org:8000/gmp/file/eee6b8e54765/mpn/generic/tabselect.c.
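The masking trick looks something like this (a simplified sketch in the style of the linked tabselect.c, not the actual GMP code): every entry is read, and a branch-free mask keeps only the wanted one, so the memory access pattern is independent of the secret index.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy table entry number secret_idx (entry_size words each) into rp,
   touching every entry so the access pattern leaks nothing. */
void sec_tabselect (uint64_t *rp, const uint64_t *table,
                    size_t entries, size_t entry_size, size_t secret_idx)
{
  size_t i, j;

  for (j = 0; j < entry_size; j++)
    rp[j] = 0;
  for (i = 0; i < entries; i++)
    {
      /* mask is all-ones iff i == secret_idx, computed without a branch:
         (d-1) & ~d has its top bit set exactly when d == 0. */
      uint64_t d = (uint64_t) (i ^ secret_idx);
      uint64_t mask = - (uint64_t) (((d - 1) & ~d) >> 63);
      for (j = 0; j < entry_size; j++)
        rp[j] |= table[i * entry_size + j] & mask;
    }
}
```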
This operation is the workhorse for ECDSA signing, and for the first half of ECDH.
I'm not sure how large tables it makes sense to create for each curve, but I guess somewhere between a few KB and at most 100 KB. This could be made configurable at compile time, but I'd expect almost all users to stick to the default tables.
6. For general point multiplication, I tend towards using a simple non-sliding window, with a small point-specific table computed at run time. If the table is small enough, it might make sense to normalize the entries. As far as I understand, one can't beat the plain binary algorithm by a large factor; the algorithms I'm aware of only reduce the number of adds, but not the number of dups.
This operation is the workhorse for the second half of ECDH.
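A sketch of such a fixed (non-sliding) window, again with modular multiplication standing in for the group operation (modulus, names and window width are illustrative): every window is processed, zero or not, so the sequence of squarings and multiplications depends only on the exponent's size. (The table lookup here is still indexed by secret data; a fully silent version would additionally use masked table access.)

```c
#include <stdint.h>

#define Q 1000003u  /* toy prime modulus, illustrative only */
#define W 4         /* window width in bits */

/* Fixed-window exponentiation with a small per-base table built at
   run time, mirroring the per-point table for general point
   multiplication. ebits bounds the exponent's bit length. */
uint64_t window_powm (uint64_t g, uint64_t e, unsigned ebits)
{
  uint64_t table[1 << W];
  uint64_t acc = 1;
  int i, j;

  table[0] = 1;
  for (i = 1; i < (1 << W); i++)
    table[i] = table[i - 1] * g % Q;

  /* walk the exponent in W-bit windows, most significant first */
  for (j = (int) ((ebits + W - 1) / W) * W - W; j >= 0; j -= W)
    {
      for (i = 0; i < W; i++)
        acc = acc * acc % Q;                          /* W "doublings" */
      acc = acc * table[(e >> j) & ((1u << W) - 1)] % Q;  /* always multiply */
    }
  return acc;
}
```

A sliding window would skip the multiplication for zero windows, saving some additions on average; the fixed window gives up exactly that saving to keep the operation sequence constant.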
7. For verifying ECDSA signatures, one needs to operate on two points, computing e_1 G + e_2 Y, where G is the generator (and hence offers possibilities for precomputation), and Y is the public key.
I haven't yet looked into this in detail, but it seems clear this should be done as a single operation, with only one chain of point duplications. Preferably, it should be implemented in such a way that the tables for G (from 5 above) can be reused.
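The standard way to share one doubling chain between the two scalars is Shamir's trick. Here's a toy sketch in a multiplicative group (my own illustrative code; modular multiplication again stands in for point addition): a four-entry table of g^a * y^b for a,b in {0,1} lets both exponents be consumed by a single chain of squarings.

```c
#include <stdint.h>

#define M 1000003u  /* toy prime modulus, illustrative only */

/* Compute g^e1 * y^e2 with one shared squaring chain, the analogue
   of computing e1*G + e2*Y with one chain of point doublings.
   bits bounds the length of both exponents. */
uint64_t dual_powm (uint64_t g, uint64_t e1, uint64_t y, uint64_t e2,
                    unsigned bits)
{
  uint64_t t[4];
  uint64_t acc = 1;
  int j;

  t[0] = 1; t[1] = g; t[2] = y; t[3] = g * y % M;
  for (j = (int) bits - 1; j >= 0; j--)
    {
      unsigned d = ((e1 >> j) & 1) | (((e2 >> j) & 1) << 1);
      acc = acc * acc % M;    /* shared "doubling" */
      acc = acc * t[d] % M;
    }
  return acc;
}
```

For the ecc case, the entries involving G could be taken from the fixed-generator tables, which is the reuse hinted at above. Verification operates only on public data, so no constant-time care is needed here.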
Comments? Does this make sense, or is there anything which ought to be done differently?
I have had a quick look at the gnutls code. wmNAF makes me a bit uneasy when I think about side channels; there seems to be a lot of branches depending on exponent bits.
In the point addition, it seems hard to avoid special cases when adding a point to itself or to its negation, or adding the zero point, which will cause side-channel leaks. I'm not sure what to do about that, but there are some papers on "unified" ecc operations, or maybe one can do something analogous to RSA blinding.
As far as I understand, for point multiplication in both ECDSA and ECDH, it's only the scalar that is ever secret; the points multiplied are always public. Are there other uses of point multiplication where the point is also secret?
Regards, /Niels
On 01/17/2013 10:00 AM, Niels Möller wrote:
As I wrote earlier, for a start I'll limit the support to the standard NIST curves. So the first question is naming: what are the most established names for these curves? It seems gnutls uses, e.g., "SECP256R1"; I guess we can just follow that (rather than, e.g., "P-256", as it's called in the DSA spec FIPS 186-3)?
Any would do. In any case both are used, so aliases from one to another should be good. In the context of TLS the SECP names are used.
- For point multiplication with the fixed generator, I think I'll use the L=1 special case of Pippenger's algorithm, described in Daniel Bernstein's paper http://cr.yp.to/papers/pippenger.pdf. I think this is more or less the same as the "comb" method or "Lim-Lee"-method, but predating any bogus patents on the algorithm.
How does this algorithm compare to others? The reason I insist on performance is that it would be hard for gnutls to switch to the new functions if they are significantly slower than the current ones, and I'd like to switch eventually to nettle for all operations including ECC.
- For general point multiplication, I tend towards using a simple
non-sliding window, with a small point-specific table computed at run time. If the table is small enough, it might make sense to normalize the entries. As far as I understand, one can't beat the plain binary algorithm by a large factor; the algorithms I'm aware of only reduce the number of adds, but not the number of dups.
I really cannot answer with any facts since I need to check the details, but I've not seen other efficient implementations using a non-sliding window for that. Are you sure that the effect is insignificant? Additions aren't that negligible. Also for that part you don't need constant time. In ECDSA you only multiply the fixed generator with the key (in ECDH you also don't care unless it is fixed key ECDH).
- For verifying ECDSA signatures, one needs to operate on two points, computing e_1 G + e_2 Y, where G is the generator (and hence offers possibilities for precomputation), and Y is the public key.
I would put that at lower priority than signing. Servers (where performance matters) rarely verify ECDSA signatures.
I have had a quick look at the gnutls code. wmNAF makes me a bit uneasy when I think about side channels; there seems to be a lot of branches depending on exponent bits.
Why is that? Did you see any issues on the timing resistant version of the function?
In the point addition, it seems hard to avoid special cases when adding a point to itself or to its negation, or adding the zero point.
Do you mean the infinity point or the point (0,0,0)? I think it can be caught by checking whether a point lies on the curve prior to the verification operation ((0,0,0) isn't a point in the projective space either).
As far as I understand, for point multiplication in both ECDSA and ECDH, it's only the scalar that ever is secret, the points multiplied are always public.
Yes.
Are there other uses of point multiplication where also the point is secret?
None that I'm aware of in ECDH or ECDSA.
regards, Nikos
On 01/17/2013 08:54 PM, Nikos Mavrogiannopoulos wrote:
- For general point multiplication, I tend towards using a simple non-sliding window, with a small point-specific table computed at run time. If the table is small enough, it might make sense to normalize the entries. As far as I understand, one can't beat the plain binary algorithm by a large factor; the algorithms I'm aware of only reduce the number of adds, but not the number of dups.
btw. moving from the sliding window version (the one I submitted last year) to wmnaf gave a 7-16% performance boost in transactions/sec on ECDH/ECDSA, depending on the size of the curve. So given that the sliding window method is faster than the simple one, it could make quite a difference.
regards, Nikos
Many thanks for your comments.
Nikos Mavrogiannopoulos n.mavrogiannopoulos@gmail.com writes:
On 01/17/2013 10:00 AM, Niels Möller wrote:
So first question is naming; what's the most established names for these curves?
Any would do. In any case both are used, so aliases from one to another should be good. In the context of TLS the SECP names are used.
I'll stick to the secp names for now then. And I think aliases are a matter for the documentation.
- For point multiplication with the fixed generator, I think I'll use the L=1 special case of Pippenger's algorithm, described in Daniel Bernstein's paper http://cr.yp.to/papers/pippenger.pdf. I think this is more or less the same as the "comb" method or "Lim-Lee"-method, but predating any bogus patents on the algorithm.
How does this algorithm compare to others?
If I read the gnutls code right, ecc_mulmod_cache makes the same number of ecc_projective_dbl_point calls as the non-cached version? If that is right, any comb-like algorithm should be able to beat it, since those algorithms reduce the number of point doublings and the number of iterations in the outer loop. Depending on selected table size, of course.
Do you have any ecc benchmarking code I could borrow?
The reason I insist on performance is that it would be hard for gnutls to switch to the new functions if they are significantly slower than the current ones,
I understand that. But there's going to be some cost to make operations side-channel silent (a lot more on what I mean by that below).
- For general point multiplication, I tend towards using a simple
non-sliding window, with a small point-specific table computed at run time. If the table is small enough, it might make sense to normalize the entries. As far as I understand, one can't beat the plain binary algorithm by a large factor; the algorithms I'm aware of only reduce the number of adds, but not the number of dups.
I really cannot answer with any facts since I need to check the details, but I've not seen other efficient implementations using a non-sliding window for that.
GMP's mpz_powm_sec uses a non-sliding window (and other alternative algorithms to make running time depend only on input size, not on actual values; the most severe slowdown is for division, used when converting inputs to Montgomery representation).
It seems to be 20% slower than mpz_powm for sizes comparable to ecc parameters (I tried with limb size 16):
~/hack/gmp/tune$ ./speed -s16 -r mpz_powm mpz_powm_sec
overhead 0.000000003 secs, precision 10000 units of 6.25e-10 secs, CPU freq 1600.00 MHz
            mpz_powm  mpz_powm_sec
16      #0.000414344        1.2084
Not all of the slowdown is from the non-sliding window.
Also for that part you don't need constant time. [...] (in ECDH you also don't care unless it is fixed key ECDH).
In ECDH, any data dependent timing does leak some information about your secret exponent. What you're saying, if I understand you correctly, is that since you usually use each secret exponent only once, an attacker will in most cases not be able to collect enough information to do any damage.
Anyway, the algorithm choice for this operation is not set in stone, and if we do a version which doesn't promise constant time, there's more freedom in choosing the algorithm.
Additions aren't that negligible. Also for that part you don't need constant time. In ECDSA you only multiply the fixed generator with the key (in ECDH you also don't care unless it is fixed key ECDH).
- For verifying ECDSA signatures, one needs to operate on two points, computing e_1 G + e_2 Y, where G is the generator (and hence offers possibilities for precomputation), and Y is the public key.
I would put that at lower priority than signing. Servers (where performance matters) rarely verify ECDSA signatures.
Noted (although I'm not sure I agree it's less important than signing performance).
I have had a quick look at the gnutls code. wmNAF makes me a bit uneasy when I think about side channels; there seems to be a lot of branches depending on exponent bits.
Why is that? Did you see any issues on the timing resistant version of the function?
Let me first explain what I mean when I talk about a "side-channel silent" function. That means that if we call the function with operands which are of the same size, but otherwise different, the function should execute exactly the same sequence of instructions in both cases, and access memory in exactly the same pattern.
If one then assumes that the underlying machine instructions have data-independent timing (true for most current CPUs), we leak no side information from timing or cache behaviour. We may still leak information through power analysis if, e.g., executing a multiplication instruction consumes different amounts of energy depending on the input bit patterns.
Maybe this is unnecessarily strict (I haven't really thought about what the side-channel attack models are for particular cryptographic uses of ecc), but I think this is what one should aim for. And I think it's possible to achieve, at a moderate cost in performance.
I'm not intimately familiar with wmnaf, but it looks like the function ecc_wMNAF has a lot of branches depending on the exponent bits. I also suspect (but please correct me if I'm wrong) that the number of generated digits is not determined by the exponent size only; it also depends on the actual bits?
Next, in the main ecc_mulmod functions, the timing resistant versions call the madd primitive the same number of times, independent of the values of the digits. But there are some small remaining leaks (by the above pretty high standard):
1. If wmnaf_len depends on the exponent bit pattern, then so does the number of iterations through the loop.
2. Branches depending on whether digit is > 0, < 0 or == 0. Those will have slightly different timing. Since digit == 0 is unlikely, that branch will be badly predicted. And it also affects instruction cache.
3. For the data cache, accesses to the wmnaf table are obviously data dependent. Also, data is written to different locations in memory (R or T) depending on whether or not the current digit is non-zero.
In the point addition, it seems hard to avoid special cases when adding a point to itself or to its negation, or adding the zero point.
Do you mean the infinity points or the point (0,0,0)?
The infinity point, which is the unit element of the group. I denote it 0 below (P + 0 = P for all P).
The core operation of a typical point multiplication algorithm is
R += T[digit];
where R is where the result is accumulated, and T is a table indexed by current digit (plain window, wmnaf, pippenger, whatever). Now there are a couple of cases where this operation will not end up in the common-case point-addition code:
1. R = 0. If the scalar n is in the range 0 < n < group order, this should happen only for the very first iterations of the loop.
2. R = -T[digit] (so R = 0 on output). For the above range of n, this should also not happen (except as R = T[digit] = 0 initially in the loop).
3. T[digit] = 0. This is the digit == 0 case. For best average performance, one wants to handle this case specially and not do any addition at all (and much of the "exponent recoding" tricks are aimed at making this a more common case).
But for constant time, one wants to do an addition just as for any other value. I have to double-check how Jacobian coordinates work, but I think the usual formulas for ecc point addition don't work in this case.
4. R = T[digit]. The point addition code must check for this case, and have a special case branching to point doubling instead. I'm not sure if this can happen; it clearly depends on which point multiplication algorithm is used. E.g., if the tabulated values in T are i*P for 0 <= i < M, and we know that R = j*P for some j in the range M <= j < prime group order, then this case can be excluded.
And in any case, I imagine it will be extremely unlikely for a random scalar.
Regards, /Niels
nisse@lysator.liu.se (Niels Möller) writes:
Let me first explain what I mean when I talk about a "side-channel silent" function. That means that if we call the function with operands which are of the same size, but otherwise different, the function should execute exactly the same sequence of instructions in both cases, and access memory in exactly the same pattern.
To give a flavor of what this means, here's some code (totally untested) I wrote earlier today. It's basic binary point multiplication, but intended to have running time independent of the exponent bits.
Regards, /Niels
/* nettle, low-level cryptographics library
 *
 * Copyright (C) 2013 Niels Möller
 *
 * The nettle library is free software; you can redistribute it and/or modify
 * it under the terms of the GNU Lesser General Public License as published by
 * the Free Software Foundation; either version 2.1 of the License, or (at your
 * option) any later version.
 *
 * The nettle library is distributed in the hope that it will be useful, but
 * WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY
 * or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public
 * License for more details.
 *
 * You should have received a copy of the GNU Lesser General Public License
 * along with the nettle library; see the file COPYING.LIB. If not, write to
 * the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston,
 * MA 02111-1301, USA.
 */
/* Development of Nettle's ECC support was funded by Internetfonden. */
#include <assert.h>
#include "ecc.h"
void
cnd_copy (int cnd, mp_limb_t *rp, const mp_limb_t *ap, mp_size_t n)
{
  mp_limb_t mask, keep;
  mp_size_t i;

  mask = -(mp_limb_t) (cnd != 0);
  keep = ~mask;

  for (i = 0; i < n; i++)
    rp[i] = (rp[i] & keep) + (ap[i] & mask);
}
void
ecc_mul_binary (const struct ecc_curve *ecc,
                mp_limb_t *r, const mp_limb_t *np, const mp_limb_t *p)
{
  mp_limb_t tp[9*ecc->size];
  mp_limb_t *rj = tp + 3*ecc->size;
  mp_limb_t *pj = tp + 6*ecc->size;
  int is_zero;
  unsigned i;

  mpn_zero (rj, 3*ecc->size);

  /* Extend p to jacobian coordinates, with z = 1 */
  mpn_copyi (pj, p, 2*ecc->size);
  pj[2*ecc->size] = 1;
  mpn_zero (pj + 2*ecc->size + 1, ecc->size - 1);

  for (i = ecc->size, is_zero = 1; i-- > 0; )
    {
      mp_limb_t w = np[i];
      mp_limb_t bit;

      for (bit = (mp_limb_t) 1 << (GMP_NUMB_BITS - 1); bit > 0; bit >>= 1)
        {
          int digit;

          ecc_dup_jj (ecc, rj, rj);
          ecc_add_jja (ecc, tp, rj, p);

          digit = (w & bit) > 0;
          /* If is_zero is set, rj is the zero point, and
             ecc_add_jja produced garbage. */
          cnd_copy (is_zero, tp, pj, 3*ecc->size);
          is_zero &= ~digit;
          /* If we had a one-bit, use the sum. */
          cnd_copy (digit, rj, tp, 3*ecc->size);
        }
    }
  ecc_normalize (ecc, r, rj);
}
On Fri, Jan 18, 2013 at 9:35 AM, Niels Möller nisse@lysator.liu.se wrote:
How does this algorithm compare to others?
If I read the gnutls code right, ecc_mulmod_cache makes the same number of ecc_projective_dbl_point calls as the non-cached version? If that is right, any comb-like algorithm should be able to beat it, since those algorithms reduce the number of point doublings and the number of iterations in the outer loop. Depending on selected table size, of course.
In the main loop there is no difference in the wmnaf code, but outside the loop the cached version does one doubling less and several additions less.
Do you have any ecc benchmarking code I could borrow?
I use gnutls-cli's benchmark-tls-kx option. It measures TLS key exchange time (one of them is ECDHE-ECDSA using a 192-bit curve).
Also for that part you don't need constant time. [...] (in ECDH you also don't care unless it is fixed key ECDH).
In ECDH, any data dependent timing does leak some information about your secret exponent. What you're saying, if I understand you correctly, is that since you usually use each secret exponent only once, an attacker will in most cases not be able to collect enough information to do any damage.
Yes. That's the way ECDHE is used in gnutls. No key is reused and the timing information from a single session isn't sufficient to recover it.
I would put that at lower priority than signing. Servers (where performance matters) rarely verify ECDSA signatures.
Noted (although I'm not sure I agree it's less important than signing performance).
Sorry, indeed, that's for TLS only. In TLS usually only the server is authenticated. If there is client authentication too, then this is also important.
Why is that? Did you see any issues on the timing resistant version of the function?
Let me first explain what I mean when I talk about a "side-channel silent" function. That means that if we call the function with operands which are of the same size, but otherwise different, the function should execute exactly the same sequence of instructions in both cases, and access memory in exactly the same pattern.
If one then assumes that the underlying machine instructions have data-independent timing (true for most current CPUs), we leak no side information from timing or cache behaviour. We may still leak information through power analysis if, e.g., executing a multiplication instruction consumes different amounts of energy depending on the input bit patterns.
You are correct, but power analysis is mainly a concern for dedicated devices such as smart cards; a general-purpose library will not be used there anyway. Protection from timing analysis is important, and while its effect has not been precisely measured, the gnutls wmnaf code contains counter-measures for it. If you check the main loop in ecc_mulmod_cached_timing() there is an "else" case where a dummy addition is performed. That is to make the number of operations independent of the input data.
I'm not intimately familiar with wmnaf, but it looks like the function ecc_wMNAF has a lot of branches depending on the exponent bits.
Could be, but ecc_wMNAF is a fixed conversion that doesn't depend on any input from an attacker. That is, the best one can do is associate the key with the time needed for its wmNAF conversion. I don't think this is better than the association with its public key.
Next, in the main ecc_mulmod functions, the timing resistant versions call the madd primitive the same number of times, independent of the values of the digits. But there are some small remaining leaks (by the above pretty high standard):
- If wmnaf_len depends on the exponent bit pattern, then so does the number of iterations through the loop.
This is a fixed association again. An attacker cannot control wmnaf_len or the number of iterations. Timing attacks usually work by the attacker trying different inputs and measuring the different timings. In the cases you describe, as far as I understand, different inputs will not provide additional timing information.
- Branches depending on whether digit is > 0, < 0 or == 0. Those will have slightly different timing. Since digit == 0 is unlikely, that branch will be badly predicted. And it also affects instruction cache.
- For the data cache, accesses to the wmnaf table are obviously data dependent. Also, data is written to different locations in memory (R or T) depending on whether or not the current digit is non-zero.
Could be; these are very cpu dependent. If your argument is that a really safe timing-resistant multiplication has to be used, then I wouldn't disagree, as long as it is not so slow that no one uses it. However, there is no reason for code that does not need to be timing resistant not to be as fast as possible.
In the point addition, it seems hard to avoid special cases when adding a point to itself or to its negation, or adding the zero point.
Do you mean the infinity points or the point (0,0,0)?
The infinity point, which is the unit element of the group. I denote it 0 below (P + 0 = P for all P).
I've seen some implementations that do not handle that at all, but I agree that it has to be handled properly.
regards, Nikos
Nikos Mavrogiannopoulos n.mavrogiannopoulos@gmail.com writes:
On Fri, Jan 18, 2013 at 9:35 AM, Niels Möller nisse@lysator.liu.se wrote:
- If wmnaf_len depends on the exponent bit pattern, then so does the number of iterations through the loop.
This is a fixed association again.
Now I'm getting confused. In ECDSA signing, the point multiplication is
k * G
where k is a nonce (used only once), and G is the (public) group generator. If you are saying that leaking a little information about k via the timing of this multiplication is no problem, then the conclusion seems to be that, for ECDSA, there's no need whatsoever to make the point multiplication timing resistant? Right?
What is the main argument here, that the attacker has no control over k, or that k is used only once?
For each signature (r, s) on hash h using secret key z, the attacker gets to observe the values
h, r, s, with s = (h + z r)/k
I think there *is* a problem with leaking just a little bit of information about each k, since z = (k s - h) / r, and hence every piece of information about k implies a piece of information about z, and the information about z accumulates the more signatures you get.
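To spell out that algebra with toy numbers (a small prime standing in for the group order q; no real curve involved, values chosen arbitrarily): a fully known k gives z immediately, and partial knowledge of many k's leaks z gradually by the same relation.

```c
#include <stdint.h>

#define QORD 1000003u  /* small prime standing in for the group order q */

uint64_t powm_q (uint64_t b, uint64_t e)
{
  uint64_t r = 1;
  for (; e; e >>= 1, b = b * b % QORD)
    if (e & 1) r = r * b % QORD;
  return r;
}

uint64_t inv_q (uint64_t x) { return powm_q (x, QORD - 2); }  /* Fermat */

/* s = (h + z*r) / k mod q, the signing equation of (EC)DSA. */
uint64_t sign_s (uint64_t h, uint64_t z, uint64_t r, uint64_t k)
{
  return (h + z * r % QORD) % QORD * inv_q (k) % QORD;
}

/* An observer who learns k recovers the long-term key:
   z = (k*s - h) / r mod q. */
uint64_t recover_z (uint64_t h, uint64_t r, uint64_t s, uint64_t k)
{
  return (k * s % QORD + QORD - h % QORD) % QORD * inv_q (r) % QORD;
}
```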
And this issue is the same for plain DSA. Daniel Bleichenbacher's attack on biased generation of k, from back in 2001, could be related, even though you would get a different type of partial information about k from timing attacks.
If it's not already done, this type of timing attack on DSA should be a great topic for a side-channel cryptanalysis paper (I'm not really following current cryptanalysis research, unfortunately).
Regards, /Niels
On Wed, Jan 23, 2013 at 12:45 PM, Niels Möller nisse@lysator.liu.se wrote:
This is a fixed association again.
Now I'm getting confused. In ECDSA signing, the point multiplication is k * G where k is a nonce (used only once), and G is the (public) group generator. If you are saying that leaking a little information about k via the timing of this multiplication is no problem, then the conclusion seem to be that, for ECDSA, there's no need whatsoever to make the point multiplication timing resistant? Right?
My comments were not for ECDSA specifically. ECDSA is pretty fragile: although k is used only once per signature, if it becomes known it can be used to obtain the (long-term) private key. Because it is the private key you want to protect, any value that may leak information about it should be protected as well.
What is the main argument here, that the attacker has no control over k, or that k is used only once?
It is used only once, and for this reason the attacker can only get a single timing for it. However, because the security of the ECDSA private key depends on that value, it is reasonable to protect it.
I think there *is* a problem with leaking just a little bit of information about each k, since z = (k s - h) / r, and hence every piece of information about k implies a piece of information about z, and the information about z accumulates the more signatures you get.
Could be. Unfortunately, unlike RSA, there is not much published on timing attacks on (EC)DSA or their prevention (or at least not known to me).
regards, Nikos
Nikos Mavrogiannopoulos n.mavrogiannopoulos@gmail.com writes:
My comments were not for ECDSA specifically. ECDSA is pretty fragile.
For me, ECDSA is the primary application of elliptic curves. So then it seems important to use a point multiplication k * G which has a running time independent of the bits in k. That's why I find the method used in gnutls a bit worrying.
Could be. Unfortunately, unlike RSA, there is not much published on timing attacks on (EC)DSA or their prevention (or at least not known to me).
Maybe one could get some insights from Bleichenbacher's paper. I have not read up on the details, but as far as I understand, the point was that k in the range 1 <= k < 2^bit_size - q were generated with a higher probability density than numbers in the range 2^bit_size - q < k < q.
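That kind of bias is easy to demonstrate exhaustively with small numbers (all values illustrative, chosen only so the count fits in a loop): reducing a uniform n-bit value mod q makes the small residues land twice as often as the large ones whenever q < 2^n < 2q.

```c
#define QB 787      /* small prime standing in for the group order */
#define NBITS 10    /* 2^10 = 1024, so QB < 2^NBITS < 2*QB */

/* Exhaustively count how often each residue appears when every
   NBITS-bit value is reduced mod QB. Residues 0 .. 2^NBITS - QB - 1
   occur twice; the rest occur once. */
void count_residues (unsigned counts[QB])
{
  unsigned x, i;

  for (i = 0; i < QB; i++)
    counts[i] = 0;
  for (x = 0; x < (1u << NBITS); x++)
    counts[x % QB]++;
}
```

For real parameter sizes the bias is tiny per nonce, but it is systematic, which is what made it exploitable across many signatures.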
And for a wnaf timing attack, one idea would be to divide signatures into two classes, those generated with a k expanding to n wnaf digits, and those generated with a k expanding to n+1 digits. Each class corresponds to a different set of possible k, and if the classification is not perfect, you still have a bias in each guessed class.
And I don't think prevention is too difficult. Just write the algorithm to not include any branches depending on bits in k, and preferably (for side-channels related to caches) also with all memory accesses independent of the k bits. There's going to be some performance penalty, of course.
Regards, /Niels
On 01/23/2013 04:37 PM, Niels Möller wrote:
Nikos Mavrogiannopoulos n.mavrogiannopoulos@gmail.com writes:
My comments were not for ECDSA specifically. ECDSA is pretty fragile.
For me, ECDSA is the primary application of elliptic curves. So then it seems important to use a point multiplication k * G which has a running time independent of the bits in k. That's why I find the method used in gnutls a bit worrying.
Not for TLS. In TLS, ECDHE is the primary application of elliptic curves. In the last SSL observatory data, the RSA keys on the internet were 4 million+, whereas the ECDSA keys were only 6 (that's three years ago, but I don't think there has been a radical change since then). Typically one uses ECDHE with RSA keys in TLS.
Nevertheless, the method that you use for the timing-sensitive parts of the code doesn't need to match the optimized version.
regards, Nikos
Nikos Mavrogiannopoulos n.mavrogiannopoulos@gmail.com writes:
Not for TLS. In TLS ECDHE is the primary application of elliptic curves. [...] Typically one uses ECDHE with RSA keys in TLS.
I see. What's the motivation, saving cycles in the key exchange, or smaller messages, or higher security, or something else?
Nevertheless, the method that you use for the timing sensitive parts of the code, doesn't need to match the optimized version.
If the performance penalty for the timing-resistant functions is more than, say, 10%, it would make some sense to include a separate set of functions optimized for speed. But I think I want to do the timing-resistant things first.
Here are some benchmarks for my first working version (the unit is operations per millisecond):
On matrix (Intel i5, 3.4 GHz):
name    size    sign / ms   verify / ms
rsa     1024       6.3145      104.0742
rsa     2048       0.9555       29.4275
dsa     1024      11.1746       5.7460
ecdsa    192       2.1167       1.5355
ecdsa    224       1.2371       1.0234
ecdsa    256       1.3845       1.0182
ecdsa    384       0.6415       0.4751
ecdsa    521       0.2515       0.2037
On pandix (ARM Cortex A9, 1 GHz):
name    size    sign / ms   verify / ms
rsa     1024       0.2626       4.5527
rsa     2048       0.0392       1.2490
dsa     1024       0.4694       0.2377
ecdsa    192       0.1906       0.1445
ecdsa    224       0.1418       0.1058
ecdsa    256       0.1050       0.0796
ecdsa    384       0.0431       0.0327
ecdsa    521       0.0173       0.0135
For the verify operation, RSA is by far fastest. I benchmark RSA using a small public exponent, 65537, which I think is typical.
For signing, plain DSA is currently fastest (I'd like to also include "DSA2", with 2048-bit p and 256-bit q, but the key generation programs I had at hand didn't support that).
Still, signing with 256-bit ecdsa is a bit faster than 2048 bit RSA.
When optimizing, I think one can aim for a factor 10 speedup for signing and maybe a factor 2 for verification (the higher potential for signing is because it uses point multiplication only with the fixed generator, which can be sped up a lot using precomputed tables). Current code uses the binary algorithm I posted the other day, and no serious curve-specific optimizations.
I could set up a public repository for my work-in-progress code, if there's some interest.
Regards, /Niels
On Thu, Jan 24, 2013 at 8:38 AM, Niels Möller nisse@lysator.liu.se wrote:
Not for TLS. In TLS ECDHE is the primary application of elliptic curves. [...] Typically one uses ECDHE with RSA keys in TLS.
I see. What's the motivation, saving cycles in the key exchange, or smaller messages, or higher security, or something else?
If perfect forward secrecy is required then DHE is pretty inefficient on security levels of 96 bits or more. ECDHE provides a fast equivalent.
On matrix (Intel i5, 3.4 GHz):

name    size    sign / ms   verify / ms
rsa     1024       6.3145      104.0742
rsa     2048       0.9555       29.4275
dsa     1024      11.1746       5.7460
ecdsa    192       2.1167       1.5355
ecdsa    224       1.2371       1.0234
ecdsa    256       1.3845       1.0182
ecdsa    384       0.6415       0.4751
ecdsa    521       0.2515       0.2037
I usually use the table from ECRYPT II to compare equivalent security levels: http://www.keylength.com/en/3/. Otherwise the comparison may be unfair to ECDSA. RSA-1024 is still widely used today, but it is roughly equivalent to an ECDSA key of 144 bits.
For signing, plain DSA is currently fastest (I'd like to also include "DSA2", with 2048 bit p and 256 bit q, but my closest key generation programs didn't support that).
certtool does that.
I could set up a public repository for my work-in-progress code, if there's some interest.
Why not on the main repository as a branch?
regards, Nikos
Nikos Mavrogiannopoulos n.mavrogiannopoulos@gmail.com writes:
Why not on the main repository as a branch?
I already started working in a separate repository, and I'd like to have the overall structure decided before adding any files to the main repo. And I see no real gain from using branches as long as the sets of files are disjoint.
Regards, /Niels
"NM" == Niels Möller nisse@lysator.liu.se writes:
NM> So first question is naming; what's the most established names for NM> these curves? It seems gnutls uses, e.g., "SECP256R1", I guess we NM> can just follow that (rather than, e.g, "P-256", as it's called in NM> the DSA spec fips-186-3)?
RFC 6239 (Suite B Crypto Suites for SSH) has this to say about naming:
,----< excerpt from rfc6239.txt >
| Curve  NIST name  SECG name  OID [SEC2]
| ---------------------------------------------------------------
| P-256  nistp256   secp256r1  1.2.840.10045.3.1.7
| P-384  nistp384   secp384r1  1.3.132.0.34
`----
(Based on [SEC2], I presume the next line would be:
?-??? nistp521 secp521r1 1.3.132.0.35 )
So it looks like the secp...r1 names are as good as any.
-JimC