Many thanks for your comments.
Nikos Mavrogiannopoulos n.mavrogiannopoulos@gmail.com writes:
On 01/17/2013 10:00 AM, Niels Möller wrote:
So first question is naming; what's the most established names for these curves?
Any would do. In any case both are used, so aliases from one to another should be good. In the context of TLS the SECP names are used.
I'll stick to secp names for now then. And I think aliases are for the documentation.
- For point multiplication with the fixed generator, I think I'll use the L=1 special case of Pippenger's algorithm, described in Daniel Bernstein's paper http://cr.yp.to/papers/pippenger.pdf. I think this is more or less the same as the "comb" method or "Lim-Lee"-method, but predating any bogus patents on the algorithm.
How does this algorithm compare to others?
If I read the gnutls code right, ecc_mulmod_cache makes the same number of ecc_projective_dbl_point calls as the non-cached version? If that is right, any comb-like algorithm should be able to beat it, since those algorithms reduce the number of point doublings and the number of iterations in the outer loop. Depending on selected table size, of course.
Do you have any ecc benchmarking code I could borrow?
The reason I insist on performance is that it would be hard for gnutls to switch to the new functions if they are significantly slower than the current ones,
I understand that. But there's going to be some cost to make operations side-channel silent (a lot more on what I mean by that below).
- For general point multiplication, I tend towards using a simple
non-sliding window, with a small point-specific table computed at run time. If the table is small enough, it might make sense to normalize the entries. As far as I understand, one can't beat the plain binary algorithm by a large factor; the algorithms I'm aware of only reduce the number of adds, but not the number of dups.
I really cannot answer with any facts since I need to check the details, but I've not seen other efficient implementations using a non-sliding window for that.
gmp's mpz_powm_sec uses a non-sliding window (and other alternative algorithms to make running time depend only on input size, not on actual values; the most severe slowdown is for division, used when converting inputs to Montgomery representation).
It seems to be 20% slower than mpz_powm for sizes comparable to ecc parameters (I tried with limb size 16):
~/hack/gmp/tune$ ./speed -s16 -r mpz_powm mpz_powm_sec
overhead 0.000000003 secs, precision 10000 units of 6.25e-10 secs, CPU freq 1600.00 MHz
        mpz_powm mpz_powm_sec
16   #0.000414344       1.2084
Not all of the slowdown is from the non-sliding window.
Also for that part you don't need constant time. [...] (in ECDH you also don't care unless it is fixed key ECDH).
In ECDH, any data dependent timing does leak some information about your secret exponent. What you're saying, if I understand you correctly, is that since you usually use each secret exponent only once, an attacker will in most cases not be able to collect enough information to do any damage.
Anyway, algorithm choice for this operation is not set in stone, and if we do version which doesn't promise constant time, there's more freedom in choosing algorithm.
Additions aren't that negligible. Also for that part you don't need constant time. In ECDSA you only multiply the fixed generator with the key (in ECDH you also don't care unless it is fixed key ECDH).
- For verifying ECDSA signatures, one needs to operate on two points, computing e_1 G + e_2 Y, where G is the generator (and hence offers possibilities for precomputation), and Y is the public key.
I would put that at lower priority than signing. Servers (where performance matters) rarely verify ECDSA signatures.
Noted (although I'm not sure I agree it's less important than signing performance).
I have had a quick look at the gnutls code. wmNAF makes me a bit uneasy when I think about side channels; there seems to be a lot of branches depending on exponent bits.
Why is that? Did you see any issues on the timing resistant version of the function?
Let me first explain what I mean when I talk about a "side-channel silent" function. That means that if we call the function with operands which are of the same size, but otherwise different, the function should execute exactly the same sequence of instructions in both cases, and access memory in exactly the same pattern.
If one then assumes that the underlying machine instructions have data-independent timing (true for most current cpus), we leak no side information from timing or cache behaviour. We may still leak information through power analysis, if, e.g., executing a multiplication instruction consumes a different amount of energy depending on the input bit patterns.
Maybe this is unnecessarily strict (I haven't really thought about what the side-channel attack models are for particular cryptographic uses of ecc), but I think this is what one should aim for. And I think it's possible to achieve, at a moderate cost in performance.
I'm not intimately familiar with wmnaf, but it looks like the function ecc_wMNAF has a lot of branches depending on the exponent bits. I also suspect (but please correct me if I'm wrong) that the number of generated digits is not determined by the exponent size only, it also depends on the actual bits?
Next, in the main ecc_mulmod functions, the timing-resistant version calls the madd primitive the same number of times independent of the values of the digits. But there are some small remaining leaks (by the above pretty high standard):
1. If wmnaf_len depends on the exponent bit pattern, then so does the number of iterations through the loop.
2. Branches depending on whether digit is > 0, < 0 or == 0. Those will have slightly different timing. Since digit == 0 is unlikely, that branch will be badly predicted. And it also affects instruction cache.
3. For the data cache, accesses to the wmnaf table are obviously data dependent. Also, data is written to different locations in memory (R or T) depending on whether or not the current digit is non-zero.
In the point addition, it seems hard to avoid special cases when adding a point to itself or to its negation, or adding the zero point.
Do you mean the infinity points or the point (0,0,0)?
The infinity point, which is the unit element of the group. I denote it 0 below (P + 0 = P for all P).
The core operation of a typical point multiplication algorithm is
R += T[digit];
where R is where the result is accumulated, and T is a table indexed by current digit (plain window, wmnaf, pippenger, whatever). Now there are a couple of cases where this operation will not end up in the common-case point-addition code:
1. R = 0. If the scalar n is in the range 0 < n < group order, this should happen only for the very first iterations of the loop.
2. R = - T[digit] (so R = 0 on output). For the above range of n, should also not happen (except as R = T[digit] = 0 initially in the loop).
3. T[digit] = 0. This is the digit == 0 case. For best average performance, one wants to handle this case specially and not do any addition at all (and much of the "exponent recoding" tricks are aimed at making this a more common case).
But for constant time, one wants to do an addition just as for any other value. I have to double-check how Jacobian coordinates work, but I think the usual formulas for ecc point addition don't work in this case.
4. R = T[digit]. The point addition code must check for this case, and have a special case branching to point doubling instead. I'm not sure if this can happen; it clearly depends on which point multiplication algorithm is used. E.g., if the tabulated values in T are i*P for 0 <= i < M, and we know that R = j*P for some j in the range M <= j < prime group order, then this case can be excluded.
And in any case, I imagine it will be extremely unlikely for a random scalar.
Regards, /Niels