Here's a copy of the second status update, which I just submitted to
Internetfonden.
Regards,
/Niels
Nettle project funded by Internetfonden
Status update for February 2013
* Summary
New ECC code integrated in the Nettle repository. The code has been
optimized with respect to both storage and computation requirements.
* Activities
Most of the time in February has been spent on the ECC code, both
optimization, and integration work. The ECC code has been integrated
into Nettle, including a preliminary high-level interface for ECDSA
signatures. The code is available in the branch "ecc-support" in the
repository at git://git.lysator.liu.se/nettle/nettle.git.
Final days and first days of February have been used to implement the
most important curve-specific functions in ARM assembly. Independently
if the ECC support, also the function memxor has been optimized for
ARM.
During the month, roughly 120 working hours have been spent on the
project.
* Elliptic curves
Point multiplication involving an arbitrary point now uses a
side-channel silent window based algorithm. This gave a speedup of
around 30% for ECDSA signature verification.
For each curve, arithmetic on the coordinates is done modulo a curve
specific prime number, e.g., p = 2^192 - 2^64 + 1 for the curve
"secp-192r1", and p = 2^224 - 2^96 - 1 for the curve "secp-224r1".
These primes all have special structure, which can be used to speed up
the modp operation following each multiplication of two coordinates.
Since the key operation is add with carry, which is poorly supported
by the C programming language, writing these functions in assembly is
attractive. Reducing a 384 bit number, typically the product of two
192-bit numbers, modulo the 192-bit prime above can be done with 12
add with carry instructions on the x86_64 architecture, or 26 on the
32-bit ARM.
ARM assembly implementation gave a speedup of 2-4 times of the modulo
operations, corresponding to a speedup around 50% for ECDSA sign and
verify operations.
The functions for doing ECC point addition have also been optimized to
reduce the amount of temporary storage. E.g, with the current code,
ECDSA signatures over the curve secp-256r1 requires 384 bytes of
temporary storage for signing, and 2080 bytes to verify a signature.
A few operations did not use side-channel silent algorithms earlier.
These have been replaced by side-channel silent versions, which causes
some slowdown. In particular, the side-channel silent modular
inversion is very slow.
Benchmarks, as of March 5, including changes relative the the numbers
in the previous status report:
Intel i5, 3.4 GHz:
name size sign/ms verify/ms
rsa 1024 6.3299 105.0161
rsa 2048 0.9573 29.5316
dsa 1024 11.1947 5.7647
ecdsa 192 18.1878 +1.4% 6.2035 +64%
ecdsa 224 8.9302 -11% 2.8714 +41%
ecdsa 256 8.1958 -13% 2.6707 +20%
ecdsa 384 3.1515 -10% 0.9866 +25%
ecdsa 521 1.8874 -13% 0.6858 +30%
ecdsa (openssl) 224 3.4829 3.0458
ecdsa (openssl) 384 1.4516 1.2711
ecdsa (openssl) 521 0.6855 0.5831
ARM Cortex A9, 1 GHz:
name size sign/ms verify/ms
rsa 1024 0.2634 4.5464
rsa 2048 0.0392 1.2481
dsa 1024 0.4688 0.2381
ecdsa 192 1.2303 -4.5% 0.4318 +50%
ecdsa 224 0.8526 +8.5% 0.3075 +72%
ecdsa 256 0.6286 +6.5% 0.2243 +75%
ecdsa 384 0.2532 +4.5% 0.0876 +61%
ecdsa 521 0.1319 -0.6% 0.0448 +51%
ecdsa (openssl) 224 0.1843 0.1563
ecdsa (openssl) 384 0.0693 0.0589
ecdsa (openssl) 521 0.0259 0.0214
So in these benchmarks, the net effect of the development is a great
improvement of the speed of signature verification, with more mixed
results for signature creation.
For comparison, the benchmark also includes figures for the ECDSA
functions provided by the OpenSSL library (for the three curves
supported by both Nettle and OpenSSL). On x86_64, signing is 2.2 --
2.8 times faster than OpenSSL, and verification is from 22% slower to
18% faster. On ARM, signature performance is 3.6 -- 5 times faster,
and verify performance is 1.5 -- 2 times faster.
Speaking of benchmarks, the ARM assembly for Nettle's memxor function,
also developed during February, gave a speedup of 20% -- 50% depending
on the input alignment.
* Remaining tasks
The ECC interface needs to be finalized and documented.
There are always additional optimizations that are possible. Writing
x86_64 assembly for the modulo functions is tempting, but of low
priority within this project. As mentioned, the modular inversion is
slow, with the current code, 20% -- 30% of the time to create an ECDSA
signature is spent computing a modular inverse. This could be sped up
by assembly implementation of the primitives this algorithm needs, or
by writing the complete function in assembler.
The optimization of other cryptographic primitives such as the AES
cipher and the SHA256 hash function remains to do.
It has turned out that some internal functions in the GMP library,
used for arithmetic on larger numbers, would be useful for the ECC
implementation. One possible direction is to extend the public GMP
interface so these functions can be used.
--
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.