Here's a copy of the second status update, which I just submitted to Internetfonden.
Regards, /Niels
Nettle project funded by Internetfonden
Status update for February 2013
* Summary
New ECC code integrated in the Nettle repository. The code has been optimized with respect to both storage and computation requirements.
* Activities
Most of the time in February has been spent on the ECC code, both optimization, and integration work. The ECC code has been integrated into Nettle, including a preliminary high-level interface for ECDSA signatures. The code is available in the branch "ecc-support" in the repository at git://git.lysator.liu.se/nettle/nettle.git.
Final days and first days of February have been used to implement the most important curve-specific functions in ARM assembly. Independently if the ECC support, also the function memxor has been optimized for ARM.
During the month, roughly 120 working hours have been spent on the project.
* Elliptic curves
Point multiplication involving an arbitrary point now uses a side-channel silent window based algorithm. This gave a speedup of around 30% for ECDSA signature verification.
For each curve, arithmetic on the coordinates is done modulo a curve specific prime number, e.g., p = 2^192 - 2^64 + 1 for the curve "secp-192r1", and p = 2^224 - 2^96 - 1 for the curve "secp-224r1". These primes all have special structure, which can be used to speed up the modp operation following each multiplication of two coordinates. Since the key operation is add with carry, which is poorly supported by the C programming language, writing these functions in assembly is attractive. Reducing a 384 bit number, typically the product of two 192-bit numbers, modulo the 192-bit prime above can be done with 12 add with carry instructions on the x86_64 architecture, or 26 on the 32-bit ARM.
ARM assembly implementation gave a speedup of 2-4 times of the modulo operations, corresponding to a speedup around 50% for ECDSA sign and verify operations.
The functions for doing ECC point addition have also been optimized to reduce the amount of temporary storage. E.g, with the current code, ECDSA signatures over the curve secp-256r1 requires 384 bytes of temporary storage for signing, and 2080 bytes to verify a signature.
A few operations did not use side-channel silent algorithms earlier. These have been replaced by side-channel silent versions, which causes some slowdown. In particular, the side-channel silent modular inversion is very slow.
Benchmarks, as of March 5, including changes relative the the numbers in the previous status report:
Intel i5, 3.4 GHz:
name size sign/ms verify/ms rsa 1024 6.3299 105.0161 rsa 2048 0.9573 29.5316 dsa 1024 11.1947 5.7647 ecdsa 192 18.1878 +1.4% 6.2035 +64% ecdsa 224 8.9302 -11% 2.8714 +41% ecdsa 256 8.1958 -13% 2.6707 +20% ecdsa 384 3.1515 -10% 0.9866 +25% ecdsa 521 1.8874 -13% 0.6858 +30% ecdsa (openssl) 224 3.4829 3.0458 ecdsa (openssl) 384 1.4516 1.2711 ecdsa (openssl) 521 0.6855 0.5831
ARM Cortex A9, 1 GHz:
name size sign/ms verify/ms rsa 1024 0.2634 4.5464 rsa 2048 0.0392 1.2481 dsa 1024 0.4688 0.2381 ecdsa 192 1.2303 -4.5% 0.4318 +50% ecdsa 224 0.8526 +8.5% 0.3075 +72% ecdsa 256 0.6286 +6.5% 0.2243 +75% ecdsa 384 0.2532 +4.5% 0.0876 +61% ecdsa 521 0.1319 -0.6% 0.0448 +51% ecdsa (openssl) 224 0.1843 0.1563 ecdsa (openssl) 384 0.0693 0.0589 ecdsa (openssl) 521 0.0259 0.0214
So in these benchmarks, the net effect of the development is a great improvement of the speed of signature verification, with more mixed results for signature creation.
For comparison, the benchmark also includes figures for the ECDSA functions provided by the OpenSSL library (for the three curves supported by both Nettle and OpenSSL). On x86_64, signing is 2.2 -- 2.8 times faster than OpenSSL, and verification is from 22% slower to 18% faster. On ARM, signature performance is 3.6 -- 5 times faster, and verify performance is 1.5 -- 2 times faster.
Speaking of benchmarks, the ARM assembly for Nettle's memxor function, also developed during February, gave a speedup of 20% -- 50% depending on the input alignment.
* Remaining tasks
The ECC interface needs to be finalized and documented.
There are always additional optimizations that are possible. Writing x86_64 assembly for the modulo functions is tempting, but of low priority within this project. As mentioned, the modular inversion is slow, with the current code, 20% -- 30% of the time to create an ECDSA signature is spent computing a modular inverse. This could be sped up by assembly implementation of the primitives this algorithm needs, or by writing the complete function in assembler.
The optimization of other cryptographic primitives such as the AES cipher and the SHA256 hash function remains to do.
It has turned out that some internal functions in the GMP library, used for arithmetic on larger numbers, would be useful for the ECC implementation. One possible direction is to extend the public GMP interface so these functions can be used.
Nettle project funded by Internetfonden
Status update for March 2013
* Summary
ARM optimizations of cryptographic primitives.
* Activities
The time in March have been spent mainly on ARM assembly implementation of various cryptographic primitives. The portability of the new ECC code has been improved, with contributions from Martin Storsjö. In the process, also the non-ARM code and the testsuite has seen some improvements.
The GNU GMP library is used by Nettle for general arithmetic on large integers. For example, each multiplication of two ECC coordinates is done as a call to a general multiplication routine in GMP, followed by a call to an optimized curve-specific function for reducing the product modulo a fixed prime p. During the month, some work has been done to make more low-level operations in GMP available for users of the library. The functions mpn_cnd_add_n and mpn_cnd_sub_n, for side-channel silent conditional addition and subtraction, are now available and documented, as well as some functions for easy conversion betwen GMP's higher-level "mpz" interface and the lower-level "mpn" interface.
On the GMP side, Torbjörn Granlund have been improving the ARM code. Relinking Nettle code with the development version of GMP also give nice performance improvements.
During March, 103 working hours have been spent on the project.
* Results
The "normal" ARM instruction set offers 16 general purpose registers, of 32 bits each. Some ARM processors, including the Cortex-A9, also offer an extension for single-instruction-multiple-data (SIMD). The ARM "Neon" instruction set is in some ways similar to the SSE2 instructions available on current x86_64 processors, but well designed and much easier to work with. It offers 16 additional registers of 128 bits each, with both integer operations on up to 64-bit quantities, and floating point operations (which are not used in Nettle).
Using Neon instructions gives a dramatic speedup for the SHA512 and SHA3 hash functions, which make heavy use of 64-bit operations. The ARM assembly code for the other algorithms don't make any use of Neon, but they have nevertheless beeen sped up compared to the C implementation, due to better register allocation, and tricks to use aligned reads for possibly unaligned input data.
There is certainly room for additional optimizations to the assembly code, in particular improving instruction scheduling.
* Benchmarks
For these cryptographic primitives, the numbers are in units of MByte/s, benchmarked on a 1GHz ARM Cortex-A9.
Algorithm Before After Speedup
memxor (aligned) 988 1906 93% memxor (unaligned) 511 638 25% aes-128 17.3 21.9 27% aes-192 14.5 16.7 15% aes-256 12.5 16.1 29% salsa20 39.9 58.1 46% sha1 55.7 60.7 9% sha256 24.6 31.7 29% sha512 7.8 30.4 290% sha3-224 5.77 27.5 377% sha3-256 5.45 26.0 377% sha3-384 4.18 20.0 378% sha3-512 2.90 13.9 379%
Public key operations:
name size sign/ms verify/ms rsa 1024 0.5014 ( +90%) 9.1323 (+101%) rsa 2048 0.0835 (+113%) 2.6942 (+116%) dsa 1024 0.9857 (+110%) 0.5026 (+111%) ecdsa 192 1.4312 ( +16%) 0.5530 ( +28%) ecdsa 224 1.0072 ( +18%) 0.4037 ( +31%) ecdsa 256 0.7846 ( +25%) 0.3094 ( +38%) ecdsa 384 0.3308 ( +31%) 0.1307 ( +49%) ecdsa 521 0.1823 ( +38%) 0.0719 ( +60%) ecdsa (openssl) 224 0.1842 0.1545 ecdsa (openssl) 384 0.0695 0.0590 ecdsa (openssl) 521 0.0261 0.0216
The improvements here are due to GMP work.
* Remaining tasks
The most important remaining task is documentation and release work. Plan and progress for the release is maintained at http://www.lysator.liu.se/~nisse/nettle/plan.html.
nettle-bugs@lists.lysator.liu.se