nettle-bugs April 2013

nettle-bugs@lists.lysator.liu.se

10 participants
20 discussions

Support for via quadcore SHA512 hw acceleration
by Shaun Murphy 21 Apr '13

The limited literature for the newer VIA QuadCore E-Series embedded platform says that it now supports "Secure Hash Algorithm: SHA-1, SHA-256, SHA-384, SHA-512" but I'm not seeing any acceleration for SHA512 in the kernel modules or nettle / gnutls. I would appreciate some pointers on what I need to do to access that SHA512 acceleration in nettle.

Here's my setup:

Via Artigo A1250
Ubuntu 12.04 x86_64
Gnutls - built from git
Nettle - built from 2.6 source
Kernel modules: padlock_aes, padlock_sha

Here's my dmesg output for the loaded modules:

[ 2.345061] padlock_aes: Using VIA PadLock ACE for AES algorithm.
[ 2.364105] padlock_sha: Using VIA PadLock ACE for SHA1/SHA256 algorithms.

gnutls Benchmark Soft Ciphers:
Checking SHA1 (16kb payload)... Processed 464.73 MB in 5.00 secs: 92.95 MB/sec
Checking SHA256 (16kb payload)... Processed 180.04 MB in 5.00 secs: 36.01 MB/sec
Checking SHA512 (16kb payload)... Processed 267.39 MB in 5.00 secs: 53.48 MB/sec

gnutls Benchmark Ciphers:
Checking SHA1 (16kb payload)... Processed 1.51 GB in 5.00 secs: 0.30 GB/sec
Checking SHA256 (16kb payload)... Processed 1.30 GB in 5.00 secs: 0.26 GB/sec
Checking SHA512 (16kb payload)... Processed 267.45 MB in 5.00 secs: 53.49 MB/sec

The SHA256 numbers are great but I really need SHA512 for my application.

Thank you.

2 1

[PATCH] Fix a typo in a comment
by Martin Storsjö 18 Apr '13

---
 aclocal.m4 | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/aclocal.m4 b/aclocal.m4
index 0d24fc2..98b399b 100644
--- a/aclocal.m4
+++ b/aclocal.m4
@@ -610,8 +610,8 @@ AC_SUBST(EXEEXT_FOR_BUILD,$gmp_cv_prog_exeext_for_build)
 
 dnl NETTLE_CHECK_ARM_NEON
 dnl ---------------------
-dnl Check if ARM Neon instructinos should be used.
-dnl Obeys enable_arn_neon, which should be set earlier.
+dnl Check if ARM Neon instructions should be used.
+dnl Obeys enable_arm_neon, which should be set earlier.
 AC_DEFUN([NETTLE_CHECK_ARM_NEON],
 [if test "$enable_arm_neon" = auto ; then
   if test "$cross_compiling" = yes ; then
-- 
1.7.9.4

1 0

rename of salsa20r12
by Nikos Mavrogiannopoulos 17 Apr '13

The attached patch renames the salsa20r12_crypt function to estream_salsa20_crypt(), and adds it in the benchmarks. What is missing is an equivalent of x86_64/salsa20-crypt.asm for the estream variant. regards, Nikos

2 4

Micro optimizations of the umac context structs
by nisse＠lysator.liu.se 16 Apr '13

Speaking of umac, I'm also looking at the umac context structs, for potential micro optimizations and fixes before it becomes a part of the ABI.

Some fields, like nonce_length, index, and (for umac32 and umac64) nonce_low, fit in 16 or even 8 bits. So it might make sense to make them adjacent.

And on the other hand, the umac block count is currently unsigned, and will wrap around after 2^32 blocks or 2^42 bytes. Other hash functions typically support data sizes up to 2^64 (except sha512 which uses a 128-bit counter, which seems gross overkill). For umac, the block counter is only needed to keep track of when to switch to different layer 2 hashing, and to keep track of odd and even blocks for poly128. So it could probably be made to work with only 16 bits and some saturation logic. But extending it to 64 bits seems simpler.

It would also be nice if we could force 16-byte alignment for the l1_key array (this is important for assembly routines), which would then imply 16-byte alignment for the complete context struct. Could help x86 sse2 assembly. And could help also on ARM, but I'm not sure if the system (primarily linker and malloc) really makes 16-byte alignment possible there. And it would also be good to get a reasonably large alignment for the block buffer.

In gcc, there's __attribute__ ((aligned (16))), but since this becomes part of the ABI, we can't use it in public headers unless we can specify the same alignment for *all* reasonable compilers for the given architecture.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.
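The layout ideas in this message can be sketched in C11 as follows. This is only an illustration: the struct name, the field order, and the array sizes are hypothetical and are not taken from nettle's umac.h, and it uses C11 `alignas` rather than the gcc attribute discussed above, so it does not settle the ABI question about other compilers.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical context layout, for illustration only. */
struct toy_umac_ctx
{
  /* 16-byte aligned key array; this also forces 16-byte alignment
     of the complete struct. The size 256 is an assumption. */
  alignas(16) uint32_t l1_key[256];
  /* Small fields kept adjacent so they pack into one 32-bit slot. */
  uint16_t nonce_low;
  uint8_t nonce_length;
  uint8_t index;
  /* 64-bit block counter instead of plain unsigned. */
  uint64_t count;
  /* Block buffer, also given a large alignment. */
  alignas(16) uint8_t block[1024];
};

static_assert(alignof(struct toy_umac_ctx) == 16,
              "aligned member should raise the struct alignment");

/* Bytes that can be hashed before a block counter of the given width
   wraps, assuming 1024-byte blocks. Only valid for bits < 54. */
uint64_t toy_bytes_before_wrap(unsigned bits)
{
  return (UINT64_C(1) << bits) * 1024;
}
```

With a 32-bit counter, `toy_bytes_before_wrap(32)` gives 2^42 bytes, matching the limit mentioned in the message.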

2 5

Internal vs installed headers
by nisse＠lysator.liu.se 16 Apr '13

I'm considering moving some macros from macros.h (installed as <nettle/macros.h>) to nettle-internal.h (not installed). Only the various READ, WRITE and ROTL macros seem generally useful. Comments?

I'm also looking at the current umac.h. It could be split into umac.h and umac-internal.h, but I'm not sure that's needed; the "internal" definitions are reasonably clean and seem unlikely to cause problems.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.

2 3

umac
by nisse＠lysator.liu.se 12 Apr '13

I have now read the UMAC spec (RFC 4418) a bit more carefully. I haven't yet read Nikos's code (or any other code, for that matter). Some thoughts:

o I don't like the way endian conversion is done in the spec. I'd prefer to think about the various functions as operating on arrays of 32-bit words, and implementation should use integer types of the right size to get correct alignment etc.

o The "NH" function looks like a candidate for assembly implementation. I don't know if there's anything else in the algorithm which really is performance critical? (And here we get a contradiction to the first point: it may be best for performance to have the NH function get the unaligned byte array as input, to be able to use assembly tricks when reading it into integers. Anyway, we should really avoid byte arrays in the internal interfaces between L1/L2/L3).

o *Maybe* optimization of the L2 and L3 hashes will be important. Profiling is needed, I guess, and they should be optimized *after* L1/NH.

o Since I have been working with side-channel silence recently, it seems natural to try to make the POLY function silent. On the other hand, I'm not sure what the threats are. If the MAC is applied to a secret message, we may leak some information about the message, I guess?

o I think we ought to handle large messages correctly, which means we need the POLY function also over the 128-bit prime. Performance is not terribly important, at least not initially.

o I'm not sure exactly how the building blocks fit together, but we should strive for pipelining. When we have the first message block M_1, apply L1 to that block, then apply L2 and L3 to the output as soon as possible. And for the larger tag lengths, also try to make that looping inside the loop processing the sequence of message blocks, so we can discard M_1 before starting to work on the next block M_2.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.
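To make the NH discussion concrete, here is a minimal C sketch of NH operating directly on arrays of 32-bit words, as the first point prefers. It follows my reading of RFC 4418 (within each group of 8 words, word i is paired with word i+4, additions are mod 2^32, and the products are accumulated mod 2^64). The name `toy_nh` and its signature are hypothetical, it is not nettle's code, and it deliberately leaves out the byte-order conversion and padding that the spec requires.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* NH over word arrays. n_words must be a multiple of 8, and key
   must hold at least n_words words. */
uint64_t toy_nh(const uint32_t *key, const uint32_t *msg, size_t n_words)
{
  uint64_t y = 0;
  size_t i, j;

  assert(n_words % 8 == 0);

  for (i = 0; i < n_words; i += 8)
    for (j = 0; j < 4; j++)
      {
        /* uint32_t arithmetic wraps, giving addition mod 2^32. */
        uint32_t a = msg[i + j] + key[i + j];
        uint32_t b = msg[i + j + 4] + key[i + j + 4];
        /* 32x32 -> 64 bit multiply, accumulated mod 2^64. */
        y += (uint64_t) a * b;
      }
  return y;
}
```

The inner multiply-accumulate is the part that an assembly version would most likely target.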

3 20

Building on AIX 6.1
by Perry Smith 10 Apr '13

Hi Guys, I got libnettle to build on AIX 6.1 but I had to edit the Makefile.in and the configure.ac file. Editing the configure.ac is fine. But I had to change the make rules so the shared library depended upon nettle_OBJS instead of nettle_PURE_OBJS. I *think* (but I'm not sure) that the .po suffix confuses the linker. It appears as if it was just ignoring all the .po files. I'm happy to post my diffs but they may not help much. I've never seen "pure" objects before. Is this really a useful concept? Thank you, Perry

2 4

Re: [PATCH] sha3: Correct _sha3_update for incremental hashing
by nisse＠lysator.liu.se 06 Apr '13

edgar.iglesias(a)gmail.com writes:

> diff --git a/sha3.c b/sha3.c
> index d7aec46..21e7beb 100644
> --- a/sha3.c
> +++ b/sha3.c
> @@ -61,7 +61,7 @@ _sha3_update (struct sha3_state *state,
>    if (pos > 0)
>      {
>        unsigned left = block_size - pos;
> -      if (length < pos)
> +      if (length < left)
>  	{
>  	  memcpy (block + pos, data, length);
>  	  return pos + length;

Thanks, checked in now. Unfortunately, testutils.c:test_hash doesn't exercise this logic.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.
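The buffering logic that this patch corrects can be sketched in isolation. The code below is a simplified stand-in, not nettle's _sha3_update: the names `toy_state`, `toy_absorb` and `toy_update`, the 8-byte block size, and the trivial mixing step are all assumptions made for the example. The point is the comparison against `left`, the free space in the partial block, rather than against `pos`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_BLOCK_SIZE 8

struct toy_state
{
  uint8_t block[TOY_BLOCK_SIZE]; /* partial block buffer */
  unsigned pos;                  /* number of bytes buffered */
  uint32_t acc;                  /* stand-in for the real hash state */
};

/* Stand-in for the permutation: mix one full block into acc. */
static void toy_absorb(struct toy_state *s, const uint8_t *block)
{
  unsigned i;
  for (i = 0; i < TOY_BLOCK_SIZE; i++)
    s->acc = s->acc * 31 + block[i];
}

void toy_update(struct toy_state *s, size_t length, const uint8_t *data)
{
  assert(s->pos < TOY_BLOCK_SIZE);
  if (s->pos > 0)
    {
      unsigned left = TOY_BLOCK_SIZE - s->pos;
      if (length < left) /* comparing with pos here was the bug */
        {
          memcpy(s->block + s->pos, data, length);
          s->pos += length;
          return;
        }
      /* Fill up the partial block and process it. */
      memcpy(s->block + s->pos, data, left);
      toy_absorb(s, s->block);
      data += left;
      length -= left;
      s->pos = 0;
    }
  /* Process full blocks directly from the input. */
  for (; length >= TOY_BLOCK_SIZE;
       length -= TOY_BLOCK_SIZE, data += TOY_BLOCK_SIZE)
    toy_absorb(s, data);
  /* Buffer whatever is left over. */
  memcpy(s->block, data, length);
  s->pos = length;
}
```

A test that exercises this path feeds the same message once in a single call and once in several uneven pieces, and checks that both end in the same state.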

1 0

ECC status update for February
by nisse＠lysator.liu.se 04 Apr '13

Here's a copy of the second status update, which I just submitted to Internetfonden.

Regards,
/Niels

Nettle project funded by Internetfonden
Status update for February 2013

* Summary

New ECC code integrated in the Nettle repository. The code has been optimized with respect to both storage and computation requirements.

* Activities

Most of the time in February has been spent on the ECC code, both optimization, and integration work. The ECC code has been integrated into Nettle, including a preliminary high-level interface for ECDSA signatures. The code is available in the branch "ecc-support" in the repository at git://git.lysator.liu.se/nettle/nettle.git.

Final days and first days of February have been used to implement the most important curve-specific functions in ARM assembly. Independently of the ECC support, also the function memxor has been optimized for ARM.

During the month, roughly 120 working hours have been spent on the project.

* Elliptic curves

Point multiplication involving an arbitrary point now uses a side-channel silent window based algorithm. This gave a speedup of around 30% for ECDSA signature verification.

For each curve, arithmetic on the coordinates is done modulo a curve specific prime number, e.g., p = 2^192 - 2^64 - 1 for the curve "secp-192r1", and p = 2^224 - 2^96 + 1 for the curve "secp-224r1". These primes all have special structure, which can be used to speed up the modp operation following each multiplication of two coordinates. Since the key operation is add with carry, which is poorly supported by the C programming language, writing these functions in assembly is attractive.

Reducing a 384 bit number, typically the product of two 192-bit numbers, modulo the 192-bit prime above can be done with 12 add with carry instructions on the x86_64 architecture, or 26 on the 32-bit ARM. ARM assembly implementation gave a speedup of 2-4 times of the modulo operations, corresponding to a speedup around 50% for ECDSA sign and verify operations.
The functions for doing ECC point addition have also been optimized to reduce the amount of temporary storage. E.g., with the current code, ECDSA signatures over the curve secp-256r1 require 384 bytes of temporary storage for signing, and 2080 bytes to verify a signature.

A few operations did not use side-channel silent algorithms earlier. These have been replaced by side-channel silent versions, which causes some slowdown. In particular, the side-channel silent modular inversion is very slow.

Benchmarks, as of March 5, including changes relative to the numbers in the previous status report:

Intel i5, 3.4 GHz:

  name             size   sign/ms  change   verify/ms  change
  rsa              1024    6.3299            105.0161
  rsa              2048    0.9573             29.5316
  dsa              1024   11.1947              5.7647
  ecdsa             192   18.1878   +1.4%      6.2035    +64%
  ecdsa             224    8.9302    -11%      2.8714    +41%
  ecdsa             256    8.1958    -13%      2.6707    +20%
  ecdsa             384    3.1515    -10%      0.9866    +25%
  ecdsa             521    1.8874    -13%      0.6858    +30%
  ecdsa (openssl)   224    3.4829              3.0458
  ecdsa (openssl)   384    1.4516              1.2711
  ecdsa (openssl)   521    0.6855              0.5831

ARM Cortex A9, 1 GHz:

  name             size   sign/ms  change   verify/ms  change
  rsa              1024    0.2634              4.5464
  rsa              2048    0.0392              1.2481
  dsa              1024    0.4688              0.2381
  ecdsa             192    1.2303   -4.5%      0.4318    +50%
  ecdsa             224    0.8526   +8.5%      0.3075    +72%
  ecdsa             256    0.6286   +6.5%      0.2243    +75%
  ecdsa             384    0.2532   +4.5%      0.0876    +61%
  ecdsa             521    0.1319   -0.6%      0.0448    +51%
  ecdsa (openssl)   224    0.1843              0.1563
  ecdsa (openssl)   384    0.0693              0.0589
  ecdsa (openssl)   521    0.0259              0.0214

So in these benchmarks, the net effect of the development is a great improvement of the speed of signature verification, with more mixed results for signature creation.

For comparison, the benchmark also includes figures for the ECDSA functions provided by the OpenSSL library (for the three curves supported by both Nettle and OpenSSL). On x86_64, signing is 2.2 -- 2.8 times faster than OpenSSL, and verification is from 22% slower to 18% faster. On ARM, signature performance is 3.6 -- 5 times faster, and verify performance is 1.5 -- 2 times faster.
Speaking of benchmarks, the ARM assembly for Nettle's memxor function, also developed during February, gave a speedup of 20% -- 50% depending on the input alignment.

* Remaining tasks

The ECC interface needs to be finalized and documented.

There are always additional optimizations that are possible. Writing x86_64 assembly for the modulo functions is tempting, but of low priority within this project.

As mentioned, the modular inversion is slow: with the current code, 20% -- 30% of the time to create an ECDSA signature is spent computing a modular inverse. This could be sped up by assembly implementation of the primitives this algorithm needs, or by writing the complete function in assembler.

The optimization of other cryptographic primitives such as the AES cipher and the SHA256 hash function remains to be done.

It has turned out that some internal functions in the GMP library, used for arithmetic on larger numbers, would be useful for the ECC implementation. One possible direction is to extend the public GMP interface so these functions can be used.

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.
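The "special structure" reduction described in the report can be illustrated at toy scale. The sketch below uses a modulus of the shape 2^(3w) - 2^w - 1 (the shape of the NIST P-192 prime) with w = 8, so everything fits in 64-bit C arithmetic. The modulus 2^24 - 2^8 - 1 is chosen only for its shape and is not claimed to be prime; `TOY_M` and `toy_reduce` are hypothetical names. Unlike the assembly routines in the report, this version uses data-dependent loops and is not side-channel silent.

```c
#include <assert.h>
#include <stdint.h>

/* Toy modulus with the same shape as 2^192 - 2^64 - 1, scaled down. */
#define TOY_M ((UINT64_C(1) << 24) - (UINT64_C(1) << 8) - 1)

static_assert(TOY_M == 16776959, "2^24 - 2^8 - 1");

/* Reduce x modulo TOY_M without division, using
   2^24 = 2^8 + 1 (mod TOY_M): the high part is folded back
   into the low 24 bits with shifts and additions only. */
uint32_t toy_reduce(uint64_t x)
{
  while (x >> 24)
    {
      uint64_t hi = x >> 24;
      uint64_t lo = x & 0xffffff;
      x = lo + hi + (hi << 8); /* lo + hi * (2^8 + 1) */
    }
  /* Now x < 2^24, so at most one subtraction is needed. */
  while (x >= TOY_M)
    x -= TOY_M;
  return (uint32_t) x;
}
```

In the full-size case the same folding is done limb by limb, which is where the chains of add with carry instructions come from.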

1 1

Possible bug and patch for nettle
by Sarat Chandra Addepalli 01 Apr '13

1 0
