Hi, I've been thinking a bit more about the structure of polynomial
evaluation, which at a high level is rather similar for ghash and poly1305.
** Intro **
The function to be computed is
R_j = K (R_{j-1} + M_j)
where K is the secret key and M_j are the message blocks. Operations
take place in some finite field. With n message blocks, M_0, ...,
M_{n-1}, and initial R_{-1} = 0, we get
R_{n-1} = M_{n-1} K + M_{n-2} K^2 + ... + M_0 K^n
I.e., a degree n polynomial with coefficients M_j, constant term = 0,
evaluated at the point K.
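For example, with three blocks the recurrence unrolls as

  R_0 = K M_0
  R_1 = K (R_0 + M_1) = K M_1 + K^2 M_0
  R_2 = K (R_1 + M_2) = K M_2 + K^2 M_1 + K^3 M_0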
To be concrete, consider a 64-bit architecture, and a finite field where
elements are represented as two words (for poly1305 we actually need two
words plus a few extra bits, but let's ignore that for simplicity, and
for ghash, let's also ignore the complications from bit reversal). Let B
represent the bignum base. I'm going to be a bit handwavy here: we'll
have B = 2^64 or B = x^64, depending on the type of field.
The finite field arithmetic is defined by reduction mod P, where the
structure of P is nice: a leading one bit followed by more than 64
zeros, and then a few more non-zero bits at the end. This means that we
can define a multiply operation
Y_2 B^2 + Y_1 B + Y_0 = (X_1 B + X_0) (K_1 B + K_0) (mod P)
by four independent multiplication instructions (widening,
64x64 --> 128) involving X and some precomputed values depending on K,
with accumulation involving only shifts and adds, which have low
latency. The result is not fully reduced mod P; it consists of three
words.
We can then reduce this to two words by multiplying Y_2 by a suitable
single-word constant, with final two-word result
R = C Y_2 + Y_1 B + Y_0
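To make the structure concrete, here is a rough C sketch of one way to
get exactly this shape in the carry-less (ghash-like) case, assuming
P = B^2 + C with deg(C) < 64 and still ignoring ghash's real bit order:
take the precomputed values to be K itself and K' = K B mod P, so that
X K = X_1 K' + X_0 K (mod P). All names (clmul64, gf_key, and so on) are
made up for illustration; this is not how any actual Nettle code is
written.

  #include <stdint.h>

  /* Portable carry-less 64x64 -> 128 multiply; real code would use a
     pclmulqdq/vpmsumd-style instruction instead. */
  static void
  clmul64 (uint64_t r[2], uint64_t a, uint64_t b)
  {
    uint64_t lo = 0, hi = 0;
    unsigned i;
    for (i = 0; i < 64; i++)
      if ((b >> i) & 1)
        {
          lo ^= a << i;
          if (i > 0)
            hi ^= a >> (64 - i);
        }
    r[0] = lo;
    r[1] = hi;
  }

  struct gf_key
  {
    uint64_t k[2];   /* K */
    uint64_t kb[2];  /* K B mod P, precomputed */
  };

  static void
  gf_key_setup (struct gf_key *key, const uint64_t k[2], uint64_t c)
  {
    uint64_t t[2];
    key->k[0] = k[0];
    key->k[1] = k[1];
    /* K B = K_1 B^2 + K_0 B = K_1 C + K_0 B (mod P) */
    clmul64 (t, k[1], c);
    key->kb[0] = t[0];
    key->kb[1] = k[0] ^ t[1];
  }

  /* Y = X K mod P as three words: four independent multiplies,
     accumulation is xor only. */
  static void
  gf_mul_key (uint64_t y[3], const uint64_t x[2], const struct gf_key *key)
  {
    uint64_t p0[2], p1[2], p2[2], p3[2];
    clmul64 (p0, x[0], key->k[0]);
    clmul64 (p1, x[0], key->k[1]);
    clmul64 (p2, x[1], key->kb[0]);
    clmul64 (p3, x[1], key->kb[1]);

    y[0] = p0[0] ^ p2[0];
    y[1] = p0[1] ^ p2[1] ^ p1[0] ^ p3[0];
    y[2] = p1[1] ^ p3[1];
  }

  /* Reduce three words to two: R = C Y_2 + Y_1 B + Y_0. */
  static void
  gf_fold (uint64_t r[2], const uint64_t y[3], uint64_t c)
  {
    uint64_t t[2];
    clmul64 (t, y[2], c);
    r[0] = y[0] ^ t[0];
    r[1] = y[1] ^ t[1];
  }

The integer (poly1305-like) version has the same structure, but the
accumulation needs carry propagation and the extra key bits mentioned
above, which the sketch doesn't try to capture.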
Applying this to the original recurrence, R_j = K (R_{j-1} + M_j), the X
input corresponds to R_{j-1} + M_j, so the critical dependency path from
R_{j-1} to R_j includes *two* multiply latencies. E.g., if multiply latency
is 5 cycles, it's not possible to get this evaluation scheme to run
faster than 10 cycles per block. (In practice, the accumulation will
also contribute one or a few cycles to the critical path, but *much*
less than those two multiplies).
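With those helpers, the straightforward per-block loop would look
something like the sketch below (same made-up names); both gf_mul_key
and gf_fold sit on the path from R_{j-1} to R_j, which is where the two
multiply latencies come from.

  #include <stddef.h>

  /* r is the running R_j, assumed initialized to zero by the caller;
     m holds n two-word message blocks. */
  static void
  gf_hash (uint64_t r[2], const uint64_t *m, size_t n,
           const struct gf_key *key, uint64_t c)
  {
    size_t i;
    for (i = 0; i < n; i++)
      {
        uint64_t x[2], y[3];
        /* X = R + M_i; addition is xor in the carry-less case */
        x[0] = r[0] ^ m[2*i];
        x[1] = r[1] ^ m[2*i + 1];
        gf_mul_key (y, x, key);  /* first multiply latency */
        gf_fold (r, y, c);       /* second multiply latency */
      }
  }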
So the question is: how can we do better?
** Postponing reduction **
The approach taken in the new x86_64 poly1305 code I just pushed is to
skip the final reduction, and let the state be one word larger (this is
particularly cheap for poly1305, because we don't quite increase the
state size by a full word, only from two words + 3 bits to two words +
~60 bits). The multiply operation becomes
Y_2 B^2 + Y_1 B + Y_0 = (X_2 B^2 + X_1 B + X_0) (K_1 B + K_0) (mod P)
This can be arranged with 6 independent multiply instructions + cheap
accumulation. (I haven't worked out the details for the ghash case, but
I do expect that it's rather practical there too).
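Sticking to the simplified carry-less sketch from above (so not the real
ghash case), one way to arrange the six multiplies is to also precompute
K B^2 mod P and compute X K as X_2 (K B^2) + X_1 (K B) + X_0 K; again,
all names are made up.

  struct gf_key3
  {
    uint64_t k[2];    /* K */
    uint64_t kb[2];   /* K B   mod P */
    uint64_t kb2[2];  /* K B^2 mod P, computed from K B the same way
                         K B was computed from K */
  };

  /* Y = X K mod P, with a three-word (unreduced) X: six independent
     multiplies, xor-only accumulation, three-word result. */
  static void
  gf_mul_key3 (uint64_t y[3], const uint64_t x[3],
               const struct gf_key3 *key)
  {
    const uint64_t *kp[3] = { key->k, key->kb, key->kb2 };
    uint64_t p[2];
    unsigned i;

    y[0] = y[1] = y[2] = 0;
    for (i = 0; i < 3; i++)
      {
        /* x_i B^i K = x_i (B^i K mod P) (mod P) */
        clmul64 (p, x[i], kp[i][0]);
        y[0] ^= p[0];
        y[1] ^= p[1];
        clmul64 (p, x[i], kp[i][1]);
        y[1] ^= p[0];
        y[2] ^= p[1];
      }
  }

The per-block loop then consists of X = R + M_j (only the low two words
of R are affected by M_j) and a single gf_mul_key3 call, with one fold
after the last block to get back to two words.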
Then the dependency chain from one block to the next is reduced to one
multiply latency, 5 cycles in our example. If all other needed
instructions can be scheduled (manually, or by the processor's
out-of-order machinery) to run in 5 cycles in parallel with the
multiplies, we would get a running time of 5 cycles per block.
** Interleaving **
The other approach, used in the recent powerpc gcm code, is to
interleave multiple blocks. For simplicity, only consider 2-way
interleaving here. The key thing is that if we expand the recurrence
once, we get
R_j = K (M_j + K (R_{j-2} + M_{j-1}))
= K M_j + K^2 (R_{j-2} + M_{j-1})
We get two field multiplications, but one of them, K M_j, is completely
independent of previous blocks (R_{j-2}), and can be computed in
parallel. It may add a cycle or so to accumulation latency, but we can
do essentially twice as much work without making the critical path
longer. We get 8 independent multiply instructions, and one dependent
multiply for the final folding. This can be extended to more than two
blocks if needed (depending on the number of available registers).
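As a sketch, reusing the made-up helpers from above (key2 is a struct
gf_key set up for K^2, the number of blocks is assumed even, and the
tail is ignored), the 2-way interleaved loop could look like this, with
8 independent multiplies and one dependent fold per iteration:

  static void
  gf_hash_2way (uint64_t r[2], const uint64_t *m, size_t n, /* n even */
                const struct gf_key *key, const struct gf_key *key2,
                uint64_t c)
  {
    size_t b;
    for (b = 0; b + 1 < n; b += 2)
      {
        uint64_t x[2], u[3], v[3], y[3];
        x[0] = r[0] ^ m[2*b];
        x[1] = r[1] ^ m[2*b + 1];
        gf_mul_key (u, x, key2);           /* K^2 (R + M_b), on the critical path */
        gf_mul_key (v, &m[2*b + 2], key);  /* K M_{b+1}, independent of R */

        y[0] = u[0] ^ v[0];
        y[1] = u[1] ^ v[1];
        y[2] = u[2] ^ v[2];
        gf_fold (r, y, c);                 /* the one dependent multiply */
      }
  }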
Another variant could be to separate even and odd parts of the
polynomial being evaluated, and evaluate both parts at K^2. We can then
compute the two recurrences
E_j = K^2 (E_{j-1} + M_{2j})
O_j = K^2 (O_{j-1} + M_{2j+1})
in parallel. It's unclear to me what the pros and cons are compared to
the previous variant. One may get some advantage from both multiplies
using the same factor K^2. On the other hand, each recurrence has to be
accumulated and folded separately, which costs instructions and
registers. Maybe more useful for hardware implementation? This variant
is currently not used in Nettle.
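For completeness, a sketch of the even/odd variant with the same made-up
helpers (the final combination of E and O into the hash value is omitted
here, as it isn't spelled out above):

  /* E and O are two independent accumulators, both stepped by K^2;
     e[] and o[] are assumed initialized to zero by the caller. */
  static void
  gf_hash_even_odd (uint64_t e[2], uint64_t o[2],
                    const uint64_t *m, size_t n, /* n even */
                    const struct gf_key *key2, uint64_t c)
  {
    size_t b;
    for (b = 0; b + 1 < n; b += 2)
      {
        uint64_t x[2], y[3];

        x[0] = e[0] ^ m[2*b];
        x[1] = e[1] ^ m[2*b + 1];
        gf_mul_key (y, x, key2);
        gf_fold (e, y, c);        /* E_j = K^2 (E_{j-1} + M_{2j}) */

        x[0] = o[0] ^ m[2*b + 2];
        x[1] = o[1] ^ m[2*b + 3];
        gf_mul_key (y, x, key2);
        gf_fold (o, y, c);        /* O_j = K^2 (O_{j-1} + M_{2j+1}) */
      }
  }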
** Doing both **
It's possible to combine those two tricks. Processing of two blocks
would then be an operation of the form
Z_2 B^2 + Z_1 B + Z_0
= (X_2 B^2 + X_1 B + X_0) K^2 + (Y_1 B + Y_0) K
Here, the Xs (three words) represent R_{j-2} + M_{j-1}, the Ys represent
M_j, and the Zs represent R_j, as three words (without final folding).
We would need 10 independent multiplies, one more than with plain
interleaving, but the critical path includes only one multiply latency.
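A sketch of the combined loop, again with the made-up helpers (key2_3 is
a struct gf_key3 set up for K^2; the single fold after the loop is not
shown):

  static void
  gf_hash_2way_lazy (uint64_t r[3], const uint64_t *m, size_t n, /* n even */
                     const struct gf_key *key, const struct gf_key3 *key2_3)
  {
    size_t b;
    for (b = 0; b + 1 < n; b += 2)
      {
        uint64_t x[3], u[3], v[3];
        x[0] = r[0] ^ m[2*b];
        x[1] = r[1] ^ m[2*b + 1];
        x[2] = r[2];
        gf_mul_key3 (u, x, key2_3);        /* (R + M_b) K^2: 6 multiplies,
                                              on the critical path */
        gf_mul_key (v, &m[2*b + 2], key);  /* M_{b+1} K: 4 multiplies,
                                              independent */
        r[0] = u[0] ^ v[0];
        r[1] = u[1] ^ v[1];
        r[2] = u[2] ^ v[2];
      }
  }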
I think this is a promising alternative, if one would otherwise need to
interleave a large number of blocks to get full utilization of the
multipliers.
** How to choose **
When implementing one of those schemes, different processor resources
may be the bottleneck. I'd expect it to be one of
o Multiply latency, i.e., latency of the dependency chain from one block
to the next (including also a few additions, but multiply latency
will be the main part). If this is the bottleneck, it means all
other instructions can be scheduled in parallel, and the processor
will sit idle for some cycles, waiting for a multiply to complete.
Typical multiply latency is about 5 times that of an addition
(but the ratio differs quite a bit between processors, of course).
o Multiply throughput, i.e., the maximum number of (independent) multiply
instructions that can be run per cycle. Typical number is 0.5 -- 2.
If this is the bottleneck, the processor will spend some cycles idle,
waiting for a multiplier to be ready to accept a new input.
o A superscalar processor can issue several instructions in the same
cycle, but there's a fixed small limit. Typical number is 2 -- 6. So,
e.g., if the processor can issue maximum 4 instructions per cycle,
the evaluation loop consists of 40 instructions, and the loop
actually runs in close to 10 cycles per iteration, then instruction
issue is the bottleneck.
The tricks discussed in this note are useful for finding an evaluation
scheme where multiply latency isn't a bottleneck. But once a loop hits
the limit on multiply throughput or instructions per cycle, other tricks
are needed to optimize further. In particular, postponed reduction
has a cost in multiply throughput, since it needs some additional
multiply instructions.
I think one should aim to hit the limit on multiply throughput; that
limit is hard to get around (it's possible to reduce the number of
multiply instructions somewhat, with the Karatsuba trick, but due to the
additional overhead, that's likely to be useful only on processors with
particularly low multiply throughput).
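For reference, the Karatsuba trick applied to the plain 2x2-word
carry-less product looks something like this (three multiplies instead
of four, at the cost of a few extra xors; it doesn't combine directly
with the key-dependent folding used in the earlier sketches):

  /* Full four-word product Z = X K, schoolbook replaced by Karatsuba. */
  static void
  gf_mul_karatsuba (uint64_t z[4], const uint64_t x[2], const uint64_t k[2])
  {
    uint64_t p0[2], p1[2], pm[2];
    clmul64 (p0, x[0], k[0]);
    clmul64 (p1, x[1], k[1]);
    clmul64 (pm, x[0] ^ x[1], k[0] ^ k[1]);

    /* middle term = pm + p0 + p1, added at position B */
    z[0] = p0[0];
    z[1] = p0[1] ^ pm[0] ^ p0[0] ^ p1[0];
    z[2] = p1[0] ^ pm[1] ^ p0[1] ^ p1[1];
    z[3] = p1[1];
  }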
Regards,
/Niels
--
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.