Hi!
This series introduces a mechanism to support arch-specific, combined AES+GCM {en,de}cryption functions. These functions are stubbed by default and fall back to the separate hash and crypt functions if no arch override exists. The arch override can be provided either at build time via the appropriate config options or at runtime via the FAT mechanism.
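For a quick picture before the patches: the dispatch boils down to the following shape (condensed from the gcm.c changes in patch 1 below; not a standalone or verbatim copy):

  /* Stub, overridden by an arch-specific implementation via config or FAT;
     returning -1 means "not implemented". */
  int
  _nettle_gcm_aes_encrypt_c (const struct gcm_key *key, union nettle_block16 *x,
                             size_t length, const uint8_t *src, unsigned rounds,
                             const uint32_t *keys, uint8_t *dst, uint8_t *ctr)
  {
    /* ... */
    return -1; /* Not implemented */
  }

  /* In gcm_encrypt(): fall back to the separate ctr-crypt + hash path
     whenever the cipher is not AES or no combined routine is available. */
  if (rounds == NOT_AES ||
      _nettle_gcm_aes_encrypt_wrap (ctx, key, cipher, length,
                                    dst, src, rounds) == -1)
    {
      _nettle_ctr_crypt16 (cipher, f, gcm_fill, ctx->ctr.b, length, dst, src);
      _nettle_gcm_hash (key, &ctx->x, length, dst);
    }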
An implementation combining AES+GCM _can potentially_ yield significant performance boosts by allowing for increased instruction parallelism, avoiding C-function call overhead, providing more flexibility for assembly fine-tuning, etc. This series provides such an implementation based on the existing optimized Nettle routines for POWER9 and later processors. Benchmark results on a POWER9 Blackbird running at 3.5GHz are given at the end of this mail. Both builds were configured statically, i.e. not FAT. FAT performance is slightly lower for both but shows similar gains with this series. The OpenSSL build is based on the latest OpenSSL master with all PowerPC optimizations enabled.
Note that the gains on an early POWER10 system are even more impressive but unfortunately I cannot share those results publicly yet :(
AES+GCM combined (this series)
------------------------------
Algorithm           mode     Mbyte/s
gcm_aes128          encrypt  2567.62
gcm_aes128          decrypt  2582.32
gcm_aes128          update   7724.15
gcm_aes192          encrypt  2279.39
gcm_aes192          decrypt  2293.20
gcm_aes192          update   7724.41
gcm_aes256          encrypt  2054.09
gcm_aes256          decrypt  2061.25
gcm_aes256          update   7724.04
openssl gcm_aes128  encrypt  2336.93
openssl gcm_aes128  decrypt  2337.95
openssl gcm_aes128  update   6248.22
openssl gcm_aes192  encrypt  2113.93
openssl gcm_aes192  decrypt  2114.93
openssl gcm_aes192  update   6210.65
openssl gcm_aes256  encrypt  1936.95
openssl gcm_aes256  decrypt  1935.88
openssl gcm_aes256  update   6208.72
AES,GCM separate (nettle master)
--------------------------------
Algorithm           mode     Mbyte/s
gcm_aes128          encrypt  1418.66
gcm_aes128          decrypt  1418.97
gcm_aes128          update   7766.31
gcm_aes192          encrypt  1314.03
gcm_aes192          decrypt  1313.17
gcm_aes192          update   7760.23
gcm_aes256          encrypt  1218.75
gcm_aes256          decrypt  1218.64
gcm_aes256          update   7760.52
openssl gcm_aes128  encrypt  2324.70
openssl gcm_aes128  decrypt  2317.19
openssl gcm_aes128  update   6152.77
openssl gcm_aes192  encrypt  2102.99
openssl gcm_aes192  decrypt  2098.98
openssl gcm_aes192  update   6175.62
openssl gcm_aes256  encrypt  1925.85
openssl gcm_aes256  decrypt  1922.49
openssl gcm_aes256  update   6204.55
Christopher M. Riedl (6):
  gcm: Introduce gcm_aes_{de,en}crypt()
  ppc: Fix variable name for --enable-power-altivec
  ppc: Add FAT feature and config option for ISA 3.0
  ppc: Add gcm_aes_encrypt() asm for ISA 3.0 (P9)
  ppc: Add gcm_aes_decrypt() asm for ISA 3.0 (P9)
  ppc: Enable gcm_aes_{de,en}crypt() FAT
 configure.ac                      |  19 +-
 fat-ppc.c                         |  45 ++
 fat-setup.h                       |   6 +
 gcm-internal.h                    |  14 +
 gcm.c                             | 151 ++++++-
 powerpc64/fat/gcm-aes-decrypt.asm |  37 ++
 powerpc64/fat/gcm-aes-encrypt.asm |  37 ++
 powerpc64/p9/gcm-aes-decrypt.asm  | 663 +++++++++++++++++++++++++++++
 powerpc64/p9/gcm-aes-encrypt.asm  | 666 ++++++++++++++++++++++++++++++
 9 files changed, 1630 insertions(+), 8 deletions(-)
 create mode 100644 powerpc64/fat/gcm-aes-decrypt.asm
 create mode 100644 powerpc64/fat/gcm-aes-encrypt.asm
 create mode 100644 powerpc64/p9/gcm-aes-decrypt.asm
 create mode 100644 powerpc64/p9/gcm-aes-encrypt.asm
Currently the AES-GCM crypt and hash parts are performed in two separate functions. Each can be replaced with an arch-specific optimized assembly routine. This makes it difficult to introduce an arch-specific routine implementing the combination of both parts in a single function.
Rework the existing gcm_{en,de}crypt() functions to instead call a new gcm_aes_{en,de}crypt_wrap() function which calls out to a (for now) stub gcm_aes_{en,de}crypt(). This stub can then be overridden either via FAT or statically at build time.
Signed-off-by: Christopher M. Riedl <cmr@linux.ibm.com>
---
 configure.ac |   8 ++-
 gcm.c        | 147 +++++++++++++++++++++++++++++++++++++++++++++++++--
 2 files changed, 149 insertions(+), 6 deletions(-)
diff --git a/configure.ac b/configure.ac index 026ae99d..ba85a313 100644 --- a/configure.ac +++ b/configure.ac @@ -538,7 +538,7 @@ asm_nettle_optional_list="gcm-hash.asm gcm-hash8.asm cpuid.asm \ salsa20-2core.asm salsa20-core-internal-2.asm \ sha1-compress-2.asm sha256-compress-2.asm \ sha3-permute-2.asm sha512-compress-2.asm \ - umac-nh-n-2.asm umac-nh-2.asm" + umac-nh-n-2.asm umac-nh-2.asm gcm-aes-encrypt.asm gcm-aes-decrypt.asm"
asm_hogweed_optional_list="" if test "x$enable_public_key" = "xyes" ; then @@ -674,7 +674,11 @@ AH_VERBATIM([HAVE_NATIVE], #undef HAVE_NATIVE_sha512_compress #undef HAVE_NATIVE_sha3_permute #undef HAVE_NATIVE_umac_nh -#undef HAVE_NATIVE_umac_nh_n]) +#undef HAVE_NATIVE_umac_nh_n +#undef HAVE_NATIVE_gcm_aes_decrypt +#undef HAVE_NATIVE_gcm_aes_encrypt +#undef HAVE_NATIVE_fat_gcm_aes_decrypt +#undef HAVE_NATIVE_fat_gcm_aes_encrypt])
if test "x$enable_pic" = xyes; then LSH_CCPIC diff --git a/gcm.c b/gcm.c index d1f21d3a..6fe25a01 100644 --- a/gcm.c +++ b/gcm.c @@ -423,28 +423,167 @@ gcm_fill(uint8_t *ctr, size_t blocks, union nettle_block16 *buffer) } #endif
+enum gcm_aes_rounds { + NOT_AES = 0, + AES_128 = _AES128_ROUNDS, + AES_192 = _AES192_ROUNDS, + AES_256 = _AES256_ROUNDS +}; + +static enum gcm_aes_rounds +_nettle_gcm_get_aes_rounds(nettle_cipher_func *f) +{ + if (f == (nettle_cipher_func *)nettle_aes128_encrypt || + f == (nettle_cipher_func *)nettle_aes128_decrypt) + { + return AES_128; + } + else if (f == (nettle_cipher_func *)nettle_aes192_encrypt || + f == (nettle_cipher_func *)nettle_aes192_decrypt) + { + return AES_192; + } + else if (f == (nettle_cipher_func *)nettle_aes256_encrypt || + f == (nettle_cipher_func *)nettle_aes256_decrypt) + { + return AES_256; + } + else + { + return NOT_AES; + } +} + +#if !HAVE_NATIVE_gcm_aes_encrypt +# if !HAVE_NATIVE_fat_gcm_aes_encrypt +# define _nettle_gcm_aes_encrypt _nettle_gcm_aes_encrypt_c +static +#endif /* !HAVE_NATIVE_fat_gcm_aes_encrypt */ +int +_nettle_gcm_aes_encrypt_c (const struct gcm_key *key, union nettle_block16 *x, + size_t length, const uint8_t *src, unsigned rounds, + const uint32_t *keys, uint8_t *dst, uint8_t* ctr) +{ + (void)key; + (void)x; + (void)length; + (void)src; + (void)rounds; + (void)keys; + (void)dst; + (void)ctr; + + return -1; /* Not implemented */ +} +#endif /* !HAVE_NATIVE_gcm_aes_encrypt */ + +static int +_nettle_gcm_aes_encrypt_wrap (struct gcm_ctx *ctx, const struct gcm_key *key, + const void *cipher, size_t length, uint8_t *dst, + const uint8_t *src, enum gcm_aes_rounds rounds) +{ + switch (rounds) { + default: + abort(); + case AES_128: + return _nettle_gcm_aes_encrypt(key, &ctx->x, length, src, rounds, + ((struct aes128_ctx*)cipher)->keys, dst, + ctx->ctr.b); + case AES_192: + return _nettle_gcm_aes_encrypt(key, &ctx->x, length, src, rounds, + ((struct aes192_ctx*)cipher)->keys, dst, + ctx->ctr.b); + case AES_256: + return _nettle_gcm_aes_encrypt(key, &ctx->x, length, src, rounds, + ((struct aes256_ctx*)cipher)->keys, dst, + ctx->ctr.b); + } +} + void gcm_encrypt (struct gcm_ctx *ctx, const struct gcm_key *key, const void *cipher, nettle_cipher_func *f, size_t length, uint8_t *dst, const uint8_t *src) { + enum gcm_aes_rounds rounds; assert(ctx->data_size % GCM_BLOCK_SIZE == 0);
- _nettle_ctr_crypt16(cipher, f, gcm_fill, ctx->ctr.b, length, dst, src); - _nettle_gcm_hash(key, &ctx->x, length, dst); + rounds = _nettle_gcm_get_aes_rounds(f); + + if (rounds == NOT_AES || + _nettle_gcm_aes_encrypt_wrap(ctx, key, cipher, length, + dst, src, rounds) == -1) + { + _nettle_ctr_crypt16(cipher, f, gcm_fill, ctx->ctr.b, length, dst, src); + _nettle_gcm_hash(key, &ctx->x, length, dst); + }
ctx->data_size += length; }
+#if !HAVE_NATIVE_gcm_aes_decrypt +# if !HAVE_NATIVE_fat_gcm_aes_decrypt +# define _nettle_gcm_aes_decrypt _nettle_gcm_aes_decrypt_c +static +#endif /* !HAVE_NATIVE_fat_gcm_aes_decrypt */ +int +_nettle_gcm_aes_decrypt_c (const struct gcm_key *key, union nettle_block16 *x, + size_t length, const uint8_t *src, unsigned rounds, + const uint32_t *keys, uint8_t *dst, uint8_t *ctr) +{ + (void)key; + (void)x; + (void)length; + (void)src; + (void)rounds; + (void)keys; + (void)dst; + (void)ctr; + + return -1; /* Not implemented */ +} +#endif /* !HAVE_NATIVE_gcm_aes_decrypt */ + +static int +_nettle_gcm_aes_decrypt_wrap (struct gcm_ctx *ctx, const struct gcm_key *key, + const void *cipher, size_t length, uint8_t *dst, + const uint8_t *src, enum gcm_aes_rounds rounds) +{ + switch (rounds) { + default: + abort(); + case AES_128: + return _nettle_gcm_aes_decrypt(key, &ctx->x, length, src, rounds, + ((struct aes128_ctx*)cipher)->keys, dst, + ctx->ctr.b); + case AES_192: + return _nettle_gcm_aes_decrypt(key, &ctx->x, length, src, rounds, + ((struct aes192_ctx*)cipher)->keys, dst, + ctx->ctr.b); + case AES_256: + return _nettle_gcm_aes_decrypt(key, &ctx->x, length, src, rounds, + ((struct aes256_ctx*)cipher)->keys, dst, + ctx->ctr.b); + } +} + void gcm_decrypt(struct gcm_ctx *ctx, const struct gcm_key *key, const void *cipher, nettle_cipher_func *f, size_t length, uint8_t *dst, const uint8_t *src) { + enum gcm_aes_rounds rounds; assert(ctx->data_size % GCM_BLOCK_SIZE == 0);
- _nettle_gcm_hash(key, &ctx->x, length, src); - _nettle_ctr_crypt16(cipher, f, gcm_fill, ctx->ctr.b, length, dst, src); + rounds = _nettle_gcm_get_aes_rounds(f); + + if (rounds == NOT_AES || + _nettle_gcm_aes_decrypt_wrap(ctx, key, cipher, length, + dst, src, rounds) == -1) + { + _nettle_gcm_hash(key, &ctx->x, length, src); + _nettle_ctr_crypt16(cipher, f, gcm_fill, ctx->ctr.b, length, dst, src); + }
ctx->data_size += length; }
The AC_ARG_ENABLE(...) macro for --enable-power-altivec sets enable_altivec=no as the default when the command-line option is not given to configure. However, the variable actually checked is $enable_power_altivec, not $enable_altivec. This doesn't matter in practice since $enable_power_altivec remains unset and the check works as expected when the command-line option is absent. Fix it anyway for consistency with the other arguments.
Signed-off-by: Christopher M. Riedl <cmr@linux.ibm.com>
---
 configure.ac | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/configure.ac b/configure.ac index ba85a313..253735a7 100644 --- a/configure.ac +++ b/configure.ac @@ -99,7 +99,7 @@ AC_ARG_ENABLE(power-crypto-ext,
AC_ARG_ENABLE(power-altivec, AC_HELP_STRING([--enable-power-altivec], [Enable POWER altivec and vsx extensions. (default=no)]),, - [enable_altivec=no]) + [enable_power_altivec=no])
AC_ARG_ENABLE(mini-gmp, AC_HELP_STRING([--enable-mini-gmp], [Enable mini-gmp, used instead of libgmp.]),,
Signed-off-by: Christopher M. Riedl <cmr@linux.ibm.com>
---
 configure.ac |  9 ++++++++-
 fat-ppc.c    | 12 ++++++++++++
 2 files changed, 20 insertions(+), 1 deletion(-)
diff --git a/configure.ac b/configure.ac index 253735a7..a0df0cc8 100644 --- a/configure.ac +++ b/configure.ac @@ -101,6 +101,10 @@ AC_ARG_ENABLE(power-altivec, AC_HELP_STRING([--enable-power-altivec], [Enable POWER altivec and vsx extensions. (default=no)]),, [enable_power_altivec=no])
+AC_ARG_ENABLE(power-isa-30, + AC_HELP_STRING([--enable-power-isa-30], [Enable POWER ISA 3.0 (POWER9) features. (default=no)]),, + [enable_power_isa_30=no]) + AC_ARG_ENABLE(mini-gmp, AC_HELP_STRING([--enable-mini-gmp], [Enable mini-gmp, used instead of libgmp.]),, [enable_mini_gmp=no]) @@ -501,8 +505,11 @@ if test "x$enable_assembler" = xyes ; then if test "x$enable_fat" = xyes ; then asm_path="powerpc64/fat $asm_path" OPT_NETTLE_SOURCES="fat-ppc.c $OPT_NETTLE_SOURCES" - FAT_TEST_LIST="none crypto_ext altivec" + FAT_TEST_LIST="none crypto_ext altivec isa_30" else + if test "$enable_power_isa_30" = yes ; then + asm_path="powerpc64/p9 $asm_path" + fi if test "$enable_power_crypto_ext" = yes ; then asm_path="powerpc64/p8 $asm_path" fi diff --git a/fat-ppc.c b/fat-ppc.c index 3adbb88c..67ef46ab 100644 --- a/fat-ppc.c +++ b/fat-ppc.c @@ -78,11 +78,15 @@ #ifndef PPC_FEATURE2_VEC_CRYPTO #define PPC_FEATURE2_VEC_CRYPTO 0x02000000 #endif +#ifndef PPC_FEATURE2_ARCH_3_00 +#define PPC_FEATURE2_ARCH_3_00 0x00800000 +#endif
struct ppc_features { int have_crypto_ext; int have_altivec; + int have_isa_30; };
#define MATCH(s, slen, literal, llen) \ @@ -94,6 +98,7 @@ get_ppc_features (struct ppc_features *features) const char *s; features->have_crypto_ext = 0; features->have_altivec = 0; + features->have_isa_30 = 0;
s = secure_getenv (ENV_OVERRIDE); if (s) @@ -106,6 +111,8 @@ get_ppc_features (struct ppc_features *features) features->have_crypto_ext = 1; else if (MATCH(s, length, "altivec", 7)) features->have_altivec = 1; + else if (MATCH(s, length, "isa_30", 6)) + features->have_isa_30 = 1; if (!sep) break; s = sep + 1; @@ -116,6 +123,8 @@ get_ppc_features (struct ppc_features *features) features->have_crypto_ext = _system_configuration.implementation >= 0x10000u; features->have_altivec = _system_configuration.vmx_version > 1; + /* TODO: AIX magic bits to decode ISA 3.0 / POWER9 support */ + features->have_isa_30 = 0; #else unsigned long hwcap = 0; unsigned long hwcap2 = 0; @@ -141,6 +150,9 @@ get_ppc_features (struct ppc_features *features) features->have_altivec = ((hwcap & (PPC_FEATURE_HAS_ALTIVEC | PPC_FEATURE_HAS_VSX)) == (PPC_FEATURE_HAS_ALTIVEC | PPC_FEATURE_HAS_VSX)); + + features->have_isa_30 + = ((hwcap2 & PPC_FEATURE2_ARCH_3_00) == PPC_FEATURE2_ARCH_3_00); #endif } }
This implementation is based on the existing, per-algorithm optimized powerpc64/p8/aes-encrypt-internal.asm and powerpc64/p8/gcm-hash.asm implementations by Niels Möller and Mamone Tarsha.
Significant changes:
- Combine AES + GCM into a single function call which does up to 8x unrolled AES followed by 2x 4x-unrolled GCM back-to-back.
- Handle the IV|CTR increment in assembly and avoid the somewhat costly gcm_fill() call to precalculate the counter values.
- Use ISA 3.0 (P9) lxvb16x/stxvb16x to load/store unaligned VSX registers and avoid permutes on LE machines.
- Use ISA 3.0 (P9) lxvll/stxvll to load/store left-aligned, zero-padded partial (<16B) blocks.
- Use ISA 3.0 (P9) lxv/stxv to load/store the non-volatile vector registers from/to the stack redzone and avoid using a GPR as an index.
Signed-off-by: Christopher M. Riedl <cmr@linux.ibm.com>
---
 gcm.c                            |   4 +
 powerpc64/p9/gcm-aes-encrypt.asm | 666 +++++++++++++++++++++++++++++++
 2 files changed, 670 insertions(+)
 create mode 100644 powerpc64/p9/gcm-aes-encrypt.asm
diff --git a/gcm.c b/gcm.c index 6fe25a01..39e7a7c7 100644 --- a/gcm.c +++ b/gcm.c @@ -61,8 +61,12 @@ GCM_TABLE_BITS == 8 layout */ #undef HAVE_NATIVE_gcm_hash #undef HAVE_NATIVE_gcm_init_key +#undef HAVE_NATIVE_gcm_aes_decrypt +#undef HAVE_NATIVE_gcm_aes_encrypt #undef HAVE_NATIVE_fat_gcm_hash #undef HAVE_NATIVE_fat_gcm_init_key +#undef HAVE_NATIVE_fat_gcm_aes_decrypt +#undef HAVE_NATIVE_fat_gcm_aes_encrypt #endif
#if !HAVE_NATIVE_gcm_hash diff --git a/powerpc64/p9/gcm-aes-encrypt.asm b/powerpc64/p9/gcm-aes-encrypt.asm new file mode 100644 index 00000000..43f577fa --- /dev/null +++ b/powerpc64/p9/gcm-aes-encrypt.asm @@ -0,0 +1,666 @@ +C powerpc64/p9/gcm-aes-encrypt.asm + +ifelse(` + Copyright (C) 2020 Niels Möller and Mamone Tarsha + Copyright (C) 2021 Christopher M. Riedl + This file is part of GNU Nettle. + + GNU Nettle is free software: you can redistribute it and/or + modify it under the terms of either: + + * the GNU Lesser General Public License as published by the Free + Software Foundation; either version 3 of the License, or (at your + option) any later version. + + or + + * the GNU General Public License as published by the Free + Software Foundation; either version 2 of the License, or (at your + option) any later version. + + or both in parallel, as here. + + GNU Nettle is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + General Public License for more details. + + You should have received copies of the GNU General Public License and + the GNU Lesser General Public License along with this program. If + not, see http://www.gnu.org/licenses/. +') + + +.file "gcm-aes-encrypt.asm" + +.text + +C void gcm_aes_encrypt(const struct gcm_key *key, union gcm_block *x, +C size_t length, const uint8_t *src, +C unsigned rounds, const uint32_t *keys, +C uint8_t *dst, uint32_t *ctr) + +C Register usage: +define(`SP', `r1') +define(`TOCP', `r2') + +C Parameters: +define(`TABLE', `r3') +define(`X', `r4') C Output GCM/Ghash tag +define(`LENGTH',`r5') +define(`SRC', `r6') C Plaintext input +define(`ROUNDS',`r7') +define(`KEYS', `r8') +define(`DST', `r9') +define(`PCTR', `r10') C Pointer to 12B IV and starting 4B ctr + +C GCM/Ghash: +define(`POLY_L',`v0') +define(`D', `v1') +define(`H1M', `v6') +define(`H1L', `v7') +define(`H2M', `v8') +define(`H2L', `v9') +define(`H3M', `v10') +define(`H3L', `v11') +define(`H4M', `v12') +define(`H4L', `v13') +define(`R', `v14') +define(`F', `v15') +define(`R2', `v16') +define(`F2', `v17') +define(`T', `v18') +define(`R3', `v20') +define(`F3', `v21') +define(`R4', `v22') +define(`F4', `v23') + +C AES: +define(`K', `v25') +define(`S0', `v2') +define(`S1', `v3') +define(`S2', `v4') +define(`S3', `v5') +define(`S4', `v26') +define(`S5', `v27') +define(`S6', `v28') +define(`S7', `v29') +define(`CTR', `v30') +define(`INC', `v31') +define(`C0', `v14') +define(`C1', `v15') +define(`C2', `v16') +define(`C3', `v17') +define(`C4', `v20') +define(`C5', `v21') +define(`C6', `v22') +define(`C7', `v23') + +define(`LCNT', `r14') +define(`ZERO', `v16') +define(`POLY', `v24') +C misc: r15,r16,r17 + +define(`FUNC_ALIGN', `5') +PROLOGUE(_nettle_gcm_aes_encrypt) + + vxor ZERO,ZERO,ZERO + subi ROUNDS,ROUNDS,1 C Last AES round uses vcipherlast + + C Store non-volatiles on the 288B stack redzone + std r14,-8*1(SP) + std r15,-8*2(SP) + std r16,-8*3(SP) + std r17,-8*4(SP) + stxv VSR(v20),-16*3(SP) + stxv VSR(v21),-16*4(SP) + stxv VSR(v22),-16*5(SP) + stxv VSR(v23),-16*6(SP) + stxv VSR(v24),-16*7(SP) + stxv VSR(v25),-16*8(SP) + stxv VSR(v26),-16*9(SP) + stxv VSR(v27),-16*10(SP) + stxv VSR(v28),-16*11(SP) + stxv VSR(v29),-16*12(SP) + stxv VSR(v30),-16*13(SP) + stxv VSR(v31),-16*14(SP) + + DATA_LOAD_VEC(POLY,.polynomial,r14) + DATA_LOAD_VEC(INC,.increment,r14) + + lxvb16x VSR(CTR),0,PCTR C Load 'ctr' pointer + xxmrghd VSR(POLY_L),VSR(ZERO),VSR(POLY) + lxvb16x VSR(D),0,X C 
load 'X' pointer + +L8x: + C --- process 8 blocks '128-bit each' per one loop --- + srdi. LCNT,LENGTH,7 C 8-blocks loop count 'LENGTH / (8 * 16)' + beq L4x + + C load table elements + li r15,4*16 + li r16,5*16 + li r17,6*16 + lxvd2x VSR(H3M),r15,TABLE + lxvd2x VSR(H3L),r16,TABLE + lxvd2x VSR(H4M),r17,TABLE + li r16,7*16 + lxvd2x VSR(H4L),r16,TABLE + li r15,1*16 + li r16,2*16 + li r17,3*16 + lxvd2x VSR(H1M),0,TABLE + lxvd2x VSR(H1L),r15,TABLE + lxvd2x VSR(H2M),r16,TABLE + lxvd2x VSR(H2L),r17,TABLE + +L8x_loop: +L8x_aes: + lxvb16x VSR(K),0,KEYS + + C Increment ctr + vmr S0,CTR + vadduwm CTR,CTR,INC + vxor S0,S0,K + vmr S1,CTR + vadduwm CTR,CTR,INC + vxor S1,S1,K + vmr S2,CTR + vadduwm CTR,CTR,INC + vxor S2,S2,K + vmr S3,CTR + vadduwm CTR,CTR,INC + vxor S3,S3,K + + mtctr ROUNDS + li r15,1*16 + + vmr S4,CTR + vadduwm CTR,CTR,INC + vxor S4,S4,K + vmr S5,CTR + vadduwm CTR,CTR,INC + vxor S5,S5,K + vmr S6,CTR + vadduwm CTR,CTR,INC + vxor S6,S6,K + vmr S7,CTR + vadduwm CTR,CTR,INC + vxor S7,S7,K + +.align 5 +L8x_aes_rnd_loop: + lxvb16x VSR(K),r15,KEYS + addi r15,r15,1*16 + vcipher S0,S0,K + vcipher S1,S1,K + vcipher S2,S2,K + vcipher S3,S3,K + vcipher S4,S4,K + vcipher S5,S5,K + vcipher S6,S6,K + vcipher S7,S7,K + bdnz L8x_aes_rnd_loop + + lxvb16x VSR(K),r15,KEYS + vcipherlast S0,S0,K + vcipherlast S1,S1,K + vcipherlast S2,S2,K + vcipherlast S3,S3,K + vcipherlast S4,S4,K + vcipherlast S5,S5,K + vcipherlast S6,S6,K + vcipherlast S7,S7,K + + C AES(counter) XOR plaintext = ciphertext + li r15,1*16 + li r16,2*16 + li r17,3*16 + lxvb16x VSR(C0),0,SRC + lxvb16x VSR(C1),r15,SRC + lxvb16x VSR(C2),r16,SRC + lxvb16x VSR(C3),r17,SRC + vxor S0,C0,S0 + vxor S1,C1,S1 + vxor S2,C2,S2 + vxor S3,C3,S3 + + addi SRC,SRC,4*16 + lxvb16x VSR(C4),0,SRC + lxvb16x VSR(C5),r15,SRC + lxvb16x VSR(C6),r16,SRC + lxvb16x VSR(C7),r17,SRC + vxor S4,C4,S4 + vxor S5,C5,S5 + vxor S6,C6,S6 + vxor S7,C7,S7 + + C Store ciphertext + stxvb16x VSR(S0),0,DST + stxvb16x VSR(S1),r15,DST + stxvb16x VSR(S2),r16,DST + stxvb16x VSR(S3),r17,DST + addi DST,DST,4*16 + stxvb16x VSR(S4),0,DST + stxvb16x VSR(S5),r15,DST + stxvb16x VSR(S6),r16,DST + stxvb16x VSR(S7),r17,DST + + addi SRC,SRC,4*16 + addi DST,DST,4*16 + +L8x_gcm: + C previous digest combining + vxor S0,S0,D + + C polynomial multiplication + vpmsumd F2,H3L,S1 + vpmsumd R2,H3M,S1 + vpmsumd F3,H2L,S2 + vpmsumd R3,H2M,S2 + vpmsumd F4,H1L,S3 + vpmsumd R4,H1M,S3 + vpmsumd F,H4L,S0 + vpmsumd R,H4M,S0 + + C deferred recombination of partial products + vxor F3,F3,F4 + vxor R3,R3,R4 + vxor F,F,F2 + vxor R,R,R2 + vxor F,F,F3 + vxor R,R,R3 + + C reduction + vpmsumd T,F,POLY_L + xxswapd VSR(D),VSR(F) + vxor R,R,T + vxor D,R,D + + C previous digest combining + vxor S4,S4,D + + C polynomial multiplication + vpmsumd F2,H3L,S5 + vpmsumd R2,H3M,S5 + vpmsumd F3,H2L,S6 + vpmsumd R3,H2M,S6 + vpmsumd F4,H1L,S7 + vpmsumd R4,H1M,S7 + vpmsumd F,H4L,S4 + vpmsumd R,H4M,S4 + + C deferred recombination of partial products + vxor F3,F3,F4 + vxor R3,R3,R4 + vxor F,F,F2 + vxor R,R,R2 + vxor F,F,F3 + vxor R,R,R3 + + C reduction + vpmsumd T,F,POLY_L + xxswapd VSR(D),VSR(F) + vxor R,R,T + vxor D,R,D + + C Decrement 8x block count and check if done + subi LCNT,LCNT,1 + cmpldi LCNT,0 + bne L8x_loop + clrldi LENGTH,LENGTH,57 C 'set the high-order 57 bits to zeros' + +L4x: + C --- process 4 blocks --- + srdi. 
LCNT,LENGTH,6 C 4-blocks loop count 'LENGTH / (4 * 16)' + beq L2x + + C load table elements + li r15,4*16 + li r16,5*16 + li r17,6*16 + lxvd2x VSR(H3M),r15,TABLE + lxvd2x VSR(H3L),r16,TABLE + lxvd2x VSR(H4M),r17,TABLE + li r16,7*16 + lxvd2x VSR(H4L),r16,TABLE + li r15,1*16 + li r16,2*16 + li r17,3*16 + lxvd2x VSR(H1M),0,TABLE + lxvd2x VSR(H1L),r15,TABLE + lxvd2x VSR(H2M),r16,TABLE + lxvd2x VSR(H2L),r17,TABLE + +L4x_aes: + lxvb16x VSR(K),0,KEYS + + C Increment ctr + vmr S0,CTR + vadduwm CTR,CTR,INC + vmr S1,CTR + vadduwm CTR,CTR,INC + vmr S2,CTR + vadduwm CTR,CTR,INC + vmr S3,CTR + vadduwm CTR,CTR,INC + + vxor S0,S0,K + vxor S1,S1,K + vxor S2,S2,K + vxor S3,S3,K + + mtctr ROUNDS + li r15,1*16 + +.align 5 +L4x_aes_rnd_loop: + lxvb16x VSR(K),r15,KEYS + vcipher S0,S0,K + vcipher S1,S1,K + vcipher S2,S2,K + vcipher S3,S3,K + addi r15,r15,1*16 + bdnz L4x_aes_rnd_loop + + lxvb16x VSR(K),r15,KEYS + vcipherlast S0,S0,K + vcipherlast S1,S1,K + vcipherlast S2,S2,K + vcipherlast S3,S3,K + + C AES(counter) XOR plaintext = ciphertext + li r15,1*16 + li r16,2*16 + li r17,3*16 + lxvb16x VSR(C0),0,SRC + lxvb16x VSR(C1),r15,SRC + lxvb16x VSR(C2),r16,SRC + lxvb16x VSR(C3),r17,SRC + vxor S0,C0,S0 + vxor S1,C1,S1 + vxor S2,C2,S2 + vxor S3,C3,S3 + + C Store ciphertext in DST + stxvb16x VSR(S0),0,DST + stxvb16x VSR(S1),r15,DST + stxvb16x VSR(S2),r16,DST + stxvb16x VSR(S3),r17,DST + +L4x_gcm: + C previous digest combining + vxor S0,S0,D + + C polynomial multiplication + vpmsumd F2,H3L,S1 + vpmsumd R2,H3M,S1 + vpmsumd F3,H2L,S2 + vpmsumd R3,H2M,S2 + vpmsumd F4,H1L,S3 + vpmsumd R4,H1M,S3 + vpmsumd F,H4L,S0 + vpmsumd R,H4M,S0 + + C deferred recombination of partial products + vxor F3,F3,F4 + vxor R3,R3,R4 + vxor F,F,F2 + vxor R,R,R2 + vxor F,F,F3 + vxor R,R,R3 + + C reduction + vpmsumd T,F,POLY_L + xxswapd VSR(D),VSR(F) + vxor R,R,T + vxor D,R,D + + addi DST,DST,4*16 + addi SRC,SRC,4*16 + clrldi LENGTH,LENGTH,58 C 'set the high-order 58 bits to zeros' + +L2x: + C --- process 2 blocks --- + srdi. r14,LENGTH,5 C 'LENGTH / (2 * 16)' + beq L1x + + C load table elements + li r15,1*16 + li r16,2*16 + li r17,3*16 + lxvd2x VSR(H1M),0,TABLE + lxvd2x VSR(H1L),r15,TABLE + lxvd2x VSR(H2M),r16,TABLE + lxvd2x VSR(H2L),r17,TABLE + +L2x_aes: + lxvb16x VSR(K),0,KEYS + + C Increment ctr + vmr S0,CTR + vadduwm CTR,CTR,INC + vmr S1,CTR + vadduwm CTR,CTR,INC + + vxor S0,S0,K + vxor S1,S1,K + + mtctr ROUNDS + li r15,1*16 + +.align 5 +L2x_aes_rnd_loop: + lxvb16x VSR(K),r15,KEYS + vcipher S0,S0,K + vcipher S1,S1,K + addi r15,r15,1*16 + bdnz L2x_aes_rnd_loop + + lxvb16x VSR(K),r15,KEYS + vcipherlast S0,S0,K + vcipherlast S1,S1,K + + C AES(counter) XOR plaintext = ciphertext + li r15,1*16 + lxvb16x VSR(C0),0,SRC + lxvb16x VSR(C1),r15,SRC + vxor S0,C0,S0 + vxor S1,C1,S1 + + C Store ciphertext in DST + stxvb16x VSR(S0),0,DST + stxvb16x VSR(S1),r15,DST + +L2x_gcm: + C previous digest combining + vxor S0,S0,D + + C polynomial multiplication + vpmsumd F2,H1L,S1 + vpmsumd R2,H1M,S1 + vpmsumd F,H2L,S0 + vpmsumd R,H2M,S0 + + C deferred recombination of partial products + vxor F,F,F2 + vxor R,R,R2 + + C reduction + vpmsumd T,F,POLY_L + xxswapd VSR(D),VSR(F) + vxor R,R,T + vxor D,R,D + + addi DST,DST,2*16 + addi SRC,SRC,2*16 + clrldi LENGTH,LENGTH,59 C 'set the high-order 59 bits to zeros' + +L1x: + C --- process 1 block --- + srdi. 
r14,LENGTH,4 C 'LENGTH / (1 * 16)' + beq Lpartial + + C load table elements + li r15,1*16 + lxvd2x VSR(H1M),0,TABLE + lxvd2x VSR(H1L),r15,TABLE + +L1x_aes: + lxvb16x VSR(K),0,KEYS + + C Increment ctr + vmr S0,CTR + vadduwm CTR,CTR,INC + + vxor S0,S0,K + + mtctr ROUNDS + li r15,1*16 + +.align 5 +L1x_aes_rnd_loop: + lxvb16x VSR(K),r15,KEYS + vcipher S0,S0,K + addi r15,r15,1*16 + bdnz L1x_aes_rnd_loop + + lxvb16x VSR(K),r15,KEYS + vcipherlast S0,S0,K + + C AES(counter) XOR plaintext = ciphertext + lxvb16x VSR(C0),0,SRC + vxor S0,C0,S0 + + C Store ciphertext in DST + stxvb16x VSR(S0),0,DST + +L1x_gcm: + C previous digest combining + vxor S0,S0,D + + C polynomial multiplication + vpmsumd F,H1L,S0 + vpmsumd R,H1M,S0 + + C reduction + vpmsumd T,F,POLY_L + xxswapd VSR(D),VSR(F) + vxor R,R,T + vxor D,R,D + + addi DST,DST,1*16 + addi SRC,SRC,1*16 + clrldi LENGTH,LENGTH,60 C 'set the high-order 60 bits to zeros' + +Lpartial: + C --- process partial block --- + cmpldi LENGTH,0 + beq Ldone + + C load table elements + li r15,1*16 + lxvd2x VSR(H1M),0,TABLE + lxvd2x VSR(H1L),r15,TABLE + +Lpartial_aes: + lxvb16x VSR(K),0,KEYS + + C Increment ctr + vmr S0,CTR + vadduwm CTR,CTR,INC + + vxor S0,S0,K + + mtctr ROUNDS + li r15,1*16 + +.align 5 +Lpartial_aes_rnd_loop: + lxvb16x VSR(K),r15,KEYS + vcipher S0,S0,K + addi r15,r15,1*16 + bdnz Lpartial_aes_rnd_loop + + lxvb16x VSR(K),r15,KEYS + vcipherlast S0,S0,K + + C Load the partial block left-aligned and zero-padded + sldi LENGTH,LENGTH,56 + lxvll VSR(C0),SRC,LENGTH + + C AES(counter) XOR plaintext = ciphertext + vxor S0,C0,S0 + + C Store ciphertext in DST + stxvll VSR(S0),DST,LENGTH + + C TODO: Lazy, reload the value to zero-out the padding bits again + lxvll VSR(S0),DST,LENGTH + +Lpartial_gcm: + C previous digest combining + vxor S0,S0,D + + C polynomial multiplication + vpmsumd F,H1L,S0 + vpmsumd R,H1M,S0 + + C reduction + vpmsumd T,F,POLY_L + xxswapd VSR(D),VSR(F) + vxor R,R,T + vxor D,R,D + +Ldone: + stxvb16x VSR(D),0,X C store digest 'D' + stxvb16x VSR(CTR),0,PCTR C store updated 'ctr' + + C Restore non-volatiles from the 288B stack redzone + ld r14,-8*1(SP) + ld r15,-8*2(SP) + ld r16,-8*3(SP) + ld r17,-8*4(SP) + lxv VSR(v20),-16*3(SP) + lxv VSR(v21),-16*4(SP) + lxv VSR(v22),-16*5(SP) + lxv VSR(v23),-16*6(SP) + lxv VSR(v24),-16*7(SP) + lxv VSR(v25),-16*8(SP) + lxv VSR(v26),-16*9(SP) + lxv VSR(v27),-16*10(SP) + lxv VSR(v28),-16*11(SP) + lxv VSR(v29),-16*12(SP) + lxv VSR(v30),-16*13(SP) + lxv VSR(v31),-16*14(SP) + + li r3,0 C return 0 for success + blr + +EPILOGUE(_nettle_gcm_aes_encrypt) + +.data +.align 4 +C 0xC2000000000000000000000000000001 +.polynomial: +IF_BE(` + .byte 0xC2 + .rept 14 + .byte 0x00 + .endr + .byte 0x01 +',` + .byte 0x01 + .rept 14 + .byte 0x00 + .endr + .byte 0xC2 +') +.align 4 +.increment: +IF_LE(` + .byte 1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 +') +IF_BE(` + .byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1 +')
This implementation is based on the existing, per-algorithm optimized powerpc64/p8/aes-encrypt-internal.asm and powerpc64/p8/gcm-hash.asm implementations by Niels Möller and Mamone Tarsha. See the previous gcm_aes_encrypt() commit for details about major changes.
Signed-off-by: Christopher M. Riedl <cmr@linux.ibm.com>
---
 powerpc64/p9/gcm-aes-decrypt.asm | 663 +++++++++++++++++++++++++++++++
 1 file changed, 663 insertions(+)
 create mode 100644 powerpc64/p9/gcm-aes-decrypt.asm
diff --git a/powerpc64/p9/gcm-aes-decrypt.asm b/powerpc64/p9/gcm-aes-decrypt.asm new file mode 100644 index 00000000..4316a487 --- /dev/null +++ b/powerpc64/p9/gcm-aes-decrypt.asm @@ -0,0 +1,663 @@ +C powerpc64/p9/gcm-aes-decrypt.asm + +ifelse(` + Copyright (C) 2020 Niels Möller and Mamone Tarsha + Copyright (C) 2021 Christopher M. Riedl + This file is part of GNU Nettle. + + GNU Nettle is free software: you can redistribute it and/or + modify it under the terms of either: + + * the GNU Lesser General Public License as published by the Free + Software Foundation; either version 3 of the License, or (at your + option) any later version. + + or + + * the GNU General Public License as published by the Free + Software Foundation; either version 2 of the License, or (at your + option) any later version. + + or both in parallel, as here. + + GNU Nettle is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + General Public License for more details. + + You should have received copies of the GNU General Public License and + the GNU Lesser General Public License along with this program. If + not, see http://www.gnu.org/licenses/. +') + + +.file "gcm-aes-decrypt.asm" + +.text + +C void gcm_aes_decrypt(const struct gcm_key *key, union gcm_block *x, +C size_t length, const uint8_t *src, +C unsigned rounds, const uint32_t *keys, +C uint8_t *dst, uint32_t *ctr) + +C Register usage: +define(`SP', `r1') +define(`TOCP', `r2') + +C Parameters: +define(`TABLE', `r3') +define(`X', `r4') C Output GCM/Ghash tag +define(`LENGTH',`r5') +define(`SRC', `r6') C Ciphertext input +define(`ROUNDS',`r7') +define(`KEYS', `r8') +define(`DST', `r9') +define(`PCTR', `r10') C Pointer to 12B IV and starting 4B ctr + +C GCM/Ghash: +define(`POLY_L',`v0') +define(`D', `v1') +define(`H1M', `v6') +define(`H1L', `v7') +define(`H2M', `v8') +define(`H2L', `v9') +define(`H3M', `v10') +define(`H3L', `v11') +define(`H4M', `v12') +define(`H4L', `v13') +define(`R', `v14') +define(`F', `v15') +define(`R2', `v16') +define(`F2', `v17') +define(`T', `v18') +define(`R3', `v20') +define(`F3', `v21') +define(`R4', `v22') +define(`F4', `v23') + +C AES: +define(`K', `v25') +define(`S0', `v2') +define(`S1', `v3') +define(`S2', `v4') +define(`S3', `v5') +define(`S4', `v26') +define(`S5', `v27') +define(`S6', `v28') +define(`S7', `v29') +define(`CTR', `v30') +define(`INC', `v31') +define(`C0', `v14') +define(`C1', `v15') +define(`C2', `v16') +define(`C3', `v17') +define(`C4', `v20') +define(`C5', `v21') +define(`C6', `v22') +define(`C7', `v23') + +define(`LCNT', `r14') +define(`ZERO', `v16') +define(`POLY', `v24') +C misc: r15,r16,r17 + +define(`FUNC_ALIGN', `5') +PROLOGUE(_nettle_gcm_aes_decrypt) + + vxor ZERO,ZERO,ZERO + subi ROUNDS,ROUNDS,1 C Last AES round uses vcipherlast + + C Store non-volatiles on the 288B stack redzone + std r14,-8*1(SP) + std r15,-8*2(SP) + std r16,-8*3(SP) + std r17,-8*4(SP) + stxv VSR(v20),-16*3(SP) + stxv VSR(v21),-16*4(SP) + stxv VSR(v22),-16*5(SP) + stxv VSR(v23),-16*6(SP) + stxv VSR(v24),-16*7(SP) + stxv VSR(v25),-16*8(SP) + stxv VSR(v26),-16*9(SP) + stxv VSR(v27),-16*10(SP) + stxv VSR(v28),-16*11(SP) + stxv VSR(v29),-16*12(SP) + stxv VSR(v30),-16*13(SP) + stxv VSR(v31),-16*14(SP) + + DATA_LOAD_VEC(POLY,.polynomial,r14) + DATA_LOAD_VEC(INC,.increment,r14) + + lxvb16x VSR(CTR),0,PCTR C Load 'ctr' pointer + xxmrghd VSR(POLY_L),VSR(ZERO),VSR(POLY) + lxvb16x VSR(D),0,X C load 'X' pointer + +L8x: 
+ C --- process 8 blocks '128-bit each' per one loop --- + srdi. LCNT,LENGTH,7 C 8-blocks loop count 'LENGTH / (8 * 16)' + beq L4x + + C load table elements + li r15,4*16 + li r16,5*16 + li r17,6*16 + lxvd2x VSR(H3M),r15,TABLE + lxvd2x VSR(H3L),r16,TABLE + lxvd2x VSR(H4M),r17,TABLE + li r16,7*16 + lxvd2x VSR(H4L),r16,TABLE + li r15,1*16 + li r16,2*16 + li r17,3*16 + lxvd2x VSR(H1M),0,TABLE + lxvd2x VSR(H1L),r15,TABLE + lxvd2x VSR(H2M),r16,TABLE + lxvd2x VSR(H2L),r17,TABLE + +L8x_loop: +L8x_aes: + lxvb16x VSR(K),0,KEYS + + C Increment ctr + vmr C0,CTR + vadduwm CTR,CTR,INC + vxor C0,C0,K + vmr C1,CTR + vadduwm CTR,CTR,INC + vxor C1,C1,K + vmr C2,CTR + vadduwm CTR,CTR,INC + vxor C2,C2,K + vmr C3,CTR + vadduwm CTR,CTR,INC + vxor C3,C3,K + + mtctr ROUNDS + li r15,1*16 + + vmr C4,CTR + vadduwm CTR,CTR,INC + vxor C4,C4,K + vmr C5,CTR + vadduwm CTR,CTR,INC + vxor C5,C5,K + vmr C6,CTR + vadduwm CTR,CTR,INC + vxor C6,C6,K + vmr C7,CTR + vadduwm CTR,CTR,INC + vxor C7,C7,K + +.align 5 +L8x_aes_rnd_loop: + lxvb16x VSR(K),r15,KEYS + addi r15,r15,1*16 + vcipher C0,C0,K + vcipher C1,C1,K + vcipher C2,C2,K + vcipher C3,C3,K + vcipher C4,C4,K + vcipher C5,C5,K + vcipher C6,C6,K + vcipher C7,C7,K + bdnz L8x_aes_rnd_loop + + lxvb16x VSR(K),r15,KEYS + vcipherlast C0,C0,K + vcipherlast C1,C1,K + vcipherlast C2,C2,K + vcipherlast C3,C3,K + vcipherlast C4,C4,K + vcipherlast C5,C5,K + vcipherlast C6,C6,K + vcipherlast C7,C7,K + + C AES(counter) XOR ciphertext = plaintext + li r15,1*16 + li r16,2*16 + li r17,3*16 + lxvb16x VSR(S0),0,SRC + lxvb16x VSR(S1),r15,SRC + lxvb16x VSR(S2),r16,SRC + lxvb16x VSR(S3),r17,SRC + vxor C0,C0,S0 + vxor C1,C1,S1 + vxor C2,C2,S2 + vxor C3,C3,S3 + + addi SRC,SRC,4*16 + lxvb16x VSR(S4),0,SRC + lxvb16x VSR(S5),r15,SRC + lxvb16x VSR(S6),r16,SRC + lxvb16x VSR(S7),r17,SRC + vxor C4,C4,S4 + vxor C5,C5,S5 + vxor C6,C6,S6 + vxor C7,C7,S7 + + C Store plaintext + stxvb16x VSR(C0),0,DST + stxvb16x VSR(C1),r15,DST + stxvb16x VSR(C2),r16,DST + stxvb16x VSR(C3),r17,DST + addi DST,DST,4*16 + stxvb16x VSR(C4),0,DST + stxvb16x VSR(C5),r15,DST + stxvb16x VSR(C6),r16,DST + stxvb16x VSR(C7),r17,DST + + addi SRC,SRC,4*16 + addi DST,DST,4*16 + +L8x_gcm: + C previous digest combining + vxor S0,S0,D + + C polynomial multiplication + vpmsumd F2,H3L,S1 + vpmsumd R2,H3M,S1 + vpmsumd F3,H2L,S2 + vpmsumd R3,H2M,S2 + vpmsumd F4,H1L,S3 + vpmsumd R4,H1M,S3 + vpmsumd F,H4L,S0 + vpmsumd R,H4M,S0 + + C deferred recombination of partial products + vxor F3,F3,F4 + vxor R3,R3,R4 + vxor F,F,F2 + vxor R,R,R2 + vxor F,F,F3 + vxor R,R,R3 + + C reduction + vpmsumd T,F,POLY_L + xxswapd VSR(D),VSR(F) + vxor R,R,T + vxor D,R,D + + C previous digest combining + vxor S4,S4,D + + C polynomial multiplication + vpmsumd F2,H3L,S5 + vpmsumd R2,H3M,S5 + vpmsumd F3,H2L,S6 + vpmsumd R3,H2M,S6 + vpmsumd F4,H1L,S7 + vpmsumd R4,H1M,S7 + vpmsumd F,H4L,S4 + vpmsumd R,H4M,S4 + + C deferred recombination of partial products + vxor F3,F3,F4 + vxor R3,R3,R4 + vxor F,F,F2 + vxor R,R,R2 + vxor F,F,F3 + vxor R,R,R3 + + C reduction + vpmsumd T,F,POLY_L + xxswapd VSR(D),VSR(F) + vxor R,R,T + vxor D,R,D + + C Decrement 8x block count and check if done + subi LCNT,LCNT,1 + cmpldi LCNT,0 + bne L8x_loop + clrldi LENGTH,LENGTH,57 C 'set the high-order 57 bits to zeros' + +L4x: + C --- process 4 blocks --- + srdi. 
LCNT,LENGTH,6 C 4-blocks loop count 'LENGTH / (4 * 16)' + beq L2x + + C load table elements + li r15,4*16 + li r16,5*16 + li r17,6*16 + lxvd2x VSR(H3M),r15,TABLE + lxvd2x VSR(H3L),r16,TABLE + lxvd2x VSR(H4M),r17,TABLE + li r16,7*16 + lxvd2x VSR(H4L),r16,TABLE + li r15,1*16 + li r16,2*16 + li r17,3*16 + lxvd2x VSR(H1M),0,TABLE + lxvd2x VSR(H1L),r15,TABLE + lxvd2x VSR(H2M),r16,TABLE + lxvd2x VSR(H2L),r17,TABLE + +L4x_aes: + lxvb16x VSR(K),0,KEYS + + C Increment ctr + vmr C0,CTR + vadduwm CTR,CTR,INC + vmr C1,CTR + vadduwm CTR,CTR,INC + vmr C2,CTR + vadduwm CTR,CTR,INC + vmr C3,CTR + vadduwm CTR,CTR,INC + + vxor C0,C0,K + vxor C1,C1,K + vxor C2,C2,K + vxor C3,C3,K + + mtctr ROUNDS + li r15,1*16 + +.align 5 +L4x_aes_rnd_loop: + lxvb16x VSR(K),r15,KEYS + vcipher C0,C0,K + vcipher C1,C1,K + vcipher C2,C2,K + vcipher C3,C3,K + addi r15,r15,1*16 + bdnz L4x_aes_rnd_loop + + lxvb16x VSR(K),r15,KEYS + vcipherlast C0,C0,K + vcipherlast C1,C1,K + vcipherlast C2,C2,K + vcipherlast C3,C3,K + + C AES(counter) XOR ciphertext = plaintext + li r15,1*16 + li r16,2*16 + li r17,3*16 + lxvb16x VSR(S0),0,SRC + lxvb16x VSR(S1),r15,SRC + lxvb16x VSR(S2),r16,SRC + lxvb16x VSR(S3),r17,SRC + vxor C0,C0,S0 + vxor C1,C1,S1 + vxor C2,C2,S2 + vxor C3,C3,S3 + + C Store plaintext in DST + stxvb16x VSR(C0),0,DST + stxvb16x VSR(C1),r15,DST + stxvb16x VSR(C2),r16,DST + stxvb16x VSR(C3),r17,DST + +L4x_gcm: + C previous digest combining + vxor S0,S0,D + + C polynomial multiplication + vpmsumd F2,H3L,S1 + vpmsumd R2,H3M,S1 + vpmsumd F3,H2L,S2 + vpmsumd R3,H2M,S2 + vpmsumd F4,H1L,S3 + vpmsumd R4,H1M,S3 + vpmsumd F,H4L,S0 + vpmsumd R,H4M,S0 + + C deferred recombination of partial products + vxor F3,F3,F4 + vxor R3,R3,R4 + vxor F,F,F2 + vxor R,R,R2 + vxor F,F,F3 + vxor R,R,R3 + + C reduction + vpmsumd T,F,POLY_L + xxswapd VSR(D),VSR(F) + vxor R,R,T + vxor D,R,D + + addi DST,DST,4*16 + addi SRC,SRC,4*16 + clrldi LENGTH,LENGTH,58 C 'set the high-order 58 bits to zeros' + +L2x: + C --- process 2 blocks --- + srdi. r14,LENGTH,5 C 'LENGTH / (2 * 16)' + beq L1x + + C load table elements + li r15,1*16 + li r16,2*16 + li r17,3*16 + lxvd2x VSR(H1M),0,TABLE + lxvd2x VSR(H1L),r15,TABLE + lxvd2x VSR(H2M),r16,TABLE + lxvd2x VSR(H2L),r17,TABLE + +L2x_aes: + lxvb16x VSR(K),0,KEYS + + C Increment ctr + vmr C0,CTR + vadduwm CTR,CTR,INC + vmr C1,CTR + vadduwm CTR,CTR,INC + + vxor C0,C0,K + vxor C1,C1,K + + mtctr ROUNDS + li r15,1*16 + +.align 5 +L2x_aes_rnd_loop: + lxvb16x VSR(K),r15,KEYS + vcipher C0,C0,K + vcipher C1,C1,K + addi r15,r15,1*16 + bdnz L2x_aes_rnd_loop + + lxvb16x VSR(K),r15,KEYS + vcipherlast C0,C0,K + vcipherlast C1,C1,K + + C AES(counter) XOR ciphertext = plaintext + li r15,1*16 + lxvb16x VSR(S0),0,SRC + lxvb16x VSR(S1),r15,SRC + vxor C0,C0,S0 + vxor C1,C1,S1 + + C Store plaintext in DST + stxvb16x VSR(C0),0,DST + stxvb16x VSR(C1),r15,DST + +L2x_gcm: + C previous digest combining + vxor S0,S0,D + + C polynomial multiplication + vpmsumd F2,H1L,S1 + vpmsumd R2,H1M,S1 + vpmsumd F,H2L,S0 + vpmsumd R,H2M,S0 + + C deferred recombination of partial products + vxor F,F,F2 + vxor R,R,R2 + + C reduction + vpmsumd T,F,POLY_L + xxswapd VSR(D),VSR(F) + vxor R,R,T + vxor D,R,D + + addi DST,DST,2*16 + addi SRC,SRC,2*16 + clrldi LENGTH,LENGTH,59 C 'set the high-order 59 bits to zeros' + +L1x: + C --- process 1 block --- + srdi. 
r14,LENGTH,4 C 'LENGTH / (1 * 16)' + beq Lpartial + + C load table elements + li r15,1*16 + lxvd2x VSR(H1M),0,TABLE + lxvd2x VSR(H1L),r15,TABLE + +L1x_aes: + lxvb16x VSR(K),0,KEYS + + C Increment ctr + vmr C0,CTR + vadduwm CTR,CTR,INC + + vxor C0,C0,K + + mtctr ROUNDS + li r15,1*16 + +.align 5 +L1x_aes_rnd_loop: + lxvb16x VSR(K),r15,KEYS + vcipher C0,C0,K + addi r15,r15,1*16 + bdnz L1x_aes_rnd_loop + + lxvb16x VSR(K),r15,KEYS + vcipherlast C0,C0,K + + C AES(counter) XOR ciphertext = plaintext + lxvb16x VSR(S0),0,SRC + vxor C0,C0,S0 + + C Store plaintext in DST + stxvb16x VSR(C0),0,DST + +L1x_gcm: + C previous digest combining + vxor S0,S0,D + + C polynomial multiplication + vpmsumd F,H1L,S0 + vpmsumd R,H1M,S0 + + C reduction + vpmsumd T,F,POLY_L + xxswapd VSR(D),VSR(F) + vxor R,R,T + vxor D,R,D + + addi DST,DST,1*16 + addi SRC,SRC,1*16 + clrldi LENGTH,LENGTH,60 C 'set the high-order 60 bits to zeros' + +Lpartial: + C --- process partial block --- + cmpldi LENGTH,0 + beq Ldone + + C load table elements + li r15,1*16 + lxvd2x VSR(H1M),0,TABLE + lxvd2x VSR(H1L),r15,TABLE + +Lpartial_aes: + lxvb16x VSR(K),0,KEYS + + C Increment ctr + vmr C0,CTR + vadduwm CTR,CTR,INC + + vxor C0,C0,K + + mtctr ROUNDS + li r15,1*16 + +.align 5 +Lpartial_aes_rnd_loop: + lxvb16x VSR(K),r15,KEYS + vcipher C0,C0,K + addi r15,r15,1*16 + bdnz Lpartial_aes_rnd_loop + + lxvb16x VSR(K),r15,KEYS + vcipherlast C0,C0,K + + C Load the partial block left-aligned and zero-padded + sldi LENGTH,LENGTH,56 + lxvll VSR(S0),SRC,LENGTH + + C AES(counter) XOR ciphertext = plaintext + vxor C0,C0,S0 + + C Store plaintext in DST + stxvll VSR(C0),DST,LENGTH + +Lpartial_gcm: + C previous digest combining + vxor S0,S0,D + + C polynomial multiplication + vpmsumd F,H1L,S0 + vpmsumd R,H1M,S0 + + C reduction + vpmsumd T,F,POLY_L + xxswapd VSR(D),VSR(F) + vxor R,R,T + vxor D,R,D + +Ldone: + stxvb16x VSR(D),0,X C store digest 'D' + stxvb16x VSR(CTR),0,PCTR C store updated 'ctr' + + C Restore non-volatiles from the 288B stack redzone + ld r14,-8*1(SP) + ld r15,-8*2(SP) + ld r16,-8*3(SP) + ld r17,-8*4(SP) + lxv VSR(v20),-16*3(SP) + lxv VSR(v21),-16*4(SP) + lxv VSR(v22),-16*5(SP) + lxv VSR(v23),-16*6(SP) + lxv VSR(v24),-16*7(SP) + lxv VSR(v25),-16*8(SP) + lxv VSR(v26),-16*9(SP) + lxv VSR(v27),-16*10(SP) + lxv VSR(v28),-16*11(SP) + lxv VSR(v29),-16*12(SP) + lxv VSR(v30),-16*13(SP) + lxv VSR(v31),-16*14(SP) + + li r3,0 C return 0 for success + blr + +EPILOGUE(_nettle_gcm_aes_decrypt) + +.data +.align 4 +C 0xC2000000000000000000000000000001 +.polynomial: +IF_BE(` + .byte 0xC2 + .rept 14 + .byte 0x00 + .endr + .byte 0x01 +',` + .byte 0x01 + .rept 14 + .byte 0x00 + .endr + .byte 0xC2 +') +.align 4 +.increment: +IF_LE(` + .byte 1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0 +') +IF_BE(` + .byte 0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1 +')
Enable runtime override via FAT for gcm_aes_{de,en}crypt() on ppc ISA 3.0 (P9 and beyond) platforms.
Signed-off-by: Christopher M. Riedl <cmr@linux.ibm.com>
---
 fat-ppc.c                         | 33 +++++++++++++++++++++++++++
 fat-setup.h                       |  6 +++++
 gcm-internal.h                    | 14 ++++++++++++
 powerpc64/fat/gcm-aes-decrypt.asm | 37 +++++++++++++++++++++++++++
 powerpc64/fat/gcm-aes-encrypt.asm | 37 +++++++++++++++++++++++++++
 5 files changed, 127 insertions(+)
 create mode 100644 powerpc64/fat/gcm-aes-decrypt.asm
 create mode 100644 powerpc64/fat/gcm-aes-encrypt.asm
diff --git a/fat-ppc.c b/fat-ppc.c index 67ef46ab..9cc9d526 100644 --- a/fat-ppc.c +++ b/fat-ppc.c @@ -173,6 +173,14 @@ DECLARE_FAT_FUNC_VAR(gcm_init_key, gcm_init_key_func, ppc64) DECLARE_FAT_FUNC(_nettle_gcm_hash, gcm_hash_func) DECLARE_FAT_FUNC_VAR(gcm_hash, gcm_hash_func, c) DECLARE_FAT_FUNC_VAR(gcm_hash, gcm_hash_func, ppc64) + +DECLARE_FAT_FUNC(_nettle_gcm_aes_encrypt, gcm_aes_crypt_func) +DECLARE_FAT_FUNC_VAR(gcm_aes_encrypt, gcm_aes_crypt_func, c) +DECLARE_FAT_FUNC_VAR(gcm_aes_encrypt, gcm_aes_crypt_func, ppc64) + +DECLARE_FAT_FUNC(_nettle_gcm_aes_decrypt, gcm_aes_crypt_func) +DECLARE_FAT_FUNC_VAR(gcm_aes_decrypt, gcm_aes_crypt_func, c) +DECLARE_FAT_FUNC_VAR(gcm_aes_decrypt, gcm_aes_crypt_func, ppc64) #endif /* GCM_TABLE_BITS == 8 */
DECLARE_FAT_FUNC(_nettle_chacha_core, chacha_core_func) @@ -238,6 +246,20 @@ fat_init (void) nettle_chacha_crypt_vec = _nettle_chacha_crypt_1core; nettle_chacha_crypt32_vec = _nettle_chacha_crypt32_1core; } + if (features.have_isa_30) + { + if (verbose) + fprintf (stderr, "libnettle: enabling arch 3.0 code.\n"); +#if GCM_TABLE_BITS == 8 + _nettle_gcm_aes_encrypt_vec = _nettle_gcm_aes_encrypt_ppc64; + _nettle_gcm_aes_decrypt_vec = _nettle_gcm_aes_decrypt_ppc64; +#endif /* GCM_TABLE_BITS == 8 */ + } + else + { + _nettle_gcm_aes_encrypt_vec = _nettle_gcm_aes_encrypt_c; + _nettle_gcm_aes_decrypt_vec = _nettle_gcm_aes_decrypt_c; + } }
DEFINE_FAT_FUNC(_nettle_aes_encrypt, void, @@ -263,6 +285,17 @@ DEFINE_FAT_FUNC(_nettle_gcm_hash, void, (const struct gcm_key *key, union nettle_block16 *x, size_t length, const uint8_t *data), (key, x, length, data)) + +DEFINE_FAT_FUNC(_nettle_gcm_aes_encrypt, int, + (const struct gcm_key *key, union nettle_block16 *x, + size_t length, const uint8_t *src, unsigned rounds, + const uint32_t *keys, uint8_t *dst, uint8_t* ctr), + (key, x, length, src, rounds, keys, dst, ctr)) +DEFINE_FAT_FUNC(_nettle_gcm_aes_decrypt, int, + (const struct gcm_key *key, union nettle_block16 *x, + size_t length, const uint8_t *src, unsigned rounds, + const uint32_t *keys, uint8_t *dst, uint8_t* ctr), + (key, x, length, src, rounds, keys, dst, ctr)) #endif /* GCM_TABLE_BITS == 8 */
DEFINE_FAT_FUNC(_nettle_chacha_core, void, diff --git a/fat-setup.h b/fat-setup.h index 4e528d6b..70c271e5 100644 --- a/fat-setup.h +++ b/fat-setup.h @@ -194,3 +194,9 @@ typedef void chacha_crypt_func(struct chacha_ctx *ctx, size_t length, uint8_t *dst, const uint8_t *src); + +typedef int gcm_aes_crypt_func(const struct gcm_key *key, + union nettle_block16 *x, size_t length, + const uint8_t *src, unsigned rounds, + const uint32_t *keys, uint8_t *dst, + uint8_t* ctr); diff --git a/gcm-internal.h b/gcm-internal.h index 2e28be2d..63373d95 100644 --- a/gcm-internal.h +++ b/gcm-internal.h @@ -51,4 +51,18 @@ _nettle_gcm_hash_c (const struct gcm_key *key, union nettle_block16 *x, size_t length, const uint8_t *data); #endif
+#if HAVE_NATIVE_fat_gcm_aes_encrypt +int +_nettle_gcm_aes_encrypt_c (const struct gcm_key *key, union nettle_block16 *x, + size_t length, const uint8_t *src, unsigned rounds, + const uint32_t *keys, uint8_t *dst, uint8_t* ctr); +#endif + +#if HAVE_NATIVE_fat_gcm_aes_decrypt +int +_nettle_gcm_aes_decrypt_c (const struct gcm_key *key, union nettle_block16 *x, + size_t length, const uint8_t *src, unsigned rounds, + const uint32_t *keys, uint8_t *dst, uint8_t* ctr); +#endif + #endif /* NETTLE_GCM_INTERNAL_H_INCLUDED */ diff --git a/powerpc64/fat/gcm-aes-decrypt.asm b/powerpc64/fat/gcm-aes-decrypt.asm new file mode 100644 index 00000000..a6bd2e36 --- /dev/null +++ b/powerpc64/fat/gcm-aes-decrypt.asm @@ -0,0 +1,37 @@ +C powerpc64/fat/gcm-aes-decrypt.asm + +ifelse(` + Copyright (C) 2021 Christopher M. Riedl + + This file is part of GNU Nettle. + + GNU Nettle is free software: you can redistribute it and/or + modify it under the terms of either: + + * the GNU Lesser General Public License as published by the Free + Software Foundation; either version 3 of the License, or (at your + option) any later version. + + or + + * the GNU General Public License as published by the Free + Software Foundation; either version 2 of the License, or (at your + option) any later version. + + or both in parallel, as here. + + GNU Nettle is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + General Public License for more details. + + You should have received copies of the GNU General Public License and + the GNU Lesser General Public License along with this program. If + not, see http://www.gnu.org/licenses/. +') + +dnl picked up by configure +dnl PROLOGUE(_nettle_fat_gcm_aes_decrypt) + +define(`fat_transform', `$1_ppc64') +include_src(`powerpc64/p9/gcm-aes-decrypt.asm') diff --git a/powerpc64/fat/gcm-aes-encrypt.asm b/powerpc64/fat/gcm-aes-encrypt.asm new file mode 100644 index 00000000..1cffce9d --- /dev/null +++ b/powerpc64/fat/gcm-aes-encrypt.asm @@ -0,0 +1,37 @@ +C powerpc64/fat/gcm-aes-encrypt.asm + +ifelse(` + Copyright (C) 2021 Christopher M. Riedl + + This file is part of GNU Nettle. + + GNU Nettle is free software: you can redistribute it and/or + modify it under the terms of either: + + * the GNU Lesser General Public License as published by the Free + Software Foundation; either version 3 of the License, or (at your + option) any later version. + + or + + * the GNU General Public License as published by the Free + Software Foundation; either version 2 of the License, or (at your + option) any later version. + + or both in parallel, as here. + + GNU Nettle is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU + General Public License for more details. + + You should have received copies of the GNU General Public License and + the GNU Lesser General Public License along with this program. If + not, see http://www.gnu.org/licenses/. +') + +dnl picked up by configure +dnl PROLOGUE(_nettle_fat_gcm_aes_encrypt) + +define(`fat_transform', `$1_ppc64') +include_src(`powerpc64/p9/gcm-aes-encrypt.asm')
"Christopher M. Riedl" cmr@linux.ibm.com writes:
An implementation combining AES+GCM _can potentially_ yield significant performance boosts by allowing for increased instruction parallelism, avoiding C-function call overhead, more flexibility in assembly fine-tuning, etc. This series provides such an implementation based on the existing optimized Nettle routines for POWER9 and later processors. Benchmark results on a POWER9 Blackbird running at 3.5GHz are given at the end of this mail.
Benchmark results are impressive. If I get the numbers right, cycles per block (16 bytes) is reduced from 40 to 22.5. You can run nettle-benchmark with the flag -f 3.5e9 (for 3.5GHz clock frequency) to get cycle numbers in the output.
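(As a back-of-the-envelope check of that conversion, assuming a 3.5 GHz clock, 16-byte blocks, and Mbyte meaning 2^20 bytes, which seems to match how nettle-benchmark scales its output; illustrative arithmetic only, not Nettle API:)

  #include <stdio.h>

  /* Convert the cover-letter gcm_aes128 encrypt throughput (nettle master
     vs. this series) into cycles/byte and cycles/block. */
  int
  main (void)
  {
    const double freq = 3.5e9;
    const double mbyte_per_s[2] = { 1418.66, 2567.62 };

    for (int i = 0; i < 2; i++)
      {
        double cpb = freq / (mbyte_per_s[i] * 1048576.0);  /* cycles/byte */
        printf ("%8.2f Mbyte/s -> %.2f cycles/byte, %.1f cycles/block\n",
                mbyte_per_s[i], cpb, 16.0 * cpb);
      }
    return 0;
  }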
I'm a bit conservative about adding assembly code for combined operations, since it can lead to an explosion in the amount of code to maintain. So I'd like to understand a bit better where the 17.5 saved cycles were spent. For the code on master, gcm_encrypt (with aes) is built from these building blocks:
* gcm_fill
C code, essentially two 64-bit stores per block. On little-endian machines it also needs some byte swapping (a rough sketch follows this list).
* aes_encrypt
Using power assembly. Performance measured as the "aes128 ECB encrypt" line in nettle-benchmark output.
* memxor3
This is C code on power (and rather hairy C code). Performance can be measured with nettle-benchmark, and it's going to be a bit alignment dependent.
* gcm_hash
This uses power assembly. Performance is measured as the "gcm update" line in nettle-benchmark output. From your numbers, this seems to be 7.3 cycles per block.
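As a point of reference for the gcm_fill item above, here is a rough byte-oriented sketch of what it has to produce (a hypothetical simplification, not the actual gcm.c code, which uses two 64-bit stores per block as noted):

  #include <stdint.h>
  #include <string.h>

  /* Each 16-byte block is the 12-byte IV part followed by a 32-bit
     big-endian block counter that increments per block. */
  static void
  gcm_fill_sketch (uint8_t *ctr, size_t blocks, uint8_t *buffer)
  {
    uint32_t c = ((uint32_t) ctr[12] << 24) | ((uint32_t) ctr[13] << 16)
               | ((uint32_t) ctr[14] << 8)  |  (uint32_t) ctr[15];

    for (size_t i = 0; i < blocks; i++, c++, buffer += 16)
      {
        memcpy (buffer, ctr, 12);      /* IV part stays constant */
        buffer[12] = c >> 24;          /* per-block counter, big-endian */
        buffer[13] = c >> 16;
        buffer[14] = c >> 8;
        buffer[15] = c;
      }
    /* Write the incremented counter back for the next call. */
    ctr[12] = c >> 24; ctr[13] = c >> 16; ctr[14] = c >> 8; ctr[15] = c;
  }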
So before going all the way with a combined aes_gcm function, I think it's good to try to optimize the building blocks. Please benchmark memxor3, to see if it could benefit from assembly implementation. If so, that should give a nice speedup to several modes, not just gcm. (If you implement memxor3, beware that it needs to support some overlap, to not break in-place CBC decrypt).
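For reference when benchmarking, memxor3's job is simply dst[i] = a[i] ^ b[i]; a naive sketch follows (the real memxor3.c works a word at a time and handles misaligned pointers, which is the "hairy" part; the backwards loop below is only one assumed way to tolerate the overlap mentioned above, not the documented contract):

  #include <stddef.h>
  #include <stdint.h>

  /* Naive reference for memxor3 semantics.  Walking from the end keeps
     the in-place CBC decrypt style of overlap working. */
  static void *
  memxor3_naive (void *dst_v, const void *a_v, const void *b_v, size_t n)
  {
    uint8_t *dst = dst_v;
    const uint8_t *a = a_v, *b = b_v;

    while (n-- > 0)
      dst[n] = a[n] ^ b[n];
    return dst_v;
  }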
Another potential overhead is that data is stored to memory when passed between these functions. It seems we store a block 3 times and load a block 4 times (the additional accesses should be cache friendly, but will still cost some execution resources). Optimizing that seems to need some kind of combined function. But maybe it is sufficient to optimize something a bit more general than aes gcm, e.g., aes ctr?
Regards, /Niels
On Mon Apr 5, 2021 at 2:39 AM CDT, Niels Möller wrote:
"Christopher M. Riedl" cmr@linux.ibm.com writes:
An implementation combining AES+GCM _can potentially_ yield significant performance boosts by allowing for increased instruction parallelism, avoiding C-function call overhead, more flexibility in assembly fine-tuning, etc. This series provides such an implementation based on the existing optimized Nettle routines for POWER9 and later processors. Benchmark results on a POWER9 Blackbird running at 3.5GHz are given at the end of this mail.
Benchmark results are impressive. If I get the numbers right, cycles per block (16 bytes) is reduced from 40 to 22.5. You can run nettle-benchmark with the flag -f 3.5e9 (for 3.5GHz clock frequency) to get cycle numbers in the output.
Hi Niels,
Your math is very close - here are benchmark results for encrypt (since decrypt is essentially the same):
AES+GCM combined (this series)
------------------------------
Algorithm   mode     Mbyte/s  cycles/byte  cycles/block
gcm_aes128  encrypt  2564.32         1.30         20.83
gcm_aes192  encrypt  2276.86         1.47         23.46
gcm_aes256  encrypt  2051.87         1.63         26.03
AES,GCM separate (nettle master)
--------------------------------
Algorithm   mode     Mbyte/s  cycles/byte  cycles/block
gcm_aes128  encrypt  1419.17         2.35         37.63
gcm_aes192  encrypt  1313.69         2.54         40.65
gcm_aes256  encrypt  1218.79         2.74         43.82
So for aes128: 37.63 - 20.83 = 16.80 cycles/block improvement.
I'm a bit conservative about adding assembly code for combined operations, since it can lead to an explosion in the amount of code to maintain. So I'd like to understand a bit better where the 17.5 saved cycles were spent. For the code on master, gcm_encrypt (with aes) is built from these building blocks:
Makes perfect sense to me!
- gcm_fill
C code, essentially 2 64-bit stores per block. On little endian, it also needs some byte swapping.
- aes_encrypt
Using power assembly. Performance measured as the "aes128 ECB encrypt" line in nettle-benchmark output.
- memxor3
This is C code on power (and rather hairy C code). Performance can be measured with nettle-benchmark, and it's going to be a bit alignment dependent.
- gcm_hash
This uses power assembly. Performance is measured as the "gcm update" line in nettle-benchmark output. From your numbers, this seems to be 7.3 cycles per block.
So before going all the way with a combined aes_gcm function, I think it's good to try to optimize the building blocks. Please benchmark memxor3, to see if it could benefit from assembly implementation. If so, that should give a nice speedup to several modes, not just gcm. (If you implement memxor3, beware that it needs to support some overlap, to not break in-place CBC decrypt).
The benchmark results don't convince me that memxor3 and memxor are actually a huge bottleneck by themselves. They do appear to show that my combined implementation is dominated by the cost of AES (which matches what I see when I run a simple test encrypt program with the 'perf' utility):
Algorithm   mode          Mbyte/s  cycles/byte  cycles/block
memxor      aligned      16634.14         0.20          1.61
memxor      unaligned    11089.33         0.30          2.41
memxor3     aligned      17261.19         0.19          1.55
memxor3     unaligned01  11549.04         0.29          2.31
memxor3     unaligned11  11181.62         0.30          2.39
memxor3     unaligned12   8485.88         0.39          3.15
aes128 ECB  encrypt       2762.38         1.21         19.33
aes128 ECB  decrypt       2203.65         1.51         24.24
I tried a few other experiments:
1. Replace memxor/3 with a no-op function (i.e. just 'return'):

   Algorithm   mode     Mbyte/s  cycles/byte  cycles/block
   gcm_aes128  encrypt  1553.08         2.15         34.39
   gcm_aes192  encrypt  1428.57         2.34         37.38
   gcm_aes256  encrypt  1318.05         2.53         40.52
aes128: 37.63 - 34.39 = 3.24 cycles/block
2. Replace memxor/3 and gcm_fill w/ a no-op function:

   Algorithm   mode     Mbyte/s  cycles/byte  cycles/block
   gcm_aes128  encrypt  1793.37         1.86         29.78
   gcm_aes192  encrypt  1625.74         2.05         32.85
   gcm_aes256  encrypt  1483.81         2.25         35.99
aes128: 34.39 - 29.78 = 4.61 cycles/block
3. Replace memxor/3 and gcm_fill w/ no-op functions and use POWER9 instructions lxvb16x/stxvb16x to load/store unaligned vectors and avoid the permutes on LE:

   Algorithm   mode     Mbyte/s  cycles/byte  cycles/block
   gcm_aes128  encrypt  2069.67         1.61         25.80
   gcm_aes192  encrypt  1875.97         1.78         28.47
   gcm_aes256  encrypt  1717.33         1.94         31.10
aes128: 29.78 - 25.80 = 3.98 cycles/block
So in total, if we assume ideal (but impossible) zero-cost versions of memxor, memxor3, and gcm_fill and avoid permutes via ISA 3.0 vector loads/stores, we can only account for 11.82 cycles/block, leaving 4.97 cycles/block as an additional benefit of the combined implementation. And since zero-cost implementations of these building blocks are not achievable in practice, the real improvement from the combined implementation is greater than 5 cycles/block.
Another potential overhead is that data is stored to memory when passed between these functions. It seems we store a block 3 times and load a block 4 times (the additional accesses should be cache friendly, but will still cost some execution resources). Optimizing that seems to need some kind of combined function. But maybe it is sufficient to optimize something a bit more general than aes gcm, e.g., aes ctr?
This would basically have to replace the _nettle_ctr_crypt16() function call with arch-specific assembly, right? I can code this up and try it out in the context of AES-GCM.
Thanks! Chris R.
Regards, /Niels
-- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance.
"Christopher M. Riedl" cmr@linux.ibm.com writes:
So in total, if we assume an ideal (but impossible) zero-cost version for memxor, memxor3, and gcm_fill and avoid permutes via ISA 3.0 vector load/stores we can only account for 11.82 cycles/block; leaving 4.97 cycles/block as an additional benefit of the combined implementation.
One hypothesis for that gain is that we can avoid storing the aes input in memory at all; instead, generate the counter values on the fly in the appropriate registers.
Another potential overhead is that data is stored to memory when passed between these functions. It seems we store a block 3 times and load a block 4 times (the additional accesses should be cache friendly, but will still cost some execution resources). Optimizing that seems to need some kind of combined function. But maybe it is sufficient to optimize something a bit more general than aes gcm, e.g., aes ctr?
This would basically have to replace the nettle_crypt16 function call with arch-specific assembly, right? I can code this up and try it out in the context of AES-GCM.
Yes, something like that. If we leave the _nettle_gcm_hash unchanged (with its own independent assembly implementation), and look at gcm_encrypt, what we have is
const void *cipher, nettle_cipher_func *f,
_nettle_ctr_crypt16(cipher, f, gcm_fill, ctx->ctr.b, length, dst, src);
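For readers without the source at hand, the surrounding function in gcm.c has roughly this shape (paraphrased from nettle master, not an exact copy; asserts and declarations omitted):

/* gcm.c (paraphrased) */
void
gcm_encrypt (struct gcm_ctx *ctx, const struct gcm_key *key,
             const void *cipher, nettle_cipher_func *f,
             size_t length, uint8_t *dst, const uint8_t *src)
{
  /* Counter-mode pass: gcm_fill writes the per-block counter values into a
     temporary buffer, f encrypts them, and the result is xored with src. */
  _nettle_ctr_crypt16 (cipher, f, gcm_fill, ctx->ctr.b, length, dst, src);

  /* GHASH pass over the ciphertext just produced. */
  _nettle_gcm_hash (key, &ctx->x, length, dst);

  ctx->data_size += length;
}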
It would be nice if we could replace that with a call to aes_ctr_crypt, and then optimizing that would benefit both gcm and plain ctr. But it's not quite that easy, because gcm unfortunately uses its own variant of ctr mode, which is why we need to pass the gcm_fill function in the first place.
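Concretely, gcm's counter mode increments only the last 32 bits of the counter block, as a big-endian counter. The generic C gcm_fill in gcm.c looks roughly like this (paraphrased; nettle also carries endianness-optimized variants, and READ_UINT32/WRITE_UINT32 are the big-endian load/store macros from macros.h, available inside gcm.c):

/* gcm.c (paraphrased): generate 'blocks' counter blocks into 'buffer',
   incrementing only the low 32 bits of the counter, big-endian. */
static void
gcm_fill (uint8_t *ctr, size_t blocks, union nettle_block16 *buffer)
{
  uint32_t c = READ_UINT32 (ctr + GCM_BLOCK_SIZE - 4);

  for (; blocks-- > 0; buffer++, c++)
    {
      memcpy (buffer->b, ctr, GCM_BLOCK_SIZE - 4);
      WRITE_UINT32 (buffer->b + GCM_BLOCK_SIZE - 4, c);
    }

  WRITE_UINT32 (ctr + GCM_BLOCK_SIZE - 4, c);
}

Nettle's plain ctr mode, by contrast, increments the full 16-byte block as a big-endian counter, which is why the two cannot share the same fill logic.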
So it seems we need separate assembly for aes_plain_ctr and aes_gcm_ctr (they *might* still share some code, but they would be distinct entry points). Say we call the gcm-specific ctr function from some variant of gcm_encrypt via a different function pointer. Then that gcm_encrypt variant is getting a bit pointless. Maybe it's better to do
void aes128_gcm_encrypt(...) { _nettle_aes128_gcm_ctr(...); _nettle_gcm_hash(...); }
At least, we avoid duplicating the _gcm_hash for aes128, aes192, aes256 (and any other algorithms we might want to optimize in a similar way). And each of the aes assembly routines should be fairly small and easy to maintain.
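Spelled out with argument lists, such a wrapper might look like the sketch below. The _nettle_aes128_gcm_ctr name and its signature are hypothetical (it stands in for the new arch-specific gcm-style ctr routine suggested above); struct gcm_ctx, struct gcm_key, struct aes128_ctx and _nettle_gcm_hash are the existing nettle types and internal hash function:

/* Hypothetical wrapper; only _nettle_gcm_hash exists today. */
void
aes128_gcm_encrypt (struct gcm_ctx *ctx, const struct gcm_key *key,
                    const struct aes128_ctx *cipher,
                    size_t length, uint8_t *dst, const uint8_t *src)
{
  /* gcm-style ctr: counter generation, AES rounds, and the xor with src
     done in one assembly routine, no intermediate buffer (hypothetical). */
  _nettle_aes128_gcm_ctr (cipher, ctx->ctr.b, length, dst, src);

  /* Existing GHASH over the ciphertext, shared by all key sizes. */
  _nettle_gcm_hash (key, &ctx->x, length, dst);

  ctx->data_size += length;
}

The aes192/aes256 variants would differ only in which ctr routine they call, so _nettle_gcm_hash and its assembly stay shared.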
I wonder if there are any reasonable alternatives with similar performance? One idea that occurs to me is to replace the role of the gcm_fill function (and the nettle_fill16_func type) with an arch-specific, assembly-only hook interface that gets its inputs in specified registers and is expected to produce the next cipher input in registers.
We could then have an aes128_any_encrypt that takes the same args as aes128_encrypt + a pointer to such a magic assembly function.
The aes128_any_encrypt assembly would then put the required input in the right registers (address of clear text, current counter block, previous ciphertext block, etc.) and have a loop where each iteration calls the hook and encrypts a block from registers.
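As a rough C-level picture of that idea (names entirely hypothetical; the hook would in reality be an assembly entry point with a register-based contract rather than a normal C function, so the pointer type below is only illustrative):

/* Illustrative prototypes only -- not a proposed API. */
typedef void nettle_block_hook_func (void);   /* placeholder; the real contract is in registers */

void
aes128_any_encrypt (const struct aes128_ctx *ctx,
                    nettle_block_hook_func *hook,
                    size_t length, uint8_t *dst, const uint8_t *src);

/* Per iteration, the assembly loop would:
     1. place the agreed inputs in registers (source pointer, current
        counter block, previous ciphertext block, ...);
     2. call *hook, which produces the next cipher input block in registers;
     3. run the AES rounds on those registers and write out the block. */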
But I'm afraid it's not going to be so easy, given that, where possible (i.e., in all modes but cbc encrypt), we would like to have the option to do multiple blocks in parallel. Perhaps it is better to have an assembly interface to functions doing ECB on one block, two blocks, three blocks (if there is a sufficient number of registers), etc., in registers, and call that from the other assembly functions. A bit like the recent chacha_Ncore functions, but with input and output in registers rather than stored in memory.
Regards, /Niels
On Thu, May 20, 2021 at 10:06 PM Niels Möller nisse@lysator.liu.se wrote:
"Christopher M. Riedl" cmr@linux.ibm.com writes:
So in total, if we assume an ideal (but impossible) zero-cost version for memxor, memxor3, and gcm_fill and avoid permutes via ISA 3.0 vector load/stores we can only account for 11.82 cycles/block; leaving 4.97 cycles/block as an additional benefit of the combined implementation.
One hypothesis for that gain is that we can avoid storing the aes input in memory at all; instead, generating the counter values on the fly in the appropriate registers.
Another potential overhead is that data is stored to memory when passed between these functions. It seems we store a block 3 times, and load a block 4 times (the additional accesses should be cache friendly, but will still cost some execution resources). Optimizing that seems to need some kind of combined function. But maybe it is sufficient to optimize something a bit more general than aes gcm, e.g., aes ctr?
This would basically have to replace the _nettle_ctr_crypt16 function call with arch-specific assembly, right? I can code this up and try it out in the context of AES-GCM.
Yes, something like that. If we leave the _nettle_gcm_hash unchanged (with its own independent assembly implementation), and look at gcm_encrypt, what we have is
const void *cipher, nettle_cipher_func *f,
_nettle_ctr_crypt16(cipher, f, gcm_fill, ctx->ctr.b, length, dst, src);
It would be nice if we could replace that with a call to aes_ctr_crypt, and then optimizing that would benefit both gcm and plain ctr. But it's not quite that easy, because gcm unfortunately uses its own variant of ctr mode, which is why we need to pass the gcm_fill function in the first place.
So it seems we need separate assembly for aes_plain_ctr and aes_gcm_ctr (they *might* still share some code, but they would be distinct entry points). Say we call the gcm-specific ctr function from some variant of gcm_encrypt via a different function pointer. Then that gcm_encrypt variant is getting a bit pointless. Maybe it's better to do
void aes128_gcm_encrypt(...) { _nettle_aes128_gcm_ctr(...); _nettle_gcm_hash(...); }
At least, we avoid duplicating the _gcm_hash for aes128, aes192, aes256 (and any other algorithms we might want to optimize in a similar way). And each of the aes assembly routines should be fairly small and easy to maintain.
While writing the white paper "Optimize AES-GCM for PowerPC architecture processors", I concluded that this is the best approach for the PowerPC architecture: it is easy to maintain, avoids duplication, and performs well. I've separated aes_gcm encrypt/decrypt into two functions, aes_ctr and ghash, both implemented using Power ISA v3.00 assisted with vector-scalar registers. I got 1.18 cycles/byte for gcm-aes-128 encrypt/decrypt, 1.31 cycles/byte for gcm-aes-192 encrypt/decrypt, and 1.44 cycles/byte for gcm-aes-256 encrypt/decrypt.
Still, if there are additional vector registers available, I would give the combined function a shot, as it eliminates loading the input message twice.
regards, Mamone
On Thu May 20, 2021 at 3:59 PM EDT, Maamoun TK wrote:
On Thu, May 20, 2021 at 10:06 PM Niels Möller nisse@lysator.liu.se wrote:
"Christopher M. Riedl" cmr@linux.ibm.com writes:
So in total, if we assume an ideal (but impossible) zero-cost version for memxor, memxor3, and gcm_fill and avoid permutes via ISA 3.0 vector load/stores we can only account for 11.82 cycles/block; leaving 4.97 cycles/block as an additional benefit of the combined implementation.
One hypothesis for that gain is that we can avoid storing the aes input in memory at all; instead, generating the counter values on the fly in the appropriate registers.
Another potential overhead is that data is stored to memory when passed between these functions. It seems we store a block 3 times, and load a block 4 times (the additional accesses should be cache friendly, but will still cost some execution resources). Optimizing that seems to need some kind of combined function. But maybe it is sufficient to optimize something a bit more general than aes gcm, e.g., aes ctr?
This would basically have to replace the _nettle_ctr_crypt16 function call with arch-specific assembly, right? I can code this up and try it out in the context of AES-GCM.
Yes, something like that. If we leave the _nettle_gcm_hash unchanged (with its own independent assembly implementation), and look at gcm_encrypt, what we have is
const void *cipher, nettle_cipher_func *f,
_nettle_ctr_crypt16(cipher, f, gcm_fill, ctx->ctr.b, length, dst, src);
It would be nice if we could replace that with a call to aes_ctr_crypt, and then optimizing that would benefit both gcm and plain ctr. But it's not quite that easy, because gcm unfortunately uses its own variant of ctr mode, which is why we need to pass the gcm_fill function in the first place.
So it seems we need separate assembly for aes_plain_ctr and aes_gcm_ctr (they *might* still share some code, but they would be distinct entry points). Say we call the gcm-specific ctr function from some variant of gcm_encrypt via a different function pointer. Then that gcm_encrypt variant is getting a bit pointless. Maybe it's better to do
void aes128_gcm_encrypt(...) { _nettle_aes128_gcm_ctr(...); _nettle_gcm_hash(...); }
At least, we avoid duplicating the _gcm_hash for aes128, aes192, aes256 (and any other algorithms we might want to optimize in a similar way). And each of the aes assembly routines should be fairly small and easy to maintain.
While writing the white paper "Optimize AES-GCM for PowerPC architecture processors", I concluded that this is the best approach for the PowerPC architecture: it is easy to maintain, avoids duplication, and performs well. I've separated aes_gcm encrypt/decrypt into two functions, aes_ctr and ghash, both implemented using Power ISA v3.00 assisted with vector-scalar registers. I got 1.18 cycles/byte for gcm-aes-128 encrypt/decrypt, 1.31 cycles/byte for gcm-aes-192 encrypt/decrypt, and 1.44 cycles/byte for gcm-aes-256 encrypt/decrypt.
Neat, did you base that on the aes-gcm combined series I posted here or completely different/new code?
Still, if there are additional vector registers available, I would give the combined function a shot, as it eliminates loading the input message twice.
regards, Mamone
On Tue, Jun 1, 2021 at 11:21 PM Christopher M. Riedl cmr@linux.ibm.com wrote:
On Thu May 20, 2021 at 3:59 PM EDT, Maamoun TK wrote:
On Thu, May 20, 2021 at 10:06 PM Niels Möller nisse@lysator.liu.se wrote:
"Christopher M. Riedl" cmr@linux.ibm.com writes:
So in total, if we assume an ideal (but impossible) zero-cost version for memxor, memxor3, and gcm_fill and avoid permutes via ISA 3.0 vector load/stores, we can only account for 11.82 cycles/block; leaving 4.97 cycles/block as an additional benefit of the combined implementation.
One hypothesis for that gain is that we can avoid storing the aes input in memory at all; instead, generating the counter values on the fly in the appropriate registers.
Another potential overhead is that data is stored to memory when passed between these functions. It seems we store a block 3 times, and load a block 4 times (the additional accesses should be cache friendly, but will still cost some execution resources). Optimizing that seems to need some kind of combined function. But maybe it is sufficient to optimize something a bit more general than aes gcm, e.g., aes ctr?
This would basically have to replace the _nettle_ctr_crypt16 function call with arch-specific assembly, right? I can code this up and try it out in the context of AES-GCM.
Yes, something like that. If we leave the _nettle_gcm_hash unchanged (with its own independent assembly implementation), and look at gcm_encrypt, what we have is
const void *cipher, nettle_cipher_func *f,
_nettle_ctr_crypt16(cipher, f, gcm_fill, ctx->ctr.b, length, dst, src);
It would be nice if we could replace that with a call to aes_ctr_crypt, and then optimizing that would benefit both gcm and plain ctr. But it's not quite that easy, because gcm unfortunately uses its own variant of ctr mode, which is why we need to pass the gcm_fill function in the first place.
So it seems we need separate assembly for aes_plain_ctr and aes_gcm_ctr (they *might* still share some code, but they would be distinct entry points). Say we call the gcm-specific ctr function from some variant of gcm_encrypt via a different function pointer. Then that gcm_encrypt variant is getting a bit pointless. Maybe it's better to do
void aes128_gcm_encrypt(...) { _nettle_aes128_gcm_ctr(...); _nettle_gcm_hash(...); }
At least, we avoid duplicating the _gcm_hash for aes128, aes192, aes256 (and any other algorithms we might want to optimize in a similar way). And each of the aes assembly routines should be fairly small and easy to maintain.
While writing the white paper "Optimize AES-GCM for PowerPC architecture processors", I concluded that this is the best approach for the PowerPC architecture: it is easy to maintain, avoids duplication, and performs well. I've separated aes_gcm encrypt/decrypt into two functions, aes_ctr and ghash, both implemented using Power ISA v3.00 assisted with vector-scalar registers. I got 1.18 cycles/byte for gcm-aes-128 encrypt/decrypt, 1.31 cycles/byte for gcm-aes-192 encrypt/decrypt, and 1.44 cycles/byte for gcm-aes-256 encrypt/decrypt.
Neat, did you base that on the aes-gcm combined series I posted here or completely different/new code?
It's based on new code written to fit the paper context.