Currently ghash/gcm performance on ARM in both gcrypt and nettle is a bit abysmal:

=== bench-slopes-nettle ===
 GCM auth | 28.43 ns/B 33.54 MiB/s 39.81 c/B 1400.2
=== bench-slopes-gcrypt ===
 GCM auth | 21.86 ns/B 43.62 MiB/s 30.52 c/B 1396.0
=== bench-slopes-openssl [1.1.1a] ===
 GCM auth | 5.99 ns/B 159.3 MiB/s 8.38 c/B 1399.6
=== cut ===

The current openssl/cryptogams code is based on ideas from https://hal.inria.fr/hal-01506572 (licensed CC BY 4.0), and there is a linked implementation at https://conradoplg.cryptoland.net/software/ecc-and-ae-for-arm-neon/ (licensed LGPL 2.1+), which I guess should be acceptable to borrow.

A very preliminary patch for nettle will be posted as a reply (it passes the nettle regression tests, but needs more extensive testing):

=== bench-slopes-nettle [w/ patched nettle 3.3] ===
 aes128 | nanosecs/byte mebibytes/sec cycles/byte
 GCM auth | 7.07 ns/B 134.9 MiB/s 9.90 c/B
=== cut ===

(And not only is it notably faster, it should also be completely free of cache/timing leaks.)
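For readers comparing the figures above, here is a minimal Python sketch of how the columns relate to each other. It assumes the last column of the unpatched runs is the measured clock in MHz (the output above does not label it), and the helper names are made up for illustration.

```python
# Relating the bench-slopes columns quoted above.
# Assumption: the trailing number (e.g. 1400.2) is the CPU clock in MHz.

MIB = 1 << 20  # bytes per mebibyte


def mib_per_s(ns_per_byte):
    """Throughput in MiB/s for a given cost in nanoseconds per byte."""
    return 1e9 / ns_per_byte / MIB


def cycles_per_byte(ns_per_byte, mhz):
    """Cycles per byte, given ns/B and the clock in MHz."""
    return ns_per_byte * mhz / 1000.0


def speedup(old_ns_per_byte, new_ns_per_byte):
    """How many times faster the new code is than the old."""
    return old_ns_per_byte / new_ns_per_byte


if __name__ == "__main__":
    print("nettle MiB/s:", round(mib_per_s(28.43), 2))
    print("nettle c/B:", round(cycles_per_byte(28.43, 1400.2), 2))
    print("patched vs unpatched nettle:", round(speedup(28.43, 7.07), 2))
```

Because the ns/B values are themselves rounded, recomputed throughput can differ from the printed column in the last digit.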
v2: avoid expensive trap on unaligned LDM, reshuffled some insns
"Yuriy M. Kaminskiy" yumkam@gmail.com writes:
From fa19a36985b7554517e9122b4cd193cd1a9c4f0e Mon Sep 17 00:00:00 2001
From: "Yuriy M. Kaminskiy" yumkam@gmail.com
Date: Sun, 10 Mar 2019 11:08:46 +0300
Subject: [PATCH] Add fast constant-time ARM NEON ghash/gcm

Based on code from https://conradoplg.cryptoland.net/software/ecc-and-ae-for-arm-neon/ and https://hal.inria.fr/hal-01506572

Note: arm->neon is fast, neon->arm slow, so we delay bitreverse (performed in arm) as much as possible and keep ctx->x and ctx->key bitreversed.
Thanks! I think I looked at the paper at some point, and it's clever. Some initial comments.
Regarding bit-reversal, I think carryless multiplication is symmetric under bit-reversal (reversing the two 8-bit inputs corresponds to bit-reversal of the 15-bit product), so unless input and output for some reason use different bit orders, I hope it should be possible to do any needed bit reversal at key setup only.
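The symmetry claim can be checked exhaustively for 8-bit inputs. The following is a small Python sketch, not code from the patch: `clmul` and `bitrev` are illustrative helpers, and the bit-serial loop is only a reference model (it branches on data, so it is not constant-time).

```python
# Check: clmul(rev8(a), rev8(b)) == rev15(clmul(a, b)) for all 8-bit a, b.
# Helper names are hypothetical; this is a reference model, not NEON code.


def clmul(a, b):
    """Carry-less (polynomial over GF(2)) product of two non-negative ints."""
    r = 0
    while b:
        if b & 1:
            r ^= a  # add (XOR) the shifted multiplicand, no carries
        a <<= 1
        b >>= 1
    return r


def bitrev(x, width):
    """Reverse the low `width` bits of x."""
    r = 0
    for _ in range(width):
        r = (r << 1) | (x & 1)
        x >>= 1
    return r


def symmetric_under_reversal():
    """True if the identity holds for every pair of 8-bit inputs."""
    return all(
        clmul(bitrev(a, 8), bitrev(b, 8)) == bitrev(clmul(a, b), 15)
        for a in range(256)
        for b in range(256)
    )


if __name__ == "__main__":
    print(symmetric_under_reversal())
```

The reason it holds: reversing an 8-bit polynomial a(x) gives x^7 a(1/x), so the product of two reversed inputs is x^14 (ab)(1/x), which is the 15-bit reversal of ab.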
+.macro MUL64k3t4 rq rl rh ad bd k16 k32 k48 t0q t0l t0h t1q t1l t1h t2q t2l t2h t3q t3l t3h
Could you do these as m4 macros, like in the rest of the Nettle asm code?
Regards, /Niels
On Sun, 2019-03-10 at 11:38 +0300, Yuriy M. Kaminskiy wrote:
Currently ghash/gcm performance on ARM in both gcrypt and nettle is a bit abysmal:

=== bench-slopes-nettle ===
 GCM auth | 28.43 ns/B 33.54 MiB/s 39.81 c/B 1400.2
=== bench-slopes-gcrypt ===
 GCM auth | 21.86 ns/B 43.62 MiB/s 30.52 c/B 1396.0
=== bench-slopes-openssl [1.1.1a] ===
 GCM auth | 5.99 ns/B 159.3 MiB/s 8.38 c/B 1399.6
=== cut ===

The current openssl/cryptogams code is based on ideas from https://hal.inria.fr/hal-01506572 (licensed CC BY 4.0), and there is a linked implementation at https://conradoplg.cryptoland.net/software/ecc-and-ae-for-arm-neon/ (licensed LGPL 2.1+), which I guess should be acceptable to borrow.

A very preliminary patch for nettle will be posted as a reply (it passes the nettle regression tests, but needs more extensive testing):

=== bench-slopes-nettle [w/ patched nettle 3.3] ===
 aes128 | nanosecs/byte mebibytes/sec cycles/byte
 GCM auth | 7.07 ns/B 134.9 MiB/s 9.90 c/B
=== cut ===

(And not only is it notably faster, it should also be completely free of cache/timing leaks.)
Thank you for that!
regards, Nikos
nettle-bugs@lists.lysator.liu.se