Hello Niels,
On Tue, Feb 02, 2021 at 06:09:42PM +0100, Niels Möller wrote:
I've downloaded binary builds of clang for aarch64 from https://releases.llvm.org/download.html. 3.9.1 was the oldest prebuilt toolchain I could find there and 11.0.0 the most recent.
[...]
They also all support the .arch directive:
$ cat t.s
.arch armv8-a+crypto
pmull v2.1q, v2.1d, v1.1d
$ aarch64-unknown-linux-gnu-as -o t.o t.s
$ clang+llvm-3.9.1-aarch64-linux-gnu/bin/clang -c -o t.o t.s
$ clang+llvm-11.0.0-aarch64-linux-gnu/bin/clang -c -o t.o t.s
Thanks for investigating. The .arch pseudoop it is, then.
I've pushed a change to use that, instead of modifying CFLAGS.
The arm64 branch builds and passes the testsuite on aarch64 and aarch64_be with gcc 10.2 and clang 11.0.1 with and without the optimized assembly routines on my pine64 boards. This is with the .arch directive instead of modifying CFLAGS and the new configure option name --enable-arm64-crypto.
Out of curiosity I've also collected some benchmark numbers for gcm_aes256. (Is that a correct and sensible algorithm for that purpose?)
The speedup from using pmull seems to be around 35% for encrypt/decrypt.
Interestingly, LE is about a cycle per block faster than BE even though it should have quite a few more rev64s to execute than BE. Could this be masked by memory accesses, pipelining or scheduling?
How should the massive speedup in update be interpreted, and the fact that BE is quite a bit faster than LE there? Do I understand correctly that update runs only GCM over unencrypted data for authentication purposes, so that this number really indicates the pure GCM pmull speedup? If so, it would indicate a roughly 19-fold speedup and an 8.6% advantage for BE.
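For reference, the ratios discussed here can be recomputed with a small Python sketch. The numbers are copied from the gcc 10.2 tables further down in this mail; the dictionary layout and the helper name speedup() are only illustrative and not part of nettle or its benchmark tool.

```python
# Figures copied from the gcc 10.2 benchmark tables in this mail.
# "pmull" = built with --enable-arm64-crypto, "plain" = built without it.
update_cycles_per_block = {
    "le": {"pmull": 12.40, "plain": 236.30},
    "be": {"pmull": 11.41, "plain": 222.47},
}
encrypt_mbyte_per_s = {
    "le": {"pmull": 29.42, "plain": 21.43},
    "be": {"pmull": 29.35, "plain": 21.71},
}


def speedup(slow, fast):
    """Return how many times faster 'fast' is, given two cycle counts."""
    return slow / fast


for endian in ("le", "be"):
    upd = update_cycles_per_block[endian]
    enc = encrypt_mbyte_per_s[endian]
    # For throughput (Mbyte/s) a larger value is better, so the ratio flips.
    enc_gain = enc["pmull"] / enc["plain"] - 1.0
    print(f"{endian}: update speedup {speedup(upd['plain'], upd['pmull']):.1f}x, "
          f"encrypt gain {enc_gain:.1%}")

# Relative advantage of BE over LE for update with pmull, in cycles/block.
be_advantage = (update_cycles_per_block["le"]["pmull"]
                / update_cycles_per_block["be"]["pmull"] - 1.0)
print(f"BE advantage on update: {be_advantage:.1%}")
```

Only gcc figures are used here; the clang tables can be substituted in the same way.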
What's also curious is that the system's openssl 1.1.1i is consistently reported as an order of magnitude faster than nettle for encrypt/decrypt. I guess the major factor is that nettle has no optimized AES for aarch64 yet, which openssl seems to have. To check this, I built an openssl 1.1.1i without assembly and ran the last aarch64 benchmark below against it; its results would support that guess.
cat /sys/devices/system/cpu/cpufreq/policy0/scaling_governor
performance
cat /sys/devices/system/cpu/cpufreq/policy0/cpuinfo_max_freq
1152000
LD_LIBRARY_PATH=../.lib ./nettle-benchmark -f 1.152e9 gcm_aes256
         Algorithm         mode  Mbyte/s cycles/byte cycles/block

aarch64-le gcc 10.2 with arm64-crypto:
        gcm_aes256      encrypt    29.42       37.34       597.41
        gcm_aes256      decrypt    29.43       37.34       597.36
        gcm_aes256       update  1417.32        0.78        12.40

openssl gcm_aes256      encrypt   391.93        2.80        44.85
openssl gcm_aes256      decrypt   392.35        2.80        44.80
openssl gcm_aes256       update  1246.04        0.88        14.11

aarch64-be gcc 10.2 with arm64-crypto:
        gcm_aes256      encrypt    29.35       37.43       598.82
        gcm_aes256      decrypt    29.36       37.42       598.77
        gcm_aes256       update  1540.34        0.71        11.41

openssl gcm_aes256      encrypt   398.96        2.75        44.06
openssl gcm_aes256      decrypt   397.66        2.76        44.20
openssl gcm_aes256       update  1306.05        0.84        13.46

aarch64-le clang 11.0.1 with arm64-crypto:
        gcm_aes256      encrypt    28.76       38.20       611.15
        gcm_aes256      decrypt    28.76       38.19       611.10
        gcm_aes256       update  1416.17        0.78        12.41

openssl gcm_aes256      encrypt   392.32        2.80        44.81
openssl gcm_aes256      decrypt   392.35        2.80        44.80
openssl gcm_aes256       update  1247.72        0.88        14.09

aarch64-be clang 11.0.1 with arm64-crypto:
        gcm_aes256      encrypt    28.70       38.28       612.53
        gcm_aes256      decrypt    28.69       38.29       612.59
        gcm_aes256       update  1543.87        0.71        11.39

openssl gcm_aes256      encrypt   399.46        2.75        44.00
openssl gcm_aes256      decrypt   398.90        2.75        44.07
openssl gcm_aes256       update  1317.87        0.83        13.34
aarch64-le gcc 10.2 without arm64-crypto:
        gcm_aes256      encrypt    21.43       51.27       820.28
        gcm_aes256      decrypt    21.43       51.27       820.30
        gcm_aes256       update    74.39       14.77       236.30

openssl gcm_aes256      encrypt   391.93        2.80        44.85
openssl gcm_aes256      decrypt   392.17        2.80        44.82
openssl gcm_aes256       update  1245.13        0.88        14.12

aarch64-be gcc 10.2 without arm64-crypto:
        gcm_aes256      encrypt    21.71       50.60       809.58
        gcm_aes256      decrypt    21.72       50.59       809.43
        gcm_aes256       update    79.01       13.90       222.47

openssl gcm_aes256      encrypt   398.43        2.76        44.12
openssl gcm_aes256      decrypt   398.67        2.76        44.09
openssl gcm_aes256       update  1309.52        0.84        13.42

aarch64-le clang 11.0.1 without arm64-crypto:
        gcm_aes256      encrypt    18.98       57.89       926.29
        gcm_aes256      decrypt    18.98       57.89       926.22
        gcm_aes256       update    53.67       20.47       327.53

openssl gcm_aes256      encrypt   392.16        2.80        44.82
openssl gcm_aes256      decrypt   392.17        2.80        44.82
openssl gcm_aes256       update  1248.30        0.88        14.08

aarch64-be clang 11.0.1 without arm64-crypto:
        gcm_aes256      encrypt    18.89       58.16       930.49
        gcm_aes256      decrypt    18.85       58.28       932.54
        gcm_aes256       update    53.67       20.47       327.53

openssl gcm_aes256      encrypt   399.36        2.75        44.02
openssl gcm_aes256      decrypt   398.87        2.75        44.07
openssl gcm_aes256       update  1318.44        0.83        13.33
aarch64-be gcc 10.2 without arm64-crypto and with no-asm openssl:
LD_LIBRARY_PATH=../../openssl-1.1.1i:../.lib ./nettle-benchmark -f 1.152e9 gcm_aes256

         Algorithm         mode  Mbyte/s cycles/byte cycles/block

        gcm_aes256      encrypt    21.72       50.59       809.43
        gcm_aes256      decrypt    21.72       50.59       809.45
        gcm_aes256       update    79.02       13.90       222.45

openssl gcm_aes256      encrypt    21.06       52.17       834.70
openssl gcm_aes256      decrypt    21.34       51.49       823.82
openssl gcm_aes256       update    56.18       19.55       312.87
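As a rough cross-check of the tables above, the cycle columns can be reproduced from the Mbyte/s column and the frequency passed with -f. The sketch below assumes that "Mbyte" means 2**20 bytes and that a block is 16 bytes; both assumptions are inferred from the figures in this mail, not taken from the nettle-benchmark source, and the function names are made up for illustration.

```python
MIB = 2 ** 20      # assumption: one "Mbyte" in the tables is 2**20 bytes
BLOCK_SIZE = 16    # assumption: AES/GCM block size in bytes


def cycles_per_byte(mbyte_per_s, freq_hz):
    """Convert a throughput in Mbyte/s into CPU cycles spent per byte."""
    return freq_hz / (mbyte_per_s * MIB)


def cycles_per_block(mbyte_per_s, freq_hz, block_size=BLOCK_SIZE):
    """Convert a throughput in Mbyte/s into CPU cycles spent per block."""
    return cycles_per_byte(mbyte_per_s, freq_hz) * block_size


freq = 1.152e9  # value given to nettle-benchmark via -f on the pine64

# Throughputs copied from the aarch64-le gcc 10.2 table with arm64-crypto.
for mode, rate in [("encrypt", 29.42), ("update", 1417.32)]:
    print(f"{mode}: {cycles_per_byte(rate, freq):.2f} cycles/byte, "
          f"{cycles_per_block(rate, freq):.2f} cycles/block")
```

Small deviations from the printed cycles/block values are expected, since the Mbyte/s column is itself rounded to two decimals.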
x86_64 Intel Skylake laptop gcc 10.2 fat as sanity check:
NETTLE_FAT_VERBOSE=1 LD_LIBRARY_PATH=../.lib ./nettle-benchmark -f 4.6e9 aes256
libnettle: fat library initialization.
libnettle: cpu features: vendor:intel,aesni
libnettle: using aes instructions.
libnettle: not using sha_ni instructions.
libnettle: intel SSE2 will be used for memxor.
sha1_compress: 209.50 cycles
salsa20_core: 205.70 cycles
sha3_permute: 918.50 cycles (38.27 / round)

         Algorithm         mode  Mbyte/s cycles/byte cycles/block

            aes256  ECB encrypt  4856.60        0.90        14.45
            aes256  ECB decrypt  4800.03        0.91        14.62
            aes256  CBC encrypt   889.91        4.93        78.87
            aes256  CBC decrypt  4331.24        1.01        16.21
            aes256   (in-place)  3516.29        1.25        19.96
            aes256          CTR  3131.58        1.40        22.41
            aes256   (in-place)  2826.07        1.55        24.84

    openssl aes256  ECB encrypt  4840.40        0.91        14.50
    openssl aes256  ECB decrypt  4835.88        0.91        14.51

        gcm_aes256      encrypt   585.60        7.49       119.86
        gcm_aes256      decrypt   585.29        7.50       119.92
        gcm_aes256       update   697.69        6.29       100.60

openssl gcm_aes256      encrypt  4499.49        0.97        15.60
openssl gcm_aes256      decrypt  4498.84        0.98        15.60
openssl gcm_aes256       update 11383.81        0.39         6.17
Just out of curiosity: I assume there's no aesni-pmull-like GCM implementation for x86_64?