Michael Weiser <michael.weiser@gmx.de> writes:
The arm64 branch builds and passes the testsuite on aarch64 and aarch64_be with gcc 10.2 and clang 11.0.1 with and without the optimized assembly routines on my pine64 boards. This is with the .arch directive instead of modifying CFLAGS and the new configure option name --enable-arm64-crypto.
Thanks for testing! (My own testing was done with a cross-compiler and user-level qemu.)
Out of curiosity I've also collected some benchmark numbers for gcm_aes256. (Is that a correct and sensible algorithm for that purpose?)
I think that's appropriate for benchmarking gcm_hash, but the "update" numbers are the ones that reflect gcm_hash performance.
The speedup from using pmull seems to be around 35% for encrypt/decrypt.
Interestingly, LE is about a cycle per block faster than BE even though it should have quite a few more rev64s to execute than BE. Could this be masked by memory accesses, pipelining or scheduling?
For the encrypt/decrypt operations, you also run AES (in CTR mode), which works with little-endian data.
How should the massive speedup in update be interpreted, and the fact that BE is quite a bit faster than LE here? Do I understand correctly that on update only GCM is run on unencrypted data for authentication purposes, so that this number really indicates the pure GCM pmull speedup?
That's right, the "update" benchmark runs only the authentication part of gcm, i.e., gcm_hash. That is useful for benchmarking gcm_hash, but probably not so relevant for real-world applications, since I'd expect it's rare to pass large amounts of "associated data" to gcm.
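To make concrete what the "update" path exercises, here is a minimal, unoptimized C sketch of the GHASH computation, using the bit-by-bit multiplication in GF(2^128) described in NIST SP 800-38D. It is an illustration only: the ghash_sketch_* names are hypothetical and are not nettle's functions, and nettle's real gcm_hash uses table lookups or pmull rather than this loop.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Multiply v by x in GF(2^128), in GCM's bit order: shift the block
   right by one bit as a big-endian number, and if a bit falls off the
   end, fold the reduction constant 0xe1 into the first byte. */
void
ghash_sketch_shift(uint8_t v[16])
{
  int carry = v[15] & 1;
  for (int i = 15; i > 0; i--)
    v[i] = (uint8_t) ((v[i] >> 1) | (v[i - 1] << 7));
  v[0] >>= 1;
  if (carry)
    v[0] ^= 0xe1;
}

/* z = x * y in GF(2^128). z may alias x or y. */
void
ghash_sketch_mul(uint8_t z[16], const uint8_t x[16], const uint8_t y[16])
{
  uint8_t v[16];
  uint8_t acc[16] = { 0 };

  memcpy(v, y, 16);
  for (int i = 0; i < 128; i++)
    {
      /* Bit 0 is the most significant bit of byte 0. */
      if (x[i / 8] & (0x80 >> (i % 8)))
        for (int j = 0; j < 16; j++)
          acc[j] ^= v[j];
      ghash_sketch_shift(v);
    }
  memcpy(z, acc, 16);
}

/* Fold data into the hash state: state = (state ^ block) * h for each
   16-byte block. A short final block is treated as zero-padded, so a
   length that is not a multiple of 16 is only valid on the last call. */
void
ghash_sketch_update(uint8_t state[16], const uint8_t h[16],
                    size_t length, const uint8_t *data)
{
  while (length > 0)
    {
      size_t n = length < 16 ? length : 16;
      for (size_t j = 0; j < n; j++)
        state[j] ^= data[j];
      ghash_sketch_mul(state, state, h);
      data += n;
      length -= n;
    }
}
```

Starting from an all-zero state, with h set to the hash subkey, each call folds more data into the state. A complete GCM digest additionally hashes a length block and encrypts the result, which this sketch leaves out.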
What's also curious is that the system's openssl 1.1.1i is consistently reported an order of magnitude faster than nettle. I guess the major factor is that there's no optimized AES for aarch64 yet in nettle which openssl seems to have.
That would be my guess too. And if we look at the update numbers only, the new code appears a bit faster than openssl.
Just out of curiosity: I assume there's no aesni-pmull-like GCM implementation for x86_64?
That's right. There's some assembly code, but using the same algorithm as the C implementation, based on table lookups.
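As a rough illustration of what "based on table lookups" means, here is a sketch of the general 4-bit table technique for multiplying by a fixed hash subkey in GF(2^128). It is not nettle's actual code, and the gcm_table_sketch_* names are hypothetical. To keep it short, the shift by four bits is done as four single-bit shifts; a tuned implementation would more likely use a second small table for that reduction step.

```c
#include <stdint.h>
#include <string.h>

/* m[n] holds n * H, where the 4-bit value n is read as a polynomial:
   bit value 8 is degree 0, 4 is degree 1, 2 is degree 2, 1 is degree 3. */
struct gcm_table_sketch
{
  uint8_t m[16][16];
};

/* Multiply v by x in GF(2^128), in GCM's bit order. */
void
gcm_table_sketch_shift(uint8_t v[16])
{
  int carry = v[15] & 1;
  for (int i = 15; i > 0; i--)
    v[i] = (uint8_t) ((v[i] >> 1) | (v[i - 1] << 7));
  v[0] >>= 1;
  if (carry)
    v[0] ^= 0xe1;
}

/* Precompute the 16 multiples of the hash subkey h. */
void
gcm_table_sketch_init(struct gcm_table_sketch *t, const uint8_t h[16])
{
  memset(t->m[0], 0, 16);
  memcpy(t->m[8], h, 16);
  /* m[4] = h*x, m[2] = h*x^2, m[1] = h*x^3 */
  for (int i = 4; i > 0; i >>= 1)
    {
      memcpy(t->m[i], t->m[2 * i], 16);
      gcm_table_sketch_shift(t->m[i]);
    }
  /* Remaining entries follow by linearity: m[i+j] = m[i] ^ m[j]. */
  for (int i = 2; i < 16; i <<= 1)
    for (int j = 1; j < i; j++)
      for (int k = 0; k < 16; k++)
        t->m[i + j][k] = t->m[i][k] ^ t->m[j][k];
}

/* x = x * H, processing x four bits at a time with Horner's rule,
   starting from the highest-degree nibble (low nibble of byte 15). */
void
gcm_table_sketch_mul(uint8_t x[16], const struct gcm_table_sketch *t)
{
  uint8_t z[16] = { 0 };

  for (int i = 15; i >= 0; i--)
    {
      const uint8_t *lo = t->m[x[i] & 0x0f];
      const uint8_t *hi = t->m[x[i] >> 4];

      for (int s = 0; s < 4; s++)
        gcm_table_sketch_shift(z);
      for (int k = 0; k < 16; k++)
        z[k] ^= lo[k];

      for (int s = 0; s < 4; s++)
        gcm_table_sketch_shift(z);
      for (int k = 0; k < 16; k++)
        z[k] ^= hi[k];
    }
  memcpy(x, z, 16);
}
```

The table is built once per key, after which each block costs 32 lookups plus the shifts, instead of 128 conditional XORs in the bit-by-bit method.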
Regards, /Niels