On Tue, Jun 1, 2021 at 11:21 PM Christopher M. Riedl <cmr@linux.ibm.com> wrote:
On Thu May 20, 2021 at 3:59 PM EDT, Maamoun TK wrote:
On Thu, May 20, 2021 at 10:06 PM Niels Möller <nisse@lysator.liu.se> wrote:
"Christopher M. Riedl" <cmr@linux.ibm.com> writes:
So in total, if we assume an ideal (but impossible) zero-cost version for memxor, memxor3, and gcm_fill, and avoid permutes via ISA 3.0 vector load/stores, we can only account for 11.82 cycles/block, leaving 4.97 cycles/block as an additional benefit of the combined implementation.
One hypothesis for that gain is that we can avoid storing the aes input in memory at all, and instead generate the counter values on the fly in the appropriate registers.
Another potential overhead is that data is stored to memory when passed between these functions. It seems we store a block 3 times and load a block 4 times (the additional accesses should be cache friendly, but will still cost some execution resources). Optimizing that seems to need some kind of combined function. But maybe it is sufficient to optimize something a bit more general than aes gcm, e.g., aes ctr?
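For reference, plain aes ctr today means going through the generic ctr_crypt() entry point with a cipher function pointer, roughly like the sketch below (illustrative only; the encrypt_ctr wrapper and its parameters are made up for the example, the ctr_crypt/aes128 calls are the normal public API):

  #include <nettle/aes.h>
  #include <nettle/ctr.h>

  /* Illustrative sketch: plain AES-128 CTR through the generic
     ctr_crypt() path.  An aes-specific ctr routine would replace
     this generic, function-pointer-based call. */
  static void
  encrypt_ctr (const uint8_t *key, uint8_t *ctr,
               size_t length, uint8_t *dst, const uint8_t *src)
  {
    struct aes128_ctx ctx;
    aes128_set_encrypt_key (&ctx, key);   /* key is AES128_KEY_SIZE bytes */
    ctr_crypt (&ctx, (nettle_cipher_func *) aes128_encrypt,
               AES_BLOCK_SIZE, ctr, length, dst, src);
  }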
This would basically have to replace the nettle_crypt16 function call with arch-specific assembly, right? I can code this up and try it out in the context of AES-GCM.
Yes, something like that. If we leave the _nettle_gcm_hash unchanged (with its own independent assembly implementation), and look at gcm_encrypt, what we have is essentially the call

  _nettle_ctr_crypt16(cipher, f, gcm_fill, ctx->ctr.b, length, dst, src);

where cipher and f are gcm_encrypt's const void *cipher and nettle_cipher_func *f arguments.
It would be nice if we could replace that with a call to aes_ctr_crypt, and then optimizing that would benefit both gcm and plain ctr. But it's not quite that easy, because gcm unfortunately uses its own variant of ctr mode, which is why we need to pass the gcm_fill function in the first place.
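Concretely, gcm's ctr variant increments only the last 32 bits of the 16-byte counter block (big-endian), while plain ctr_crypt treats the whole block as a big-endian counter. A simplified sketch of what the gcm_fill callback does is below (not the exact gcm.c code; the real callback is a nettle_fill16_func working on union nettle_block16 buffers, and the name gcm_style_fill is made up here):

  #include <stdint.h>
  #include <stddef.h>
  #include <string.h>

  /* Simplified sketch of GCM-style counter filling: keep the first 12
     bytes of the counter block fixed and put an incrementing 32-bit
     big-endian counter in the last 4 bytes of each 16-byte block. */
  static void
  gcm_style_fill (uint8_t *ctr, size_t blocks, uint8_t *buffer)
  {
    uint32_t c = ((uint32_t) ctr[12] << 24) | ((uint32_t) ctr[13] << 16)
               | ((uint32_t) ctr[14] << 8) | (uint32_t) ctr[15];

    for (; blocks-- > 0; buffer += 16, c++)
      {
        memcpy (buffer, ctr, 12);          /* fixed IV part */
        buffer[12] = c >> 24; buffer[13] = c >> 16;
        buffer[14] = c >> 8;  buffer[15] = c;
      }
    /* Write the advanced counter back for the next call. */
    ctr[12] = c >> 24; ctr[13] = c >> 16; ctr[14] = c >> 8; ctr[15] = c;
  }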
So it seems we need separate assembly for aes_plain_ctr and aes_gcm_ctr (they *might* still share some code, but they would be distinct entry points).
Say we call the gcm-specific ctr function from some variant of gcm_encrypt via a different function pointer. Then that gcm_encrypt variant is getting a bit pointless. Maybe it's better to do
  void
  aes128_gcm_encrypt(...)
  {
    _nettle_aes128_gcm_ctr(...);
    _nettle_gcm_hash(...);
  }
At least, we avoid duplicating the _gcm_hash for aes128, aes192, aes256 (and any other algorithms we might want to optimize in a similar way). And each of the aes assembly routines should be fairly small and easy to maintain.
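To make the shape concrete, a rough in-tree sketch of that split follows. The _nettle_aes128_gcm_ctr prototype is hypothetical, modeled on the _nettle_ctr_crypt16 call quoted above, and the context field names assume the existing GCM_CTX layout (key/gcm/cipher); only _nettle_gcm_hash exists in roughly this form today:

  /* Sketch only, not existing code. */
  #include "gcm.h"
  #include "gcm-internal.h"   /* _nettle_gcm_hash */

  /* Hypothetical arch-specific routine: AES-128 keystream in GCM's
     ctr variant, xored into dst.  Name and signature are guesses. */
  void
  _nettle_aes128_gcm_ctr (const struct aes128_ctx *cipher, uint8_t *ctr,
                          size_t length, uint8_t *dst, const uint8_t *src);

  void
  aes128_gcm_encrypt (struct gcm_aes128_ctx *ctx,
                      size_t length, uint8_t *dst, const uint8_t *src)
  {
    /* Counter-mode part, specialized for aes128. */
    _nettle_aes128_gcm_ctr (&ctx->cipher, ctx->gcm.ctr.b, length, dst, src);
    /* Hashing part, shared between aes128/192/256 and other ciphers. */
    _nettle_gcm_hash (&ctx->key, &ctx->gcm.x, length, dst);
    ctx->gcm.data_size += length;
  }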
While writing the white paper "Optimize AES-GCM for PowerPC architecture processors", I concluded that this is the best approach for the PowerPC architecture: it is easy to maintain, avoids duplication, and performs well. I've separated aes_gcm encrypt/decrypt into two functions, aes_ctr and ghash, both implemented using Power ISA v3.00 assisted with vector-scalar registers. I got 1.18 cycles/byte for gcm-aes-128 encrypt/decrypt, 1.31 cycles/byte for gcm-aes-192 encrypt/decrypt, and 1.44 cycles/byte for gcm-aes-256 encrypt/decrypt.
Neat, did you base that on the aes-gcm combined series I posted here or completely different/new code?
It's based on new code written to fit the paper context.