"Christopher M. Riedl" cmr@linux.ibm.com writes:
So in total, if we assume an ideal (but impossible) zero-cost version of memxor, memxor3, and gcm_fill, and avoid permutes via ISA 3.0 vector loads/stores, we can only account for 11.82 cycles/block; leaving 4.97 cycles/block as an additional benefit of the combined implementation.
One hypothesis for that gain is that we can avoid storing the aes input in memory at all; instead, generate the counter values on the fly in the appropriate registers.
Another potential overhead is that data is stored to memory when passed between these functions. It seems we store a block 3 times, and load a block 4 times (the additional accesses should be cache friendly, but will still cost some execution resources). Optimizing that seems to need some kind of combined function. But maybe it is sufficient to optimize something a bit more general than aes gcm, e.g., aes ctr?
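For reference, the plain C gcm_fill that materializes those cipher inputs in memory is roughly this (quoting gcm.c from memory, so details may be slightly off):

  /* Writes `blocks` consecutive counter values to `buffer`; the cipher
     then has to load them back, which is one of the memory round trips
     a combined implementation could avoid. */
  static void
  gcm_fill (uint8_t *ctr, size_t blocks, union nettle_block16 *buffer)
  {
    uint32_t c;

    c = READ_UINT32(ctr + GCM_BLOCK_SIZE - 4);

    for (; blocks-- > 0; buffer++, c++)
      {
        memcpy (buffer->b, ctr, GCM_BLOCK_SIZE - 4);
        WRITE_UINT32(buffer->b + GCM_BLOCK_SIZE - 4, c);
      }

    WRITE_UINT32(ctr + GCM_BLOCK_SIZE - 4, c);
  }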
This would basically have to replace the _nettle_ctr_crypt16 function call with arch-specific assembly, right? I can code this up and try it out in the context of AES-GCM.
Yes, something like that. If we leave the _nettle_gcm_hash unchanged (with its own independent assembly implementation), and look at gcm_encrypt, what we have is
  void
  gcm_encrypt (struct gcm_ctx *ctx, const struct gcm_key *key,
               const void *cipher, nettle_cipher_func *f,
               size_t length, uint8_t *dst, const uint8_t *src)
  {
    assert(ctx->data_size % GCM_BLOCK_SIZE == 0);

    _nettle_ctr_crypt16(cipher, f, gcm_fill, ctx->ctr.b, length, dst, src);
    _nettle_gcm_hash(key, &ctx->x, length, dst);

    ctx->data_size += length;
  }
It would be nice if we could replace that with a call to aes_ctr_crypt, and then optimizing that would benefit both gcm and plain ctr. But it's not quite that easy, because gcm unfortunately uses its own variant of ctr mode, which is why we need to pass the gcm_fill function in the first place.
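Concretely, the difference is just the carry rule. A minimal sketch (not nettle's actual code): plain ctr propagates the carry across the whole 16-byte block, while gcm increments only the low 32 bits, mod 2^32.

  /* Sketch only.  Plain ctr mode: the whole block is one big-endian
     counter, carries propagate through all 16 bytes. */
  static void
  ctr_increment (uint8_t block[16])
  {
    unsigned i = 16;
    while (i-- > 0 && ++block[i] == 0)
      ;
  }

  /* gcm's variant: only the last four bytes count, wrapping mod 2^32,
     while the first twelve bytes stay fixed. */
  static void
  gcm_increment (uint8_t block[16])
  {
    unsigned i = 16;
    while (i-- > 12 && ++block[i] == 0)
      ;
  }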
So it seems we need separate assembly for aes_plain_ctr and aes_gcm_ctr (they *might* still share some code, but they would be distinct entry points). Say we call the gcm-specific ctr function from some variant of gcm_encrypt via a different function pointer. Then that gcm_encrypt variant is getting a bit pointless. Maybe it's better to do
  void
  aes128_gcm_encrypt(...)
  {
    _nettle_aes128_gcm_ctr(...);
    _nettle_gcm_hash(...);
  }
At least, we avoid duplicating the _gcm_hash for aes128, aes192, aes256 (and any other algorithms we might want to optimize in a similar way). And each of the aes assembly routines should be fairly small and easy to maintain.
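With parameter lists spelled out (the signature of _nettle_aes128_gcm_ctr is made up here; it doesn't exist yet), that might look like:

  /* Hypothetical sketch; field names as in struct gcm_aes128_ctx. */
  void
  aes128_gcm_encrypt (struct gcm_aes128_ctx *ctx,
                      size_t length, uint8_t *dst, const uint8_t *src)
  {
    /* gcm's ctr variant, keyed with aes128; one assembly entry point. */
    _nettle_aes128_gcm_ctr (&ctx->cipher, ctx->gcm.ctr.b, length, dst, src);
    /* Shared ghash update, the same for aes192, aes256, etc. */
    _nettle_gcm_hash (&ctx->key, &ctx->gcm.x, length, dst);
    ctx->gcm.data_size += length;
  }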
I wonder if there are any reasonable alternatives with similar performance? One idea that occurs to me is to replace the role of the gcm_fill function (and the nettle_fill16_func type) with an arch-specific, assembly-only hook interface that gets inputs in specified registers, and is expected to produce the next cipher input in registers.
We could then have an aes128_any_encrypt that takes the same args as aes128_encrypt, plus a pointer to such a magic assembly function.
The aes128_any_encrypt assembly would then put the required input in the right registers (address of the clear text, current counter block, previous ciphertext block, etc.) and have a loop where each iteration calls the hook, and encrypts a block from registers.
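At the C level, the visible part might be declared along these lines (all names made up); the hook itself gets no C-callable type, since its calling convention would be defined only at the assembly level:

  /* Hypothetical.  The hook is opaque to C: it receives its inputs and
     returns the next cipher input in fixed registers. */
  struct nettle_asm_fill;   /* incomplete type, never defined in C */

  void
  aes128_any_encrypt (const struct aes128_ctx *ctx,
                      const struct nettle_asm_fill *fill,
                      size_t length, uint8_t *dst, const uint8_t *src);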
But I'm afraid it's not going to be so easy, given that where possible (i.e., all modes but cbc encrypt) we would like to have the option to do multiple blocks in parallel. Perhaps better to have an assembly interface to functions doing ECB on one block, two blocks, three blocks (if there are enough registers), etc., in registers, and call that from the other assembly functions. A bit like the recent chacha_Ncore functions, but with input and output in registers rather than stored in memory.
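As declarations, the memory-based analogue of those helpers would be something like this (names made up); the register-based versions I have in mind would have no C prototypes at all:

  /* Hypothetical.  One entry point per block count; the register-based
     variants would pass the blocks in vector registers instead of
     through these memory operands. */
  void _nettle_aes128_ecb1 (const struct aes128_ctx *ctx, union nettle_block16 *b);
  void _nettle_aes128_ecb2 (const struct aes128_ctx *ctx, union nettle_block16 b[2]);
  void _nettle_aes128_ecb4 (const struct aes128_ctx *ctx, union nettle_block16 b[4]);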
Regards,
/Niels