"Christopher M. Riedl" cmr@linux.ibm.com writes:
So in total, if we assume an ideal (but impossible) zero-cost version of memxor, memxor3, and gcm_fill, and avoid permutes via ISA 3.0 vector loads/stores, we can only account for 11.82 cycles/block; leaving 4.97 cycles/block as an additional benefit of the combined implementation.
One hypothesis for that gain is that we can avoid storing the aes input in memory at all; instead, generate the counter values on the fly in the appropriate registers.
Another potential overhead is that data is stored to memory when passed between these functions. It seems we store a block 3 times, and load a block 4 times (the additional accesses should be cache friendly, but will still cost some execution resources). Optimizing that seems to need some kind of combined function. But maybe it is sufficient to optimize something a bit more general than aes gcm, e.g., aes ctr?
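For reference, the plain C gcm_fill that materializes those cipher inputs in memory is roughly this (quoting gcm.c from memory, so details may be slightly off):

  /* Writes `blocks` consecutive counter values to `buffer`; the cipher
     then has to load them back, which is one of the memory round trips
     a combined implementation could avoid. */
  static void
  gcm_fill (uint8_t *ctr, size_t blocks, union nettle_block16 *buffer)
  {
    uint32_t c;

    c = READ_UINT32(ctr + GCM_BLOCK_SIZE - 4);

    for (; blocks-- > 0; buffer++, c++)
      {
        memcpy (buffer->b, ctr, GCM_BLOCK_SIZE - 4);
        WRITE_UINT32(buffer->b + GCM_BLOCK_SIZE - 4, c);
      }

    WRITE_UINT32(ctr + GCM_BLOCK_SIZE - 4, c);
  }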
This would basically have to replace the _nettle_ctr_crypt16 function call with arch-specific assembly, right? I can code this up and try it out in the context of AES-GCM.
Yes, something like that. If we leave the _nettle_gcm_hash unchanged (with its own independent assembly implementation), and look at gcm_encrypt, what we have is
  void
  gcm_encrypt (struct gcm_ctx *ctx, const struct gcm_key *key,
               const void *cipher, nettle_cipher_func *f,
               size_t length, uint8_t *dst, const uint8_t *src)
  {
    assert(ctx->data_size % GCM_BLOCK_SIZE == 0);

    _nettle_ctr_crypt16(cipher, f, gcm_fill, ctx->ctr.b, length, dst, src);
    _nettle_gcm_hash(key, &ctx->x, length, dst);

    ctx->data_size += length;
  }
It would be nice if we could replace that with a call to aes_ctr_crypt, and then optimizing that would benefit both gcm and plain ctr. But it's not quite that easy, because gcm unfortunately uses its own variant of ctr mode, which is why we need to pass the gcm_fill function in the first place.
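Concretely, the difference is just the carry rule. A minimal sketch (not nettle's actual code): plain ctr propagates the carry across the whole 16-byte block, while gcm increments only the low 32 bits, mod 2^32.

  /* Sketch only.  Plain ctr mode: the whole block is one big-endian
     counter, carries propagate through all 16 bytes. */
  static void
  ctr_increment (uint8_t block[16])
  {
    unsigned i = 16;
    while (i-- > 0 && ++block[i] == 0)
      ;
  }

  /* gcm's variant: only the last four bytes count, wrapping mod 2^32,
     while the first twelve bytes stay fixed. */
  static void
  gcm_increment (uint8_t block[16])
  {
    unsigned i = 16;
    while (i-- > 12 && ++block[i] == 0)
      ;
  }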
So it seems we need separate assembly for aes_plain_ctr and aes_gcm_ctr (they *might* still share some code, but they would be distinct entry points). Say we call the gcm-specific ctr function from some variant of gcm_encrypt via a different function pointer. Then that gcm_encrypt variant is getting a bit pointless. Maybe it's better to do
  void
  aes128_gcm_encrypt(...)
  {
    _nettle_aes128_gcm_ctr(...);
    _nettle_gcm_hash(...);
  }
At least, we avoid duplicating the _gcm_hash for aes128, aes192, aes256 (and any other algorithms we might want to optimize in a similar way). And each of the aes assembly routines should be fairly small and easy to maintain.
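With parameter lists spelled out (the signature of _nettle_aes128_gcm_ctr is made up here; it doesn't exist yet), that might look like:

  /* Hypothetical sketch; field names as in struct gcm_aes128_ctx. */
  void
  aes128_gcm_encrypt (struct gcm_aes128_ctx *ctx,
                      size_t length, uint8_t *dst, const uint8_t *src)
  {
    /* gcm's ctr variant, keyed with aes128; one assembly entry point. */
    _nettle_aes128_gcm_ctr (&ctx->cipher, ctx->gcm.ctr.b, length, dst, src);
    /* Shared ghash update, the same for aes192, aes256, etc. */
    _nettle_gcm_hash (&ctx->key, &ctx->gcm.x, length, dst);
    ctx->gcm.data_size += length;
  }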
I wonder if there are any reasonable alternatives with similar performance? One idea that occurs to me is to replace the role of the gcm_fill function (and the nettle_fill16_func type) with an arch-specific, assembly-only hook interface that gets inputs in specified registers, and is expected to produce the next cipher input in registers.
We could then have an aes128_any_encrypt that takes the same args as aes128_encrypt, plus a pointer to such a magic assembly function.
The aes128_any_encrypt assembly would then put the required input in the right registers (address of the clear text, current counter block, previous ciphertext block, etc.) and have a loop where each iteration calls the hook, and encrypts a block from registers.
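At the C level, the visible part might be declared along these lines (all names made up); the hook itself gets no C-callable type, since its calling convention would be defined only at the assembly level:

  /* Hypothetical.  The hook is opaque to C: it receives its inputs and
     returns the next cipher input in fixed registers. */
  struct nettle_asm_fill;   /* incomplete type, never defined in C */

  void
  aes128_any_encrypt (const struct aes128_ctx *ctx,
                      const struct nettle_asm_fill *fill,
                      size_t length, uint8_t *dst, const uint8_t *src);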
But I'm afraid it's not going to be so easy, given that where possible (i.e., all modes but cbc encrypt) we would like to have the option to do multiple blocks in parallel. Perhaps better to have an assembly interface to functions doing ECB on one block, two blocks, three blocks (if there are enough registers), etc., in registers, and call that from the other assembly functions. A bit like the recent chacha_Ncore functions, but with input and output in registers rather than stored in memory.
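As declarations, the memory-based analogue of those helpers would be something like this (names made up); the register-based versions I have in mind would have no C prototypes at all:

  /* Hypothetical.  One entry point per block count; the register-based
     variants would pass the blocks in vector registers instead of
     through these memory operands. */
  void _nettle_aes128_ecb1 (const struct aes128_ctx *ctx, union nettle_block16 *b);
  void _nettle_aes128_ecb2 (const struct aes128_ctx *ctx, union nettle_block16 b[2]);
  void _nettle_aes128_ecb4 (const struct aes128_ctx *ctx, union nettle_block16 b[4]);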
Regards,
/Niels