On Tue, Jun 1, 2021 at 11:21 PM Christopher M. Riedl <cmr@linux.ibm.com> wrote:
On Thu May 20, 2021 at 3:59 PM EDT, Maamoun TK wrote:
On Thu, May 20, 2021 at 10:06 PM Niels Möller <nisse@lysator.liu.se> wrote:
"Christopher M. Riedl" <cmr@linux.ibm.com> writes:
So in total, if we assume an ideal (but impossible) zero-cost version for memxor, memxor3, and gcm_fill, and avoid permutes via ISA 3.0 vector load/stores, we can only account for 11.82 cycles/block, leaving 4.97 cycles/block as an additional benefit of the combined implementation.
One hypothesis for that gain is that we can avoid storing the aes input in memory at all, and instead generate the counter values on the fly in the appropriate registers.
Another potential overhead is that data is stored to memory when passed between these functions. It seems we store a block 3 times and load a block 4 times (the additional accesses should be cache friendly, but will still cost some execution resources). Optimizing that seems to need some kind of combined function. But maybe it is sufficient to optimize something a bit more general than aes gcm, e.g., aes ctr?
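For reference, plain aes ctr today means going through the generic ctr_crypt() entry point with a cipher function pointer, roughly like the sketch below (illustrative only; the encrypt_ctr wrapper and its parameters are made up for the example, the ctr_crypt/aes128 calls are the normal public API):

  #include <nettle/aes.h>
  #include <nettle/ctr.h>

  /* Illustrative sketch: plain AES-128 CTR through the generic
     ctr_crypt() path.  An aes-specific ctr routine would replace
     this generic, function-pointer-based call. */
  static void
  encrypt_ctr (const uint8_t *key, uint8_t *ctr,
               size_t length, uint8_t *dst, const uint8_t *src)
  {
    struct aes128_ctx ctx;
    aes128_set_encrypt_key (&ctx, key);   /* key is AES128_KEY_SIZE bytes */
    ctr_crypt (&ctx, (nettle_cipher_func *) aes128_encrypt,
               AES_BLOCK_SIZE, ctr, length, dst, src);
  }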
This would basically have to replace the nettle_crypt16 function call with arch-specific assembly, right? I can code this up and try it out in the context of AES-GCM.
Yes, something like that. If we leave the _nettle_gcm_hash unchanged (with its own independent assembly implementation), and look at gcm_encrypt, what we have is essentially the call

  _nettle_ctr_crypt16(cipher, f, gcm_fill, ctx->ctr.b, length, dst, src);

where cipher and f are gcm_encrypt's const void *cipher and nettle_cipher_func *f arguments.
It would be nice if we could replace that with a call to aes_ctr_crypt, and then optimizing that would benefit both gcm and plain ctr. But it's not quite that easy, because gcm unfortunately uses its own variant of ctr mode, which is why we need to pass the gcm_fill function in the first place.
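Concretely, gcm's ctr variant increments only the last 32 bits of the 16-byte counter block (big-endian), while plain ctr_crypt treats the whole block as a big-endian counter. A simplified sketch of what the gcm_fill callback does is below (not the exact gcm.c code; the real callback is a nettle_fill16_func working on union nettle_block16 buffers, and the name gcm_style_fill is made up here):

  #include <stdint.h>
  #include <stddef.h>
  #include <string.h>

  /* Simplified sketch of GCM-style counter filling: keep the first 12
     bytes of the counter block fixed and put an incrementing 32-bit
     big-endian counter in the last 4 bytes of each 16-byte block. */
  static void
  gcm_style_fill (uint8_t *ctr, size_t blocks, uint8_t *buffer)
  {
    uint32_t c = ((uint32_t) ctr[12] << 24) | ((uint32_t) ctr[13] << 16)
               | ((uint32_t) ctr[14] << 8) | (uint32_t) ctr[15];

    for (; blocks-- > 0; buffer += 16, c++)
      {
        memcpy (buffer, ctr, 12);          /* fixed IV part */
        buffer[12] = c >> 24; buffer[13] = c >> 16;
        buffer[14] = c >> 8;  buffer[15] = c;
      }
    /* Write the advanced counter back for the next call. */
    ctr[12] = c >> 24; ctr[13] = c >> 16; ctr[14] = c >> 8; ctr[15] = c;
  }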
So it seems we need separate assembly for aes_plain_ctr and aes_gcm_ctr (they *might* still share some code, but they would be distinct entry points).
Say we call the gcm-specific ctr function from some variant of gcm_encrypt via a different function pointer. Then that gcm_encrypt variant is getting a bit pointless. Maybe it's better to do
  void
  aes128_gcm_encrypt(...)
  {
    _nettle_aes128_gcm_ctr(...);
    _nettle_gcm_hash(...);
  }
At least, we avoid duplicating the _gcm_hash for aes128, aes192, aes256 (and any other algorithms we might want to optimize in a similar way). And each of the aes assembly routines should be fairly small and easy to maintain.
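To make the shape concrete, a rough in-tree sketch of that split follows. The _nettle_aes128_gcm_ctr prototype is hypothetical, modeled on the _nettle_ctr_crypt16 call quoted above, and the context field names assume the existing GCM_CTX layout (key/gcm/cipher); only _nettle_gcm_hash exists in roughly this form today:

  /* Sketch only, not existing code. */
  #include "gcm.h"
  #include "gcm-internal.h"   /* _nettle_gcm_hash */

  /* Hypothetical arch-specific routine: AES-128 keystream in GCM's
     ctr variant, xored into dst.  Name and signature are guesses. */
  void
  _nettle_aes128_gcm_ctr (const struct aes128_ctx *cipher, uint8_t *ctr,
                          size_t length, uint8_t *dst, const uint8_t *src);

  void
  aes128_gcm_encrypt (struct gcm_aes128_ctx *ctx,
                      size_t length, uint8_t *dst, const uint8_t *src)
  {
    /* Counter-mode part, specialized for aes128. */
    _nettle_aes128_gcm_ctr (&ctx->cipher, ctx->gcm.ctr.b, length, dst, src);
    /* Hashing part, shared between aes128/192/256 and other ciphers. */
    _nettle_gcm_hash (&ctx->key, &ctx->gcm.x, length, dst);
    ctx->gcm.data_size += length;
  }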
While writing the white paper "Optimize AES-GCM for PowerPC architecture processors", I concluded that this is the best approach for the PowerPC architecture: it is easy to maintain, avoids duplication, and performs well. I've separated aes_gcm encrypt/decrypt into two functions, aes_ctr and ghash, both implemented using Power ISA v3.00 assisted with vector-scalar registers. I got 1.18 cycles/byte for gcm-aes-128 encrypt/decrypt, 1.31 cycles/byte for gcm-aes-192 encrypt/decrypt, and 1.44 cycles/byte for gcm-aes-256 encrypt/decrypt.
Neat, did you base that on the aes-gcm combined series I posted here or completely different/new code?
It's based on new code written to fit the paper context.