On Thu, Apr 1, 2021 at 5:21 PM Niels Möller <nisse@lysator.liu.se> wrote:
nisse@lysator.liu.se (Niels Möller) writes:
(iii) I've considered doing it earlier, to make it easier to implement AES without a round loop (like for all current versions of aes-encrypt-internal.*). E.g., on x86_64, for aes128 we could load all subkeys into registers and still have registers left to do two or more blocks in parallel, but then we'd need to override aes128_encrypt separately from the other aes*_encrypt.
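For concreteness, the register-pressure argument looks roughly like this in C intrinsics (only a sketch of the idea, not code from any patch; the function name is made up). With the constant-trip-count round loop fully unrolled, all eleven aes128 subkeys fit in xmm registers, with registers to spare for two block states:

#include <wmmintrin.h>  /* AES-NI intrinsics */

/* Sketch only: encrypt two blocks in parallel, keeping the full
   aes128 key schedule (11 round keys) in xmm registers once the
   compiler unrolls the constant-trip-count loop. */
static void
aes128_encrypt_2blocks(const __m128i rk[11], __m128i *b0, __m128i *b1)
{
  __m128i x0 = _mm_xor_si128(*b0, rk[0]);  /* initial key whitening */
  __m128i x1 = _mm_xor_si128(*b1, rk[0]);
  int i;
  for (i = 1; i < 10; i++)
    {
      /* Interleaving two independent aesenc chains hides the
         latency of the aesenc instruction. */
      x0 = _mm_aesenc_si128(x0, rk[i]);
      x1 = _mm_aesenc_si128(x1, rk[i]);
    }
  *b0 = _mm_aesenclast_si128(x0, rk[10]);
  *b1 = _mm_aesenclast_si128(x1, rk[10]);
}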
I've given this a try, see the experimental patch below. It adds an x86_64/aesni/aes128-encrypt.asm with a 2-way loop. It gives a very modest speedup, 5%, when I benchmark on my laptop (which is now a pretty fast machine, an AMD Ryzen 5). I've also added a cbc-aes128-encrypt.asm. That gives a more significant speedup, almost 60%. I think the main reason for the speedup is that we avoid reloading subkeys between blocks.
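The CBC case, again as an intrinsics sketch rather than the actual .asm (the function name and signature here are hypothetical): the chaining dependency forces one block at a time, but hoisting the round keys out of the block loop is exactly what saves the per-block reloads.

#include <stddef.h>
#include <stdint.h>
#include <wmmintrin.h>

/* Sketch only: CBC encrypt is inherently sequential (each plaintext
   block is XORed with the previous ciphertext before encryption), so
   there is no 2-way trick, but the round keys stay in registers
   across the whole block loop. */
static void
cbc_aes128_encrypt_sketch(const __m128i rk[11], __m128i *iv,
                          size_t blocks, uint8_t *dst, const uint8_t *src)
{
  __m128i x = *iv;  /* previous ciphertext block, IV initially */
  for (; blocks > 0; blocks--, src += 16, dst += 16)
    {
      x = _mm_xor_si128(x, _mm_loadu_si128((const __m128i *) src));
      x = _mm_xor_si128(x, rk[0]);
      for (int i = 1; i < 10; i++)
        x = _mm_aesenc_si128(x, rk[i]);
      x = _mm_aesenclast_si128(x, rk[10]);
      _mm_storeu_si128((__m128i *) dst, x);
    }
  *iv = x;  /* chain across calls */
}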
If we want to go this way, I wonder how to do it without an explosion of files and functions. For s390x, it seems each function will be very small, but not so for most other archs. There are at least three modes that have to process blocks sequentially, with no parallelism between blocks: CBC encrypt, CMAC, and XTS (there may be more). It's not so nice if we need (modes × ciphers) assembly files, with lots of duplication.
One option is a core function for AES-CBC mode, say cbc_aes_encrypt, that backs cbc_aes128_encrypt, cbc_aes192_encrypt, and cbc_aes256_encrypt. We could then optimize cbc_aes_encrypt in assembly, taking care of the rounds parameter inside the implementation. Still, I prefer duplicating files and functions for the AES variants with different round counts rather than going with this approach, as I can't think of any other solution.
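Schematically, something like the following (the struct layout and _AES128_ROUNDS constant mirror Nettle's aes-internal.h as I understand it; the exact signatures are assumptions for illustration, not the real API):

#include <stddef.h>
#include <stdint.h>

/* Schematic type/constant, modeled on Nettle's aes-internal.h. */
#define _AES128_ROUNDS 10
struct aes128_ctx { uint32_t keys[4 * (_AES128_ROUNDS + 1)]; };

/* One core routine parameterized on the round count; this would be
   the single function implemented in assembly. */
void
cbc_aes_encrypt(unsigned rounds, const uint32_t *keys,
                uint8_t *iv, size_t length, uint8_t *dst, const uint8_t *src);

/* The per-key-size entry points then reduce to trivial wrappers. */
void
cbc_aes128_encrypt(const struct aes128_ctx *ctx, uint8_t *iv,
                   size_t length, uint8_t *dst, const uint8_t *src)
{
  cbc_aes_encrypt(_AES128_ROUNDS, ctx->keys, iv, length, dst, src);
}

The downside is that an assembly core handling a variable round count can no longer pin the complete key schedule to fixed registers the way a dedicated aes128 routine can, which is part of why I lean towards the duplicated per-key-size files.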