Re: ppc64: v2, AES/GCM Performance improvement with stitched implementation

7 Dec 2023


Hi Niels,
Here is version 2 of the AES/GCM stitched patch.  The stitched code is all in assembly, using m4 macros.  Overall performance improved by around 110% for encrypt and 120% for decrypt.  Please see the attached patch and AES benchmark.
Thanks.
-Danny
...
On Nov 22, 2023, at 2:27 AM, Niels Möller nisse@lysator.liu.se wrote:
Danny Tsen dtsen@us.ibm.com writes:
...
Interleaving at the instruction level may be a good option, but due to
the PPC instruction pipeline this may need sufficient
registers/vectors. Using the same vectors to change contents in
successive instructions may require more cycles. In that case, more
vectors/scalars will get involved and all vector assignments may have
to change. That's the reason I avoided it in this case.
To investigate the potential, I would suggest some experiments with
software pipelining.
Write a loop to do 4 blocks of ctr-aes128 at a time, fully unrolling the
round loop. I think that should be 44 instructions of aes mangling, plus
instructions to setup the counter input, and do the final xor and
endianness things with the message. Arrange so that it loads the AES
state in a set of registers we can call A, operating in-place on these
registers. But at the end, arrange the XORing so that the final
cryptotext is located in a different set of registers, B.
Then, write the instructions to do ghash using the B registers as input,
I think that should be about 20-25 instructions. Interleave those as
well as possible with the AES instructions (say, two aes instructions,
one ghash instruction, etc).
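
As a rough illustration of the suggested instruction mix, the sketch below
builds a 2:1 interleaved schedule from symbolic instruction labels. It is not
PPC assembly. It assumes one instruction per AES round step (an initial
round-key XOR, nine full rounds and one final round, so 11 per block and 44
for four blocks) and picks 22 ghash instructions as an arbitrary value inside
the 20-25 estimate. The names `interleave`, `aes_ops` and `ghash_ops` are
hypothetical.

```python
from itertools import islice

BLOCKS = 4
AES128_STEPS = 11  # assumed: 1 initial XOR + 9 full rounds + 1 final round
GHASH_OPS = 22     # assumed: an arbitrary value within the 20-25 estimate


def interleave(aes_ops, ghash_ops, ratio=2):
    """Merge two instruction streams: `ratio` AES ops, then one ghash op."""
    aes_it, ghash_it = iter(aes_ops), iter(ghash_ops)
    schedule = []
    while True:
        chunk = list(islice(aes_it, ratio))
        g = next(ghash_it, None)
        if not chunk and g is None:
            break  # both streams are exhausted
        schedule.extend(chunk)
        if g is not None:
            schedule.append(g)
    return schedule


# Round-major order: neighbouring AES ops touch different blocks, so they
# do not depend on each other's results.
aes_ops = [f"aes r{r} b{b}" for r in range(AES128_STEPS) for b in range(BLOCKS)]
ghash_ops = [f"ghash {i}" for i in range(GHASH_OPS)]
schedule = interleave(aes_ops, ghash_ops)

print(len(aes_ops), len(schedule))
print(schedule[:6])
```

With these assumed counts the two streams run out at the same time. With a
different ghash count, the leftover instructions of the longer stream simply
end up at the tail of the schedule.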
Software pipelining means that each iteration of the loop does aes-ctr
on four blocks, + ghash on the output for the four *previous* blocks (so
one needs extra code outside of the loop to deal with first and last 4
blocks). Decrypt processing should be simpler.
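
The loop structure described above can be sketched in a few lines of Python.
This is only a model of the control flow, under loud assumptions:
`encrypt_group` and `ghash_update` are hypothetical stand-ins that do simple
32-bit integer mixing, not real AES-CTR or GHASH, and all names are invented
for illustration. The point is that the pipelined loop, with its prologue and
epilogue, produces the same output as the straightforward loop, while the two
calls inside its body are independent of each other and could therefore be
interleaved at the instruction level.

```python
MASK = 0xFFFFFFFF


def encrypt_group(counter_base, plain_group):
    """Stand-in for ctr-aes on one group of blocks (NOT real AES)."""
    return [p ^ (((counter_base + i) * 0x9E3779B1) & MASK)
            for i, p in enumerate(plain_group)]


def ghash_update(acc, cipher_group):
    """Stand-in for ghash over one group of blocks (NOT real GHASH)."""
    for c in cipher_group:
        acc = ((acc ^ c) * 0x01000193) & MASK
    return acc


def straightforward(groups):
    """Encrypt a group, then immediately hash that same group."""
    out, acc = [], 0
    for n, group in enumerate(groups):
        cipher = encrypt_group(4 * n, group)
        acc = ghash_update(acc, cipher)  # depends on the line above
        out.extend(cipher)
    return out, acc


def pipelined(groups):
    """Encrypt group n while hashing the output of group n-1."""
    if not groups:
        return [], 0
    out, acc = [], 0
    prev = encrypt_group(0, groups[0])           # prologue: first group
    for n in range(1, len(groups)):
        cur = encrypt_group(4 * n, groups[n])    # "A" registers, result in "B"
        acc = ghash_update(acc, prev)            # independent of `cur`
        out.extend(prev)
        prev = cur
    acc = ghash_update(acc, prev)                # epilogue: last group
    out.extend(prev)
    return out, acc


groups = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
assert pipelined(groups) == straightforward(groups)
```

For decryption the hash input is the received ciphertext rather than the
cipher output, so the two steps can work on the same group without this
one-iteration delay.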
Then you can benchmark that loop in isolation. It doesn't need to be the
complete function, the handling of first and last blocks can be omitted,
and it doesn't even have to be completely correct, as long as it's the
right instruction mix and the right data dependencies. The benchmark
should give a good idea for the potential speedup, if any, from
instruction-level interleaving.
I would hope 4-way is doable with available vector registers (and this
inner loop should be less than 100 instructions, so not too
unmanageable). Going up to 8-way (like the current AES code) would also
be interesting, but as you say, you might have a shortage of registers.
If you have to copy state between registers and memory in each iteration
of an 8-way loop (which it looks like you also have to do in your
current patch), that overhead cost may outweigh the gains you have from
more independence in the AES rounds.
Regards,
/Niels
--
Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
Internet email is subject to wholesale government surveillance.
