Hi Niels,
Here is version 2 of the stitched AES/GCM patch. The stitched code is now written entirely in assembly, using m4 macros. Overall performance improved by around 110% for encrypt and 120% for decrypt. Please see the attached patch and AES benchmark.
Thanks.
-Danny
> On Nov 22, 2023, at 2:27 AM, Niels Möller <nisse(a)lysator.liu.se> wrote:
>
> Danny Tsen <dtsen(a)us.ibm.com> writes:
>
>> Interleaving at the instruction level may be a good option, but because
>> of the PPC instruction pipeline it needs sufficient registers/vectors.
>> Reusing the same vectors in successive instructions may cost more
>> cycles. In that case, more vectors/scalars would be involved, and all
>> the vector assignments might have to change. That’s the reason I
>> avoided it in this case.
>
> To investigate the potential, I would suggest some experiments with
> software pipelining.
>
> Write a loop to do 4 blocks of ctr-aes128 at a time, fully unrolling the
> round loop. I think that should be 44 instructions of aes mangling, plus
> instructions to setup the counter input, and do the final xor and
> endianness things with the message. Arrange so that it loads the AES
> state in a set of registers we can call A, operating in-place on these
> registers. But at the end, arrange the XORing so that the final
> cryptotext is located in a different set of registers, B.
>
> Then, write the instructions to do ghash using the B registers as input,
> I think that should be about 20-25 instructions. Interleave those as
> well as possible with the AES instructions (say, two aes instructions,
> one ghash instruction, etc).
>
> Software pipelining means that each iteration of the loop does aes-ctr
> on four blocks, + ghash on the output for the four *previous* blocks (so
> one needs extra code outside of the loop to deal with first and last 4
> blocks). Decrypt processing should be simpler.
>
> Then you can benchmark that loop in isolation. It doesn't need to be the
> complete function, the handling of first and last blocks can be omitted,
> and it doesn't even have to be completely correct, as long as it's the
> right instruction mix and the right data dependencies. The benchmark
> should give a good idea for the potential speedup, if any, from
> instruction-level interleaving.
>
> I would hope 4-way is doable with available vector registers (and this
> inner loop should be less than 100 instructions, so not too
> unmanageable). Going up to 8-way (like the current AES code) would also
> be interesting, but as you say, you might have a shortage of registers.
> If you have to copy state between registers and memory in each iteration
> of an 8-way loop (which it looks like you also have to do in your
> current patch), that overhead cost may outweigh the gains you have from
> more independence in the AES rounds.
>
> Regards,
> /Niels
>
> --
> Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677.
> Internet email is subject to wholesale government surveillance.
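
For reference, the software-pipelined loop shape described in the quoted message (AES-CTR on the current four blocks, GHASH on the *previous* four, with a prologue and epilogue outside the loop) can be sketched in C roughly as below. This is only an illustration of the loop structure, not the patch's code: toy_encrypt and toy_hash are hypothetical stand-ins that do no real AES or GF(2^128) arithmetic, 64-bit words stand in for 128-bit blocks, and the arrays a and b play the role of the A and B register sets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WAY 4 /* blocks per iteration, as in the 4-way suggestion */

/* Hypothetical stand-in for AES-CTR on one block. NOT real AES. */
uint64_t toy_encrypt(uint64_t counter, uint64_t plain)
{
  return plain ^ (counter * 0x9E3779B97F4A7C15ULL);
}

/* Hypothetical stand-in for one GHASH step. NOT a real GF(2^128) multiply. */
uint64_t toy_hash(uint64_t digest, uint64_t cipher)
{
  return (digest ^ cipher) * 0x100000001B3ULL;
}

/* Reference loop: each iteration encrypts a group and hashes that same
   group, so every hash step waits on the encryption just before it. */
uint64_t stitched_plain(uint64_t *dst, const uint64_t *src, size_t n,
                        uint64_t ctr)
{
  uint64_t digest = 0;
  assert(n % WAY == 0);
  for (size_t i = 0; i < n; i += WAY)
    for (size_t j = 0; j < WAY; j++) {
      dst[i + j] = toy_encrypt(ctr++, src[i + j]);
      digest = toy_hash(digest, dst[i + j]);
    }
  return digest;
}

/* Software-pipelined loop: each iteration encrypts group i into a[] and
   hashes group i-1 held in b[]. The two steps in the loop body have no
   data dependency on each other, which is what allows interleaving. */
uint64_t stitched_pipelined(uint64_t *dst, const uint64_t *src, size_t n,
                            uint64_t ctr)
{
  uint64_t a[WAY], b[WAY];
  uint64_t digest = 0;
  assert(n % WAY == 0);
  if (n == 0)
    return digest;

  /* Prologue: encrypt the first group; there is nothing to hash yet. */
  for (size_t j = 0; j < WAY; j++)
    b[j] = dst[j] = toy_encrypt(ctr++, src[j]);

  for (size_t i = WAY; i < n; i += WAY) {
    for (size_t j = 0; j < WAY; j++) {
      a[j] = toy_encrypt(ctr++, src[i + j]); /* current group   */
      digest = toy_hash(digest, b[j]);       /* previous group  */
    }
    /* In assembly the final XOR would write straight into the B
       registers; the explicit copy here only keeps the sketch simple. */
    for (size_t j = 0; j < WAY; j++)
      b[j] = dst[i + j] = a[j];
  }

  /* Epilogue: hash the last group, which the loop only encrypted. */
  for (size_t j = 0; j < WAY; j++)
    digest = toy_hash(digest, b[j]);
  return digest;
}
```

Both functions should produce the same ciphertext and digest for any input whose length is a multiple of WAY; only the schedule differs. For a benchmark of the kind suggested above, the loop body of stitched_pipelined is the part that matters, and the prologue and epilogue can be left out.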