Hi Niels,
Here is version 2 of the AES/GCM stitched patch. The stitched code is now entirely in assembly and uses m4 macros. Overall performance improved by roughly 110% and 120% for encrypt and decrypt, respectively. Please see the attached patch and AES benchmark.
Thanks. -Danny
On Nov 22, 2023, at 2:27 AM, Niels Möller nisse@lysator.liu.se wrote:
Danny Tsen dtsen@us.ibm.com writes:
Interleaving at the instruction level may be a good option, but because of the PPC instruction pipeline it needs a sufficient number of registers/vectors. Reusing the same vectors in successive instructions may cost more cycles. To avoid that, more vector and scalar registers would get involved, and the whole vector assignment might have to change. That’s the reason I avoided it in this case.
To investigate the potential, I would suggest some experiments with software pipelining.
Write a loop to do 4 blocks of ctr-aes128 at a time, fully unrolling the round loop. I think that should be 44 instructions of aes mangling, plus instructions to set up the counter input, and do the final xor and endianness things with the message. Arrange so that it loads the AES state in a set of registers we can call A, operating in-place on these registers. But at the end, arrange the XORing so that the final ciphertext is located in a different set of registers, B.
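One way to arrive at the figure of 44 is sketched below. The per-block breakdown (one xor for the initial round key, nine vcipher rounds, one vcipherlast) is an assumption about how the unrolled loop would be written, and the function name is made up for illustration.

```python
def aes_round_instructions(blocks, rounds):
    """Count AES round instructions for a fully unrolled loop.

    Assumed per-block breakdown (hypothetical, not taken from the patch):
      1 xor for the initial AddRoundKey,
      rounds - 1 full rounds (vcipher),
      1 final round (vcipherlast).
    """
    per_block = 1 + (rounds - 1) + 1
    return blocks * per_block


# AES-128 uses 10 rounds.
four_way = aes_round_instructions(blocks=4, rounds=10)
eight_way = aes_round_instructions(blocks=8, rounds=10)
print(four_way, eight_way)
```

Counter setup, the final xor with the message, and byte swapping are not included in this count.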
Then, write the instructions to do ghash using the B registers as input, I think that should be about 20-25 instructions. Interleave those as well as possible with the AES instructions (say, two aes instructions, one ghash instruction, etc).
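As a rough illustration of the "two aes instructions, one ghash instruction" pattern, here is a small sketch that merges two instruction lists in that ratio. The instruction names are placeholders, the count of 22 ghash instructions is just one value in the estimated 20-25 range, and `interleave` is a hypothetical helper, not part of the patch or of Nettle.

```python
def interleave(aes, ghash, aes_per_ghash=2):
    """Merge two instruction streams: aes_per_ghash AES entries, then one GHASH entry.

    The relative order inside each stream is preserved, so dependencies
    within a stream are not violated by the merge.
    """
    out = []
    a = g = 0
    while a < len(aes) or g < len(ghash):
        # Emit up to aes_per_ghash AES instructions.
        take = aes[a:a + aes_per_ghash]
        out.extend(take)
        a += len(take)
        # Then one GHASH instruction, if any remain.
        if g < len(ghash):
            out.append(ghash[g])
            g += 1
    return out


# Placeholder instruction names; the real ones would come from the m4 macros.
aes = [f"aes{i}" for i in range(44)]
ghash = [f"ghash{i}" for i in range(22)]
schedule = interleave(aes, ghash)
print(schedule[:6])
```

A real schedule would still need manual tuning, since this only fixes the mix and not the latencies of individual instructions.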
Software pipelining means that each iteration of the loop does aes-ctr on four blocks, + ghash on the output for the four *previous* blocks (so one needs extra code outside of the loop to deal with first and last 4 blocks). Decrypt processing should be simpler.
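The loop structure described above can be sketched as follows. `toy_ctr` and `toy_hash` are toy stand-ins invented for this example (they are NOT AES or GHASH); only the control flow matters: a prologue for the first group, a loop body that encrypts group n while hashing group n-1, and an epilogue for the last group.

```python
MASK = 0xFFFFFFFF


def toy_ctr(group, counter):
    # Toy stand-in for AES-CTR on one group of blocks (NOT real AES).
    return [b ^ (((counter + i) * 0x9E3779B1) & MASK) for i, b in enumerate(group)]


def toy_hash(state, group):
    # Toy stand-in for GHASH over one group of blocks (NOT real GHASH).
    for c in group:
        state = ((state ^ c) * 0x01000193) & MASK
    return state


def reference(groups):
    # Straightforward version: encrypt a group, then hash it right away.
    out, state = [], 0
    for n, g in enumerate(groups):
        c = toy_ctr(g, 4 * n)
        out.append(c)
        state = toy_hash(state, c)
    return out, state


def pipelined(groups):
    # Software-pipelined version: hash lags one group behind encryption.
    if not groups:
        return [], 0
    prev = toy_ctr(groups[0], 0)          # prologue: first group, no hash yet
    out, state = [prev], 0
    for n in range(1, len(groups)):
        cur = toy_ctr(groups[n], 4 * n)   # "A" registers: current group
        state = toy_hash(state, prev)     # "B" registers: previous group
        out.append(cur)
        prev = cur
    state = toy_hash(state, prev)         # epilogue: hash the last group
    return out, state


groups = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
print(pipelined(groups) == reference(groups))
```

Inside the loop body the two calls do not depend on each other, which is what would allow their instructions to be interleaved in assembly.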
Then you can benchmark that loop in isolation. It doesn't need to be the complete function, the handling of first and last blocks can be omitted, and it doesn't even have to be completely correct, as long as it's the right instruction mix and the right data dependencies. The benchmark should give a good idea for the potential speedup, if any, from instruction-level interleaving.
I would hope 4-way is doable with available vector registers (and this inner loop should be less than 100 instructions, so not too unmanageable). Going up to 8-way (like the current AES code) would also be interesting, but as you say, you might have a shortage of registers. If you have to copy state between registers and memory in each iteration of an 8-way loop (which it looks like you also have to do in your current patch), that overhead cost may outweigh the gains you have from more independence in the AES rounds.
Regards, /Niels
-- Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677. Internet email is subject to wholesale government surveillance.