In preparation for merging the gcm-aes "stitched" implementation, I'm reviewing the existing ghash code, on the WIP branch "ppc-ghash-macros".
I've introduced a macro GHASH_REDUCE for the reduction logic. Besides that, I've been able to improve the scheduling of the reduction instructions (adding in the result of vpmsumd last seems to improve parallelism, giving some 3% speedup of gcm_update on power10, benchmarked on cfarm120). I've also streamlined the way load offsets are used, and slightly trimmed the number of vector registers needed.
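To illustrate the scheduling idea schematically (made-up variable and function names, and C vector builtins rather than the actual assembly macro): the xors that don't depend on the multiply can issue while vpmsumd is still in flight,

  #include <altivec.h>

  typedef vector unsigned long long v2di;

  /* Schematic only: fold the long-latency vpmsumd result into the
     accumulator last, so the other xors don't have to wait for it. */
  static inline v2di
  ghash_fold (v2di acc, v2di mid, v2di hi, v2di r, v2di poly)
  {
    v2di t = __builtin_crypto_vpmsumd (r, poly); /* starts early */
    acc ^= mid;      /* independent of t, overlaps the multiply */
    acc ^= hi;
    return acc ^ t;  /* vpmsumd result added in last */
  }

With the xor of t placed first instead, every following xor would be serialized behind the multiply's latency.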
For the AES code, I've merged the new macros (I settled on the names OPN_XXY and OPN_XXXY); no change in speed is expected from that change.
I've also tried to understand the difference between AES encrypt and decrypt, where decrypt is much slower and uses an extra xor instruction in the round loop. I think the reason is that other AES implementations (including the x86_64 and arm64 instruction sets, and Nettle's C implementation) expect the decryption subkeys to be transformed via the AES "MIX_COLUMN" operation, see https://gitlab.com/gnutls/nettle/-/blob/master/aes-invert-internal.c?ref_typ...
The powerpc64 vncipher instruction, on the other hand, really wants the original subkeys, not the transformed ones. So on power, it would be better to have a _nettle_aes_invert that is essentially a memcpy; the AES decrypt assembly code could then be reworked without the xors, and should run at exactly the same speed as encryption. The current _nettle_aes_invert also changes the order of the subkeys, with a FIXME comment suggesting that it would be better to instead update the order in which keys are accessed in the AES decryption functions.
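As I understand it, the transformed-key convention comes from the "equivalent inverse cipher" construction: since InvMixColumns is linear, InvMixColumns(s ^ k) = InvMixColumns(s) ^ InvMixColumns(k), so implementations that apply InvMixColumns before the round-key xor (like x86_64 aesdec) need subkeys pre-transformed with InvMixColumns, while vncipher xors the round key before its InvMixColumns step and so matches the untransformed schedule. A power-specific _nettle_aes_invert could then be reduced to a plain copy, something like this rough sketch (assuming the current argument convention of round count plus destination and source subkey arrays):

  #include <string.h>
  #include <stdint.h>

  /* Rough sketch, not actual Nettle code: on power, vncipher consumes
     the encryption subkeys as-is, so "inverting" the key schedule
     needs no MIX_COLUMN transform and no reordering, just a copy. */
  void
  _nettle_aes_invert (unsigned rounds, uint32_t *dst, const uint32_t *src)
  {
    /* rounds + 1 subkeys, each of 4 32-bit words. */
    memcpy (dst, src, (rounds + 1) * 4 * sizeof (*dst));
  }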
Regards, /Niels