On Thu, Apr 1, 2021 at 5:21 PM Niels Möller <nisse@lysator.liu.se> wrote:
nisse@lysator.liu.se (Niels Möller) writes:
(iii) I've considered doing it earlier, to make it easier to implement AES without a round loop (like for all current versions of aes-encrypt-internal.*). E.g., on x86_64, for aes128 we could load all subkeys into registers and still have registers left to do two or more blocks in parallel, but then we'd need to override aes128_encrypt separately from the other aes*_encrypt.
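For concreteness, the register-pressure argument looks roughly like this in C intrinsics (only a sketch of the idea, not code from any patch; the function name is made up). With the constant-trip-count round loop fully unrolled, all eleven aes128 subkeys fit in xmm registers, with registers to spare for two block states:

#include <wmmintrin.h>  /* AES-NI intrinsics */

/* Sketch only: encrypt two blocks in parallel, keeping the full
   aes128 key schedule (11 round keys) in xmm registers once the
   compiler unrolls the constant-trip-count loop. */
static void
aes128_encrypt_2blocks(const __m128i rk[11], __m128i *b0, __m128i *b1)
{
  __m128i x0 = _mm_xor_si128(*b0, rk[0]);  /* initial key whitening */
  __m128i x1 = _mm_xor_si128(*b1, rk[0]);
  int i;
  for (i = 1; i < 10; i++)
    {
      /* Interleaving two independent aesenc chains hides the
         latency of the aesenc instruction. */
      x0 = _mm_aesenc_si128(x0, rk[i]);
      x1 = _mm_aesenc_si128(x1, rk[i]);
    }
  *b0 = _mm_aesenclast_si128(x0, rk[10]);
  *b1 = _mm_aesenclast_si128(x1, rk[10]);
}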
I've given this a try, see the experimental patch below. It adds an x86_64/aesni/aes128-encrypt.asm with a 2-way loop. It gives a very modest speedup, 5%, when I benchmark on my laptop (which is now a pretty fast machine, an AMD Ryzen 5). I've also added a cbc-aes128-encrypt.asm. That gives a more significant speedup, almost 60%. I think the main reason for the speedup is that we avoid reloading subkeys between blocks.
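The CBC case, again as an intrinsics sketch rather than the actual .asm (the function name and signature here are hypothetical): the chaining dependency forces one block at a time, but hoisting the round keys out of the block loop is exactly what saves the per-block reloads.

#include <stddef.h>
#include <stdint.h>
#include <wmmintrin.h>

/* Sketch only: CBC encrypt is inherently sequential (each plaintext
   block is XORed with the previous ciphertext before encryption), so
   there is no 2-way trick, but the round keys stay in registers
   across the whole block loop. */
static void
cbc_aes128_encrypt_sketch(const __m128i rk[11], __m128i *iv,
                          size_t blocks, uint8_t *dst, const uint8_t *src)
{
  __m128i x = *iv;  /* previous ciphertext block, IV initially */
  for (; blocks > 0; blocks--, src += 16, dst += 16)
    {
      x = _mm_xor_si128(x, _mm_loadu_si128((const __m128i *) src));
      x = _mm_xor_si128(x, rk[0]);
      for (int i = 1; i < 10; i++)
        x = _mm_aesenc_si128(x, rk[i]);
      x = _mm_aesenclast_si128(x, rk[10]);
      _mm_storeu_si128((__m128i *) dst, x);
    }
  *iv = x;  /* chain across calls */
}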
If we want to go this way, I wonder how to do it without an explosion of files and functions. For s390x, it seems each function will be very small, but not so for most other archs. There are at least three modes that have to process blocks sequentially, with no parallelism between blocks: CBC encrypt, CMAC, and XTS (there may be more). It's not so nice if we need (modes × ciphers) assembly files, with lots of duplication.
One option is a core function for AES-CBC mode, say cbc_aes_encrypt, that backs cbc_aes128_encrypt, cbc_aes192_encrypt, and cbc_aes256_encrypt. We could then optimize cbc_aes_encrypt in assembly, taking care of the rounds parameter inside the implementation. Still, I prefer duplicating files and functions for the AES variants with different round counts rather than going with this approach, as I can't think of any other solution.
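Schematically, something like the following (the struct layout and _AES128_ROUNDS constant mirror Nettle's aes-internal.h as I understand it; the exact signatures are assumptions for illustration, not the real API):

#include <stddef.h>
#include <stdint.h>

/* Schematic type/constant, modeled on Nettle's aes-internal.h. */
#define _AES128_ROUNDS 10
struct aes128_ctx { uint32_t keys[4 * (_AES128_ROUNDS + 1)]; };

/* One core routine parameterized on the round count; this would be
   the single function implemented in assembly. */
void
cbc_aes_encrypt(unsigned rounds, const uint32_t *keys,
                uint8_t *iv, size_t length, uint8_t *dst, const uint8_t *src);

/* The per-key-size entry points then reduce to trivial wrappers. */
void
cbc_aes128_encrypt(const struct aes128_ctx *ctx, uint8_t *iv,
                   size_t length, uint8_t *dst, const uint8_t *src)
{
  cbc_aes_encrypt(_AES128_ROUNDS, ctx->keys, iv, length, dst, src);
}

The downside is that an assembly core handling a variable round count can no longer pin the complete key schedule to fixed registers the way a dedicated aes128 routine can, which is part of why I lean towards the duplicated per-key-size files.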