On Mon, Sep 13, 2021 at 5:08 PM Niels Möller nisse@lysator.liu.se wrote:
nisse@lysator.liu.se (Niels Möller) writes:
I've also added a cbc-aes128-encrypt.asm. That gives more significant speedup, almost 60%. I think main reason for the speedup is that we avoid reloading subkeys between blocks.
I've continued this path, see branch aes-cbc. The aes128 variant is at
https://git.lysator.liu.se/nettle/nettle/-/blob/aes-cbc/x86_64/aesni/cbc-aes...
Benchmark results are positive but a bit puzzling. On my laptop (AMD Ryzen 5) I get
aes128 ECB encrypt 5450.18
This is the latest version, doing two blocks per iteration.
aes128 CBC encrypt 547.34
The general CBC mode written in C, with one call to aes128_encrypt per block. 10(!) times slower than ECB.
cbc_aes128 encrypt 865.11
The new assembly function. Almost 60% speedup over the old code, which is nice, and large enough that it seems motivated to have the new function. But still 6 times slower than ECB. I'm not sure why. Let's look a bit closer at cycle numbers.
Not sure I get accurate cycle numbers (it's a bit tricky with variable features and turbo modes and whatnot), but it looks like ECB mode is 6 cycles per block, which would be consistent with issue of two aesenc instructions per cycle, while the CBC mode is 37 cycles per block, almost 4 cycles per aesenc.
This could be explained if (i) latency of aesenc is 3-4 cycles, and (ii) the processor's out-of-order machinery results in as many as 7-8 blocks processed in parallel when executing the ECB loop, i.e., instruction issue for 3-4 iterations through the loop before the results of the first iteration are ready.
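The arithmetic behind this explanation can be sketched in a few lines of C. This is only a rough model, not a measurement: it assumes AES-128 costs 10 dependent round instructions per block (9 aesenc plus 1 aesenclast) and ignores the initial key xor and all loads and stores. The function names are made up for illustration.

```c
#include <assert.h>

/* AES-128 uses 10 rounds per block: 9 aesenc + 1 aesenclast. */
enum { AES128_ROUNDS = 10 };

/* Cycles per block when every round has to wait for the previous one,
   which is the situation in CBC encryption: block i+1 needs the
   ciphertext of block i as input. */
double
serial_cycles_per_block(double round_latency)
{
  assert(round_latency > 0);
  return AES128_ROUNDS * round_latency;
}

/* Number of independent blocks that must be in flight to sustain a
   given throughput (latency of one block divided by the interval
   between completed blocks). */
double
blocks_in_flight(double round_latency, double cycles_per_block)
{
  assert(cycles_per_block > 0);
  return serial_cycles_per_block(round_latency) / cycles_per_block;
}
```

Under these assumptions, a round latency of 3.7 cycles gives the 37 cycles per block seen for CBC, and sustaining 6 cycles per block in ECB needs about 6 blocks in flight. With a latency of 4 cycles and two round instructions issued per cycle (5 cycles per block), the model gives 8 blocks, which is in the same ballpark as the estimate above.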
I ran the tests on an Intel Comet Lake processor and I can't think of another explanation: it seems x86_64 processors issue multiple blocks simultaneously without hand-written unrolling of the block loop. Also, Intel processors, or at least the Comet Lake architecture, seem to implement this machinery more effectively than your test processor (AMD Ryzen 5), so there you don't even need 2-way interleaving in the AES-ECB implementation or a separate AES-CBC implementation. In all cases I got the same benchmark speeds for ECB and for CBC, with CBC mode always 6 times slower than ECB mode.
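To illustrate why the processor can overlap blocks in ECB but not in CBC encryption, here is a minimal sketch of the two loops in C. It is not Nettle's code: the names sketch_ecb_encrypt, sketch_cbc_encrypt and block_encrypt_func are hypothetical, and toy_encrypt is a trivial xor stand-in for a block cipher (not AES), used only to keep the example self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 16

/* Hypothetical callback type: encrypts one 16-byte block. */
typedef void block_encrypt_func(const void *ctx, uint8_t *dst,
                                const uint8_t *src);

/* ECB: each block depends only on its own plaintext, so an
   out-of-order core is free to overlap the rounds of several blocks. */
void
sketch_ecb_encrypt(const void *ctx, block_encrypt_func *f,
                   size_t length, uint8_t *dst, const uint8_t *src)
{
  assert(length % BLOCK_SIZE == 0);
  for (; length > 0;
       length -= BLOCK_SIZE, dst += BLOCK_SIZE, src += BLOCK_SIZE)
    f(ctx, dst, src);
}

/* CBC: the cipher input is plaintext xor previous ciphertext, so
   block i+1 cannot start until block i is completely finished. */
void
sketch_cbc_encrypt(const void *ctx, block_encrypt_func *f, uint8_t *iv,
                   size_t length, uint8_t *dst, const uint8_t *src)
{
  uint8_t buf[BLOCK_SIZE];
  assert(length % BLOCK_SIZE == 0);
  for (; length > 0;
       length -= BLOCK_SIZE, dst += BLOCK_SIZE, src += BLOCK_SIZE)
    {
      for (size_t i = 0; i < BLOCK_SIZE; i++)
        buf[i] = src[i] ^ iv[i];
      f(ctx, dst, buf);
      memcpy(iv, dst, BLOCK_SIZE);  /* chaining value for the next block */
    }
}

/* Toy stand-in for a block cipher, NOT AES: xors every byte with a
   one-byte key passed through ctx. */
void
toy_encrypt(const void *ctx, uint8_t *dst, const uint8_t *src)
{
  uint8_t k = *(const uint8_t *) ctx;
  for (size_t i = 0; i < BLOCK_SIZE; i++)
    dst[i] = src[i] ^ k;
}
```

In the ECB loop nothing links one call of f to the next, while in the CBC loop the iv buffer carries a dependency from each block to the following one. That dependency is there regardless of how the per-block encryption is implemented.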
regards, Mamone