nisse@lysator.liu.se (Niels Möller) writes:
> I've also added a cbc-aes128-encrypt.asm. That gives more significant speedup, almost 60%. I think the main reason for the speedup is that we avoid reloading subkeys between blocks.
I've continued this path, see branch aes-cbc. The aes128 variant is at
https://git.lysator.liu.se/nettle/nettle/-/blob/aes-cbc/x86_64/aesni/cbc-aes...
Benchmark results are positive but a bit puzzling. On my laptop (AMD Ryzen 5) I get
aes128 ECB encrypt 5450.18
This is the latest version, doing two blocks per iteration.
aes128 CBC encrypt 547.34
The general CBC mode written in C, with one call to aes128_encrypt per block. 10(!) times slower than ECB.
cbc_aes128 encrypt 865.11
The new assembly function. Almost 60% speedup over the old code, which is nice, and large enough that it seems motivated to have the new function. But still 6 times slower than ECB. I'm not sure why. Let's look a bit closer at cycle numbers.
Not sure I get accurate cycle numbers (it's a bit tricky with variable features and turbo modes and whatnot), but it looks like ECB mode is 6 cycles per block, which would be consistent with issue of two aesenc instructions per cycle, while CBC mode is 37 cycles per block, almost 4 cycles per aesenc.
This could be explained if (i) the latency of aesenc is 3-4 cycles, and (ii) the processor's out-of-order machinery results in as many as 7-8 blocks processed in parallel when executing the ECB loop, i.e., instruction issue for 3-4 iterations through the loop before the results of the first iteration are ready.
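To make the arithmetic behind these estimates explicit, here is a small sketch in C. The helper names are made up for illustration, and it assumes that the final aesenclast costs the same as an aesenc, so that an AES-128 block counts as 10 round instructions. The input figures are the ones quoted above; nothing here is a new measurement.

```c
#include <assert.h>

/* AES-128 uses 10 rounds: 9 aesenc plus 1 aesenclast per block. */
#define AES128_ROUNDS 10

/* Ratio between two benchmark figures given in the same unit. */
double speed_ratio(double a, double b)
{
  assert(b > 0);
  return a / b;
}

/* Average cycles per round instruction for one block.
   Assumption: aesenclast is counted like an aesenc. */
double cycles_per_round(double cycles_per_block)
{
  return cycles_per_block / AES128_ROUNDS;
}

/* Number of independent blocks that must be in flight to keep
   issuing issue_per_cycle instructions per cycle when each result
   takes latency_cycles to arrive (latency times issue rate). */
double blocks_in_flight(double latency_cycles, double issue_per_cycle)
{
  return latency_cycles * issue_per_cycle;
}
```

With these, cycles_per_round(37) is 3.7 for CBC, cycles_per_round(6) is 0.6 for ECB (about 1.7 round instructions per cycle), and blocks_in_flight with a latency of 3.5 to 4 cycles and an issue rate of 2 per cycle gives 7 to 8 blocks. In CBC each block depends on the previous ciphertext block, so only one block can be in flight and the full latency is paid for every round.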
The interface for the new function is
struct cbc_aes128_ctx CBC_CTX(struct aes128_ctx, AES_BLOCK_SIZE);
void cbc_aes128_encrypt(struct cbc_aes128_ctx *ctx, size_t length, uint8_t *dst, const uint8_t *src);
I'm not that fond of the struct cbc_aes128_ctx though, which includes both (constant) subkeys and iv. So I'm considering changing that to
void cbc_aes128_encrypt(const struct aes128_ctx *ctx, uint8_t *iv, size_t length, uint8_t *dst, const uint8_t *src);
I.e., similar to cbc_encrypt, but without the arguments nettle_cipher_func *f, size_t block_size.
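To illustrate the calling convention with a separate iv argument, here is a self-contained sketch in C. It does not use AES or nettle at all: struct toy_ctx, toy_encrypt_block and cbc_toy_encrypt are hypothetical stand-ins, and the "cipher" is a trivial, insecure byte transformation chosen only so the example compiles on its own. It also assumes the iv is updated in place to the last ciphertext block, as cbc_encrypt does, so that a message can be processed in several calls.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_BLOCK_SIZE 16

/* Hypothetical stand-in for struct aes128_ctx; holds constant key data only. */
struct toy_ctx { uint8_t key[TOY_BLOCK_SIZE]; };

/* Toy in-place "block cipher" (NOT secure, NOT AES): XOR with the key,
   then add the byte index. */
static void toy_encrypt_block(const struct toy_ctx *ctx, uint8_t *block)
{
  for (size_t i = 0; i < TOY_BLOCK_SIZE; i++)
    block[i] = (uint8_t) ((block[i] ^ ctx->key[i]) + i);
}

/* CBC encryption with the context passed as const and the iv passed
   separately. On return, iv holds the last ciphertext block. */
void cbc_toy_encrypt(const struct toy_ctx *ctx, uint8_t *iv,
                     size_t length, uint8_t *dst, const uint8_t *src)
{
  assert(length % TOY_BLOCK_SIZE == 0);
  for (; length > 0; length -= TOY_BLOCK_SIZE,
         src += TOY_BLOCK_SIZE, dst += TOY_BLOCK_SIZE)
    {
      /* Chain: XOR the plaintext block into the running iv. */
      for (size_t i = 0; i < TOY_BLOCK_SIZE; i++)
        iv[i] ^= src[i];
      toy_encrypt_block(ctx, iv);
      memcpy(dst, iv, TOY_BLOCK_SIZE);
    }
}
```

Since the context is never written to, the same key schedule could be shared between several streams, each with its own iv buffer, and encrypting a message in one call or in several consecutive calls gives the same ciphertext.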
Regards,
/Niels