Maamoun TK <maamoun.tk@googlemail.com> writes:
> This is great information that I can keep in mind for future implementations. The s390x arch offers the 'xc' instruction, "Storage-to-storage XOR", with a maximum length of 256 bytes, but we can do as many iterations as we need. I optimized memxor using that instruction since it achieves the best performance for this case; I'll attach the patch at the end of this message.
Nice! I'd like to merge this as soon as the s390x ci is up and running again.
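For readers without the patch at hand, the chunking idea can be sketched in portable C. This is only a model under stated assumptions: xc_model is a hypothetical stand-in for a single 'xc' instruction (the real code would be s390x assembly), and memxor_chunked and XC_MAX are illustrative names, not taken from the patch.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed limit, from the description above: one 'xc' handles at most
   256 bytes. */
#define XC_MAX 256

/* Hypothetical stand-in for one 'xc' instruction: XOR n <= XC_MAX bytes
   of src into dst, one byte at a time, left to right. */
static void
xc_model(uint8_t *dst, const uint8_t *src, size_t n)
{
  for (size_t i = 0; i < n; i++)
    dst[i] ^= src[i];
}

/* memxor-style operation (dst ^= src) built from XC_MAX-sized pieces:
   full chunks first, then one final partial chunk if any bytes remain. */
static void *
memxor_chunked(void *dst_in, const void *src_in, size_t n)
{
  uint8_t *dst = dst_in;
  const uint8_t *src = src_in;

  while (n > XC_MAX)
    {
      xc_model(dst, src, XC_MAX);
      dst += XC_MAX;
      src += XC_MAX;
      n -= XC_MAX;
    }
  if (n > 0)
    xc_model(dst, src, n);

  return dst_in;
}
```

The sketch says nothing about performance; it only shows that the 256-byte limit costs one loop around the instruction.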
> Unfortunately, I couldn't manage to optimize memxor3 using the 'xc' instruction because, while it supports overlapping operands, it processes them from left to right, one byte at a time.
Hmm, I wonder if there's some way to work around that.
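To make the problem concrete, here is a small C sketch. It assumes the overlap case of interest is a destination that starts a few bytes after one of the sources in the same buffer, and it models 'xc' as a plain ascending byte loop; xor_ascending and xor_descending are hypothetical names used only for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Ascending order, one byte at a time: a model of how 'xc' is described
   to treat its operands. */
static void
xor_ascending(uint8_t *dst, const uint8_t *src, size_t n)
{
  for (size_t i = 0; i < n; i++)
    dst[i] ^= src[i];
}

/* Descending order: every source byte is read before the overlapping
   destination byte that aliases it gets overwritten. */
static void
xor_descending(uint8_t *dst, const uint8_t *src, size_t n)
{
  for (size_t i = n; i-- > 0; )
    dst[i] ^= src[i];
}
```

With a buffer {1, 2, 4, 8, 16, 32}, dst = buf + 2, src = buf and n = 4, the descending loop leaves {1, 2, 5, 10, 20, 40}, where each destination byte is XORed with the original source byte. The ascending loop leaves {1, 2, 5, 10, 21, 42}, because the last two source bytes it reads were already modified.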
> However, I think optimizing just memxor could give a good sense of how much it would increase the performance of AES modes. CBC mode comes in handy here since it uses memxor in the encrypt operation, and in the decrypt operation when the operands don't overlap. Here is the benchmark result for CBC mode:
> *-----------------------------------------------------------------------*
> |                                   | AES-128 Encrypt | AES-128 Decrypt |
> |-----------------------------------|-----------------|-----------------|
> | CBC-Accelerator                   | 1.18 cbp        | 0.75 cbp        |
> | Basic AES-Accelerator             | 13.50 cbp       | 3.34 cbp        |
> | Basic AES-Accelerator with memxor | 15.50           | 1.57            |
> *-----------------------------------------------------------------------*
This seems to confirm that cbc encrypt is the operation that gains the most from assembly for the combined operation. Since decrypt can also gain a factor of two in performance, does that mean that both aes-cbc and memxor run at a speed limited by memory bandwidth, and that the gain comes from one less pass loading and storing data from memory?
What unit is "cbp"? If it's cycles per byte, 0.77 cycles/byte for memxor (the cost of "Basic AES-Accelerator with memxor" minus the cost of CBC-Accelerator) sounds unexpectedly slow compared to, e.g., x86_64, where I get 0.08 cycles per byte (regardless of alignment), or 0.64 cycles per 64-bit word.
Regards,
/Niels