Maamoun TK <maamoun.tk@googlemail.com> writes:
I've tried out a split, see the patch below. It's a rather large change, moving pieces to new places, but nothing difficult. I'm considering committing this to the s390x branch; what do you think?
I agree. I'll modify the patch for the basic AES-128 optimized functions so that it builds on top of the split aes functions.
Ok, pushed to the s390x branch now.
memxor performs the same in C and assembly, since the s390 architecture offers a memory xor instruction, "xc". See the xor_len macro in machine.m4 of the original patch for an implementation example.
But the C implementation is somewhat complicated, splitting into several cases depending on alignment, and shifting data around to be able to do word operations. If it can be done more simply with the xc instruction, that would at least cut some overhead. (Note that memxor3 must support the overlap case needed by cbc decrypt).
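To illustrate the overlap case mentioned above, here is a minimal byte-wise sketch. It is not Nettle's actual memxor3 (which works on words and handles alignment); the name xor3_descending is hypothetical. It assumes the overlap in question is the in-place cbc decrypt pattern, where one source starts one block before the destination, so walking from the last byte down to the first reads every source byte before it is overwritten.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not Nettle's memxor3: computes
   dst[i] = a[i] ^ b[i] for i in [0, n), processing bytes in
   descending order. This is safe when b overlaps the start of dst
   (b < dst), because b[i] is read before the later store to the
   same address. */
void
xor3_descending(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
  while (n-- > 0)
    dst[n] = a[n] ^ b[n];
}
```

For an in-place cbc-style step with block size bs, a caller would use something like xor3_descending(buf + bs, decrypted + bs, buf, len - bs), where buf still holds the ciphertext and decrypted holds the raw block cipher output. An ascending loop would give wrong results here once i reaches bs, since it would read bytes of buf that it had already replaced.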
However, the s390x AES accelerators offer a considerable speedup over the C implementation with optimized internal AES. The following table demonstrates the idea more clearly:
Function             S390x accelerator   C implementation with optimized internal AES (only aes128.asm, aes192.asm, aes256.asm enabled)
[...]
CBC AES128 Decrypt   0.647008 cpb        3.131405 cpb
[...]
CTR AES128 Crypt     0.710237 cpb        4.767290 cpb
For these two, the speed difference should essentially be the time for the C implementation of memxor. "cpb" means cycles per byte, right? 2-4 cycles per byte for memxor is quite slow. On my x86_64 laptop (ok, comparing apples to oranges), memxor, for the aligned case, is 0.08 cpb, and memxor twice as much. And even the C implementation is not that much slower.
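The "2-4 cycles per byte" estimate is just the difference between the two columns of the quoted rows. As a small worked sketch (cpb_gap is a hypothetical helper name, and attributing the whole gap to memxor is the assumption stated above, not a measurement):

```c
/* Hypothetical helper: extra cycles per byte spent by the C path
   compared with the accelerator figure for the same mode. */
double
cpb_gap(double c_impl_cpb, double accel_cpb)
{
  return c_impl_cpb - accel_cpb;
}
```

With the figures from the table, cpb_gap(3.131405, 0.647008) is about 2.48 for CBC decrypt, and cpb_gap(4.767290, 0.710237) is about 4.06 for CTR.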
GCM AES128 Encrypt   0.630504 cpb        15.473187 cpb
For GCM, are there instructions that combine AES-CTR and GCM HASH? Or are those done separately? It would be nice to have GCM HASH be fast by itself, for performance with ciphers other than aes.
Regards, /Niels