Maamoun TK <maamoun.tk@googlemail.com> writes:
> I measured the latency and throughput of vcipher/vncipher/vxor
> instructions for POWER8
>
> vcipher/vncipher
> throughput 6 instructions per cycle
> latency 0.91 clock cycles
>
> vxor
> throughput 6 instructions per cycle
> latency 0.32 clock cycles
Latency less than one cycle sounds wrong. Usually, simple ALU instructions like xor have a latency of exactly one cycle, i.e., when an instruction starts executing (all inputs are available), the result is available for depending instructions exactly one cycle later. Deeply pipelined instructions, e.g., multiplication, can have a latency of several cycles but still a throughput of one or a few instructions per cycle.
See https://gmplib.org/~tege/x86-timing.pdf for background and lots of numbers for x86 processors.
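To make the distinction concrete, here is a small Python sketch of a toy model. The function names and all numbers are made up for illustration, and the model (a fixed latency plus a fixed number of instructions started per cycle) is an assumption, not a description of POWER8. It only shows how a figure below one cycle per instruction can be a reciprocal throughput rather than a latency.

```python
import math


def estimated_cycles(n_instructions, n_chains, latency, per_cycle):
    """Rough cycle count for n_instructions spread over n_chains
    independent dependency chains.

    Toy model (an assumption, not a description of any real core):
    an instruction waits `latency` cycles for its predecessor in the
    same chain, and at most `per_cycle` instructions start per cycle.
    """
    longest_chain = math.ceil(n_instructions / n_chains)
    latency_bound = longest_chain * latency        # dependent work
    throughput_bound = n_instructions / per_cycle  # issue width limit
    return max(latency_bound, throughput_bound)


def cycles_per_instruction(n_instructions, n_chains, latency, per_cycle):
    """Average cycles per instruction under the same toy model."""
    total = estimated_cycles(n_instructions, n_chains, latency, per_cycle)
    return total / n_instructions


# Made-up numbers: latency 1 cycle, 2 instructions started per cycle.
# One long dependent chain exposes the latency ...
print(cycles_per_instruction(1200, 1, latency=1, per_cycle=2))   # 1.0
# ... while independent instructions only show reciprocal throughput.
print(cycles_per_instruction(1200, 12, latency=1, per_cycle=2))  # 0.5
```

In this model the per-instruction time never drops below the latency when every instruction depends on the previous one, so a measurement that is meant to show latency needs a single dependent chain.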
> So the ideal option for POWER8 is processing 8 blocks, it has +12% performance over processing 4 blocks.
Sounds reasonable to me.
> powerpc64/P8/aes-decrypt-internal.asm | 367
I take it "P8" in the path is for POWER8? Are the crypto extensions always available on POWER8? If not, the directory should be named differently.
To get going, I've merged this and the machine.m4 patch to a development branch. I'd like to do things stepwise: first do the minimal configure changes to get AES working (maybe enabled by default, so that it gets exercised by the .gitlab-ci machinery), then add ghash and fat builds (not sure in which order). I also wanted to merge the README patch right away, but that failed due to line breaks in the email.
BTW, about fat tests, I'm considering adding a make target "check-fat" which will run make check with some different settings of NETTLE_FAT_OVERRIDE (platform specific, and determined by configure).
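Roughly what I have in mind, written as a Python sketch just to pin down the behaviour; the real thing would be a Makefile target. The function name run_fat_checks and its parameters are hypothetical, and the list of override values is left to the caller, since the actual values would be platform specific and come from configure.

```python
import os
import subprocess


def run_fat_checks(overrides, command=("make", "check"), run=subprocess.call):
    """Run `command` once per NETTLE_FAT_OVERRIDE value.

    Returns a dict mapping each override value to the exit status of
    that run. `overrides` is supplied by the caller; in the real
    target the list would be determined by configure.
    """
    results = {}
    for value in overrides:
        # Copy the current environment and set the override for this run.
        env = dict(os.environ, NETTLE_FAT_OVERRIDE=value)
        results[value] = run(list(command), env=env)
    return results


# Example call (the values are placeholders, not real settings):
#   run_fat_checks(["setting-a", "setting-b"])
```

Collecting the exit status per setting, rather than stopping at the first failure, makes it easy to see which override broke the tests.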
Regards,
/Niels