Maamoun TK <maamoun.tk@googlemail.com> writes:
I added a PowerPC64LE optimized version of AES and GHASH to nettle.
Cool. I haven't yet looked at the patches, but some general comments:
- The main equation: The main equation for 4 blocks (128 bits each) can be
seen in reference [1]:
  Digest = (((((((Digest⊕C0)*H)⊕C1)*H)⊕C2)*H)⊕C3)*H
         = ((Digest⊕C0)*H^4)⊕(C1*H^3)⊕(C2*H^2)⊕(C3*H)
To achieve more parallelism, this equation can be modified to process 8 blocks per loop iteration:
  Digest = ((Digest⊕C0)*H^8)⊕(C1*H^7)⊕(C2*H^6)⊕(C3*H^5)⊕(C4*H^4)⊕(C5*H^3)⊕(C6*H^2)⊕(C7*H)
Have you measured the speedup when going from 4 to 8 blocks? We shouldn't add larger loops than needed.
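For what it's worth, the identity itself is easy to check numerically. Below is a minimal Python sketch, not nettle code: it works in the plain polynomial basis of GF(2^128) rather than GCM's bit-reflected convention (the identity holds in any field), and the helper names gf_mul, gf_pow, ghash_serial and ghash_aggregated are made up for illustration.

```python
# Reduction polynomial x^128 + x^7 + x^2 + x + 1, plain (non-reflected) basis.
R = (1 << 128) | 0x87

def gf_mul(a, b):
    """Multiply two field elements (ints below 2**128) in GF(2^128)."""
    p = 0
    while b:
        if b & 1:
            p ^= a          # add (XOR) the current multiple of a
        b >>= 1
        a <<= 1             # multiply a by x ...
        if a >> 128:
            a ^= R          # ... and reduce if degree reached 128
    return p

def gf_pow(h, n):
    """h**n in the field, by repeated multiplication (fine for small n)."""
    r = 1
    for _ in range(n):
        r = gf_mul(r, h)
    return r

def ghash_serial(digest, blocks, h):
    """Horner form: one dependent multiplication per block."""
    for c in blocks:
        digest = gf_mul(digest ^ c, h)
    return digest

def ghash_aggregated(digest, blocks, h):
    """Expanded form: independent multiplications by precomputed powers of H."""
    n = len(blocks)
    acc = gf_mul(digest ^ blocks[0], gf_pow(h, n))
    for i in range(1, n):
        acc ^= gf_mul(blocks[i], gf_pow(h, n - i))
    return acc
```

Both functions should agree for any block count; the aggregated form only pays off if the independent multiplications actually overlap in the pipeline, which is why the measurement matters.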
- Handling bit-reflection of the multiplication product [1]: This
technique moves part of the per-iteration workload out of the loop and into the init function, so it is executed only once.
The "carry-less" multiplication is symmetric under bit reversal, so it's great to get that out of the main loops.
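To spell out what I mean by the symmetry: for n-bit inputs the carry-less product has 2n-1 bits, and reversing the inputs reverses the product over those 2n-1 bits. A small Python sketch of that property (the helper names clmul, bitrev and check_reflection are mine, for illustration only):

```python
def clmul(a, b):
    """Carry-less (polynomial over GF(2)) product of two non-negative ints."""
    p = 0
    while b:
        if b & 1:
            p ^= a
        a <<= 1
        b >>= 1
    return p

def bitrev(x, width):
    """Reverse the low `width` bits of x."""
    r = 0
    for _ in range(width):
        r = (r << 1) | (x & 1)
        x >>= 1
    return r

def check_reflection(a, b, n=64):
    """True if clmul(rev(a), rev(b)) equals the (2n-1)-bit reversal of clmul(a, b)."""
    return clmul(bitrev(a, n), bitrev(b, n)) == bitrev(clmul(a, b), 2 * n - 1)
```

Note the width 2n-1 rather than 2n: in a 2n-bit register the reflected product ends up one bit position off, and I assume that fixed offset is the kind of adjustment that can be absorbed into the key setup.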
- Karatsuba algorithm: This algorithm makes it possible to perform three
multiplication instructions instead of four, in exchange for two additional XORs. The technique is well explained, with figures, in reference [1].
Do you measure a speedup from this? Karatsuba usually pays off only at somewhat larger sizes (but I guess the overhead is a little lower here than for standard multiplication).
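For reference, this is the trade-off I have in mind, as a Python sketch of a 128x128-bit carry-less multiply built from 64x64-bit pieces. It is a model only, not the PowerPC code; the function names are hypothetical, and clmul stands in for whatever 64-bit carry-less multiply instruction is available.

```python
MASK64 = (1 << 64) - 1

def clmul(a, b):
    """Carry-less (polynomial over GF(2)) product of two non-negative ints."""
    p = 0
    while b:
        if b & 1:
            p ^= a
        a <<= 1
        b >>= 1
    return p

def clmul128_schoolbook(a, b):
    """Four 64x64 multiplications."""
    a0, a1 = a & MASK64, a >> 64
    b0, b1 = b & MASK64, b >> 64
    lo = clmul(a0, b0)
    mid = clmul(a0, b1) ^ clmul(a1, b0)
    hi = clmul(a1, b1)
    return (hi << 128) ^ (mid << 64) ^ lo

def clmul128_karatsuba(a, b):
    """Three 64x64 multiplications, plus extra XORs for the middle term."""
    a0, a1 = a & MASK64, a >> 64
    b0, b1 = b & MASK64, b >> 64
    lo = clmul(a0, b0)
    hi = clmul(a1, b1)
    # (a0+a1)(b0+b1) = lo + hi + cross terms; addition in GF(2) is XOR.
    mid = clmul(a0 ^ a1, b0 ^ b1) ^ lo ^ hi
    return (hi << 128) ^ (mid << 64) ^ lo
```

Since addition is XOR there are no carries to propagate between the partial products, which is why I'd expect the overhead to be smaller than for integer Karatsuba. Whether it wins in practice still depends on the relative cost of the multiply and XOR instructions.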
- A 128-byte test is added to gcm-test in the testsuite to exercise the 8x loop in
the optimized GHASH function.
Good!
- Since the functionality of gcm_set_key() is replaced with
gcm_init_key() for PowerPC64LE, two warnings will pop up: [‘gcm_gf_shift’ defined but not used] and [‘gcm_gf_add’ defined but not used]
You can perhaps solve this by adding
  #if HAVE_NATIVE_...
  #endif
around the related functions.
To test the PPC code, I wonder if it would be easy to add a PPC build to .gitlab-ci, in the same way as the arm and mips tests. Those are based on Debian-packaged cross compilers and qemu-user. I'm also not that familiar with the variants within the Power and PowerPC family of processors.
Regards,
/Niels