What's the speedup you get from assembly gcm_fill? I see the C implementation uses memcpy and WRITE_UINT32, and is likely significantly slower than the ctr_fill16 in ctr.c. But it could be improved using portable means. If done well, it should be a very small fraction of the cpu time spent for gcm encryption.
Actually, there is a counter involved along with the writing operations. The idea here is that a single block is 16 bytes wide, which fits in a vector register, so each block needs only one store; combined with the processor's ability to execute e.g. 6 addition operations per cycle, that boosts the function's performance.

I measured the execution time of both the C and altivec implementations on POWER8 for 32,768 blocks (512 KB), repeated 10000 times and compiled with -O3:

gcm_fill_c()       took 0.000073 seconds to execute
gcm_fill_altivec() took 0.000019 seconds to execute

As you can see, the function itself isn't time-consuming at all, so optimizing it on its own is maybe not worth it. But gcm_fill is part of AES_CTR, and what other libraries usually do is optimize AES_CTR as a whole, so I optimized it to stay on the same track.
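For reference, a portable improvement along the lines suggested could look roughly like the sketch below: keep the big-endian 32-bit counter in a register across the loop instead of re-reading and re-writing it per block. The function name and exact interface are illustrative, not Nettle's actual code.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical portable gcm_fill sketch: copy the 16-byte counter block
   to each output block and bump the big-endian 32-bit counter in the
   last four bytes.  Keeping the counter in a local variable avoids a
   read-modify-write of the counter block on every iteration. */
static void
gcm_fill_sketch (uint8_t *ctr, size_t blocks, uint8_t *buffer)
{
  /* Read the big-endian counter once, before the loop. */
  uint32_t c = ((uint32_t) ctr[12] << 24) | ((uint32_t) ctr[13] << 16)
    | ((uint32_t) ctr[14] << 8) | (uint32_t) ctr[15];
  size_t i;

  for (i = 0; i < blocks; i++, buffer += 16)
    {
      memcpy (buffer, ctr, 12);       /* fixed IV part */
      buffer[12] = c >> 24;           /* big-endian counter part */
      buffer[13] = (c >> 16) & 0xff;
      buffer[14] = (c >> 8) & 0xff;
      buffer[15] = c & 0xff;
      c++;
    }
  /* Write the advanced counter back for the next call. */
  ctr[12] = c >> 24;
  ctr[13] = (c >> 16) & 0xff;
  ctr[14] = (c >> 8) & 0xff;
  ctr[15] = c & 0xff;
}
```

A vector implementation does essentially the same thing with one 16-byte store and one vector add per block.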
What table layout is used by the assembly gcm_init_key? I would have expected not many tables to be needed when using the special multiply instructions. And Nettle generally doesn't try very hard to optimize key setup, on the theory that applications that need high performance also do a lot of work with each key. E.g., we use C code for AES key setup even though it could be sped up by assembly using special instructions.
So it would be easier for me to start with a patch for gcm_hash only (possibly with supporting C code for key setup).
It's a little bit more complicated than just the special multiply instructions; let me explain.
From reference [1]:
To compute the GHASH digest of 4 consecutive blocks, we use a method of parallelization as described in Reference [3]. The method can be described by the following equations:

Ciphertext inputs: C0, C1, C2, C3. Digest input/output: Digest.

Digest = (((((((Digest ⊕ C0)*H) ⊕ C1)*H) ⊕ C2)*H) ⊕ C3)*H
       = ((Digest ⊕ C0)*H^4) ⊕ (C1*H^3) ⊕ (C2*H^2) ⊕ (C3*H)
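The equivalence of the two forms follows from distributivity of the field multiplication over XOR. To make that concrete, here is a small sketch that checks both forms agree, using a toy GF(2^8) multiply (the AES polynomial) in place of the real GF(2^128) block multiply; all names here are illustrative, not from the actual patch.

```c
#include <stdint.h>

/* Toy GF(2^8) multiply (polynomial x^8+x^4+x^3+x+1), standing in for
   the GF(2^128) block multiply.  The algebra of the 4-block step is
   the same in both fields. */
static uint8_t
gf_mul (uint8_t a, uint8_t b)
{
  uint8_t r = 0;
  while (b)
    {
      if (b & 1)
        r ^= a;
      b >>= 1;
      a = (uint8_t) ((a << 1) ^ ((a & 0x80) ? 0x1b : 0));
    }
  return r;
}

/* Serial form: (((((((Digest^C0)*H)^C1)*H)^C2)*H)^C3)*H */
static uint8_t
ghash4_serial (uint8_t d, const uint8_t c[4], uint8_t h)
{
  int i;
  for (i = 0; i < 4; i++)
    d = gf_mul ((uint8_t) (d ^ c[i]), h);
  return d;
}

/* Parallel form: ((Digest^C0)*H^4) ^ (C1*H^3) ^ (C2*H^2) ^ (C3*H),
   with hp[i] = H^(i+1) precomputed.  The four products are
   independent, which is what lets the real code keep the multiply
   units busy. */
static uint8_t
ghash4_parallel (uint8_t d, const uint8_t c[4], const uint8_t hp[4])
{
  return (uint8_t) (gf_mul ((uint8_t) (d ^ c[0]), hp[3])
                    ^ gf_mul (c[1], hp[2])
                    ^ gf_mul (c[2], hp[1])
                    ^ gf_mul (c[3], hp[0]));
}
```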
As you can see, there are pre-computed constants here, H, H^2, H^3, H^4, which should be stored in a table to be used in upcoming hash operations. Furthermore, to handle the bit-reflection of the multiplication product, we precompute "H << 1 modulo polynomial g(x)", and then to compute H^2, H^3, H^4 we use the identity

reflected(A) * reflected(H << 1 mod g(x)) = reflected(A*H mod g(x))

So we set H = H << 1 modulo polynomial g(x), then H^2 = H * H, and so on. We can use these values as 128-bit constants and avoid shifting the product. Also, in the init function we have to precompute the inputs to the Karatsuba algorithm, so we adapt the previous constants to be compatible with it.

In summary, the table layout varies according to the techniques used in the different implementations, so it's unreasonable to fill one fixed table using C as a standard key setup. For example, there are several x86 GCM implementations in the OpenSSL library, and the table layouts of the SSE and AVX implementations differ from each other.
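Stripped of the bit-reflection and Karatsuba adaptation, the power-table part of key setup is just repeated multiplication by H. A hedged sketch, reusing the toy GF(2^8) multiply as a stand-in for the 128-bit block multiply (names are hypothetical, not from the patch):

```c
#include <stdint.h>

/* Toy GF(2^8) multiply, as a stand-in for the GF(2^128) one. */
static uint8_t
gf_mul (uint8_t a, uint8_t b)
{
  uint8_t r = 0;
  while (b)
    {
      if (b & 1)
        r ^= a;
      b >>= 1;
      a = (uint8_t) ((a << 1) ^ ((a & 0x80) ? 0x1b : 0));
    }
  return r;
}

/* Hypothetical key-setup sketch: precompute the table H, H^2, H^3, H^4
   needed by the 4-way hash.  (The real code first replaces H by
   H<<1 mod g(x) so the reflected products need no final shift, and
   also stores Karatsuba-adapted forms; both are omitted here.) */
static void
gcm_table_sketch (uint8_t table[4], uint8_t h)
{
  int i;
  table[0] = h;
  for (i = 1; i < 4; i++)
    table[i] = gf_mul (table[i - 1], h);
}
```

This is the part that is hard to share as portable C key setup: the entries, their ordering, and their encoding all depend on which multiplication technique the assembly hash loop uses.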
[1] https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/co...