Maamoun TK <maamoun.tk@googlemail.com> writes:
> This implementation takes advantage of research by Niels Möller to optimize GCM on PowerPC. The optimization yields a +27.7% performance boost on POWER8 over the previous implementation, which was based on Intel documents. The performance comparison was made by processing 4 blocks per loop, without any further optimizations.
Hi, the patch didn't apply cleanly due to email line breaks (maybe try posting as a text attachment next time?), but I've applied it semi-manually, and pushed it to a branch ppc-gcm.
I gave it a test run on gcc112 in the gcc compile farm, and the speedup of gcm update seems to be 26 times(!) compared to the C version.
> I added some documentation between the lines, but I suggest writing a document similar to the Intel ones that goes into more detail and clarifies why this method is preferable.
Where would that documentation be published? In the Nettle manual, as some IBM white paper, or as a more-or-less academic paper, e.g., on arxiv? I will not be able to spend much time on writing, but I'd be happy to review.
> I'm also curious whether this method can make a difference on other architectures like ARM. I'm planning to try it out on ARM to find out.
I have a sketch of ARM Neon code doing the equivalent of two vpmsumd, with reasonable parallelism. Quite a lot of instructions needed.
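For reference, what a single vpmsumd computes can be modelled in plain C roughly as below. This is not the Neon sketch mentioned above, only a slow bit-by-bit reference under the assumption that vpmsumd forms the carry-less (polynomial over GF(2)) product of each pair of 64-bit doublewords and XORs the two 128-bit products together. The names clmul128, clmul64, vpmsumd_ref and clmul_selfcheck are made up for illustration and appear neither in Nettle nor in the patch.

```c
#include <assert.h>
#include <stdint.h>

/* 128-bit carry-less product, kept as two 64-bit halves (hypothetical type). */
struct clmul128 { uint64_t hi; uint64_t lo; };

/* Carry-less multiply of two 64-bit polynomials over GF(2),
   one bit of b at a time: shift-and-XOR instead of shift-and-add. */
static struct clmul128
clmul64(uint64_t a, uint64_t b)
{
  struct clmul128 r = { 0, 0 };
  unsigned i;
  for (i = 0; i < 64; i++)
    if ((b >> i) & 1)
      {
        r.lo ^= a << i;
        if (i > 0)               /* avoid an undefined shift by 64 */
          r.hi ^= a >> (64 - i);
      }
  return r;
}

/* Assumed model of vpmsumd: XOR of the two doubleword products. */
static struct clmul128
vpmsumd_ref(const uint64_t a[2], const uint64_t b[2])
{
  struct clmul128 p0 = clmul64(a[0], b[0]);
  struct clmul128 p1 = clmul64(a[1], b[1]);
  struct clmul128 r = { p0.hi ^ p1.hi, p0.lo ^ p1.lo };
  return r;
}

/* Small sanity check: (x + 1)^2 = x^2 + 1 over GF(2), i.e. 3 * 3 = 5. */
static void
clmul_selfcheck(void)
{
  struct clmul128 r = clmul64(3, 3);
  assert(r.hi == 0 && r.lo == 5);
}
```

A model like this is only useful for checking a vectorized implementation against known values; it says nothing about how many instructions the real Neon or PowerPC code needs.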
Regards, /Niels
> +C Alignment of gcm_key table elements, which is declared in gcm.h
> +define(`TableElemAlign', `0x100')
I still find this large constant puzzling. If I try
  struct gcm_key key;
  printf("sizeof (key): %zd, sizeof(key.h[0]): %zd\n", sizeof(key), sizeof(key.h[0]));
(I added it to the start of test_main in gcm-test.c) and run on the gcc112 machine, I get
sizeof (key): 4096, sizeof(key.h[0]): 16
Which is what I'd expect, with elements of size 16 bytes, not 256 bytes.
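The same arithmetic can be reproduced without linking against Nettle, using a stand-in struct sized to match the output quoted above (256 elements of 16 bytes each). The names demo_block16, demo_gcm_key, DEMO_TABLE_BITS and demo_elem_offset are hypothetical, and the layout is an assumption derived from the sizes printed above, not a copy of gcm.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: 2^8 table entries, matching 4096 / 16 from the output above. */
#define DEMO_TABLE_BITS 8

/* Stand-in for a 16-byte block (hypothetical, not Nettle's definition). */
union demo_block16 { uint8_t b[16]; uint64_t u64[2]; };

/* Stand-in for the key table (hypothetical, not Nettle's definition). */
struct demo_gcm_key { union demo_block16 h[1 << DEMO_TABLE_BITS]; };

/* Byte offset of table element i from the start of the key struct. */
static size_t
demo_elem_offset(size_t i)
{
  struct demo_gcm_key key;
  return (size_t) ((const char *) &key.h[i] - (const char *) &key);
}
```

With this layout, consecutive elements are 16 bytes apart, and a byte offset of 0x100 corresponds to element index 16 rather than to the next element. That only restates the sizes above; it does not explain what the constant is meant to do in the patch.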
I haven't yet had the time to read the code carefully.
Regards, /Niels