---------- Forwarded message ---------
From: Maamoun TK maamoun.tk@googlemail.com
Date: Thu, Nov 12, 2020 at 7:42 PM
Subject: Re: [PowerPC] GCM optimization
To: Niels Möller nisse@lysator.liu.se
On Thu, Nov 12, 2020 at 6:40 PM Niels Möller nisse@lysator.liu.se wrote:
I gave it a test run on gcc112 in the gcc compile farm, and speedup of gcm update seems to be 26 times(!) compared to the C version.
That's reasonable; I got a similar speedup on POWER instances that are more stable than the gcc compile farm.
Where would that documentation be published? In the Nettle manual, as some IBM white paper, or as a more-or-less academic paper, e.g., on arxiv? I will not be able to spend much time on writing, but I'd be happy to review.
I'll start writing the paper once I get more details from IBM. Similar to the Intel documents, it will be academic and practical at the same time: I'll dive into the finite field equations to demonstrate how we get there, and I'll add a practical example to clarify why this method is preferable, in addition to its expected speedup. My intention is that other crypto libraries can take advantage of this document, or that it can be a starting point for further improvements to the algorithm, so I'm checking whether IBM would publish or approve such a document, as Intel does.
I have a sketch of ARM Neon code doing the equivalent of two vpmsumd, with reasonable parallelism. Quite a lot of instructions needed.
If you don't have much time, you can send it here and I'll continue from that point. I'm planning to compare the new method with the usual method, with and without the Karatsuba algorithm.
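For readers following the Karatsuba comparison mentioned here, a minimal sketch in portable C of a 128x128-bit carry-less multiplication done the schoolbook way (four 64-bit multiplies) and the Karatsuba way (three). This is not the Nettle or PowerPC code discussed in the thread: `clmul64` is a hypothetical bit-by-bit stand-in for a hardware carry-less multiply, all type and function names are illustrative, and GCM's reflected bit order and the final modular reduction are left out.

```c
#include <assert.h>
#include <stdint.h>

/* 128-bit polynomial over GF(2), split into two 64-bit halves (hypothetical type). */
typedef struct { uint64_t lo, hi; } u128;

/* 256-bit product as four 64-bit limbs, least significant first (hypothetical type). */
typedef struct { uint64_t w[4]; } u256;

/* Bit-by-bit carry-less 64x64 -> 128 multiply; stands in for one hardware multiply. */
u128 clmul64(uint64_t a, uint64_t b) {
  u128 r = {0, 0};
  for (int i = 0; i < 64; i++) {
    if ((b >> i) & 1) {
      r.lo ^= a << i;
      if (i > 0)
        r.hi ^= a >> (64 - i);
    }
  }
  return r;
}

/* Schoolbook: four 64-bit multiplies (ll, hh, lh, hl). */
u256 clmul128_schoolbook(u128 a, u128 b) {
  u128 ll = clmul64(a.lo, b.lo);
  u128 hh = clmul64(a.hi, b.hi);
  u128 lh = clmul64(a.lo, b.hi);
  u128 hl = clmul64(a.hi, b.lo);
  u256 r;
  r.w[0] = ll.lo;
  r.w[1] = ll.hi ^ lh.lo ^ hl.lo;
  r.w[2] = hh.lo ^ lh.hi ^ hl.hi;
  r.w[3] = hh.hi;
  return r;
}

/* Karatsuba: three multiplies. Over GF(2),
   (a.lo ^ a.hi) * (b.lo ^ b.hi) = ll ^ lh ^ hl ^ hh,
   so the middle term lh ^ hl equals mm ^ ll ^ hh. */
u256 clmul128_karatsuba(u128 a, u128 b) {
  u128 ll = clmul64(a.lo, b.lo);
  u128 hh = clmul64(a.hi, b.hi);
  u128 mm = clmul64(a.lo ^ a.hi, b.lo ^ b.hi);
  uint64_t mid_lo = mm.lo ^ ll.lo ^ hh.lo;
  uint64_t mid_hi = mm.hi ^ ll.hi ^ hh.hi;
  u256 r;
  r.w[0] = ll.lo;
  r.w[1] = ll.hi ^ mid_lo;
  r.w[2] = hh.lo ^ mid_hi;
  r.w[3] = hh.hi;
  return r;
}

/* Returns 1 when both methods produce the same 256-bit product. */
int clmul128_methods_agree(u128 a, u128 b) {
  u256 s = clmul128_schoolbook(a, b);
  u256 k = clmul128_karatsuba(a, b);
  return s.w[0] == k.w[0] && s.w[1] == k.w[1] &&
         s.w[2] == k.w[2] && s.w[3] == k.w[3];
}

/* Small sanity checks on the 64-bit building block. */
void clmul_selfcheck(void) {
  u128 p = clmul64(3, 3);            /* (x + 1)^2 = x^2 + 1 */
  assert(p.lo == 5 && p.hi == 0);
  u128 q = clmul64(1ULL << 63, 2);   /* x^63 * x = x^64 */
  assert(q.lo == 0 && q.hi == 1);
}
```

Karatsuba trades one multiply for a few extra XORs, so whether it wins depends on how expensive the multiply instruction is relative to XOR on the target processor.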
+C Alignment of gcm_key table elements, which is declared in gcm.h
+define(`TableElemAlign', `0x100')
I still find this large constant puzzling. If I try
  struct gcm_key key;
  printf("sizeof (key): %zu, sizeof(key.h[0]): %zu\n", sizeof(key), sizeof(key.h[0]));
(I added it to the start of test_main in gcm-test.c) and run on the gcc112 machine, I get
sizeof (key): 4096, sizeof(key.h[0]): 16
Which is what I'd expect, with elements of size 16 bytes, not 256 bytes.
I haven't yet had the time to read the code carefully.
You see, the alignment of each element is 0x100 (256). The table has 16 elements, and you got a table size of 4096, which is reasonable because 16 * 256 = 4096.
regards, Mamone
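The two readings of the 4096-byte figure above can be told apart by looking at element size and alignment separately. A minimal sketch with hypothetical types (these are not Nettle's actual `struct gcm_key` declarations): in C, `sizeof` of a type is always a multiple of its alignment, so an element type aligned to 256 bytes is padded to 256 bytes, while a plain 16-byte element stays at 16 bytes. Both layouts can add up to 4096 bytes in total, so only `sizeof(key.h[0])` distinguishes 256 entries of 16 bytes from 16 entries of 256 bytes.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical plain 16-byte table element. */
union block16 {
  uint8_t  b[16];
  uint64_t u64[2];
};

/* Hypothetical element whose type itself demands 256-byte alignment. */
struct aligned_block {
  alignas(256) uint8_t b[16];
};

/* The plain element is 16 bytes; the aligned one is padded up to its alignment. */
static_assert(sizeof(union block16) == 16, "plain element is 16 bytes");
static_assert(sizeof(struct aligned_block) == 256, "aligned element is padded to 256 bytes");

/* Total size of a table of n plain elements. */
size_t table_bytes_plain(size_t n) { return n * sizeof(union block16); }

/* Total size of a table of n 256-byte-aligned elements. */
size_t table_bytes_aligned(size_t n) { return n * sizeof(struct aligned_block); }
```

With these definitions, `table_bytes_plain(256)` and `table_bytes_aligned(16)` both give 4096, which is why the total size alone does not settle the question.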
On Fri, Nov 20, 2020 at 3:39 PM Maamoun TK maamoun.tk@googlemail.com wrote:
I'll start writing the paper once I get more details from IBM. Similar to the Intel documents, it will be academic and practical at the same time: I'll dive into the finite field equations to demonstrate how we get there, and I'll add a practical example to clarify why this method is preferable, in addition to its expected speedup. My intention is that other crypto libraries can take advantage of this document, or that it can be a starting point for further improvements to the algorithm, so I'm checking whether IBM would publish or approve such a document, as Intel does.
You might want to ping Steven Munroe for feedback. He's an IBM old-timer who usually helps with implementations and technical editing. He has amazing knowledge of the POWER chips. He also has a GitHub with some nice POWER libraries. He has been CC'd.
Munroe also helped with https://github.com/noloader/POWER8-crypto/blob/master/power8-crypto.pdf. We wrote it because IBM documentation sucks. As far as I know there is no IBM documentation (except a blog post that explains some of Andy Polyakov's OpenSSL code).
If you want to add information to the power8-crypto.pdf doc, then we can make you an author and collaborator for check-ins. As a collaborator, you won't have to waste time with patches and asking permission. Just edit the doc like a wiki page.
The power8-crypto.pdf is written in DocBook. The DocBook setup for Fedora and Ubuntu is in the document https://github.com/noloader/POWER8-crypto/blob/master/docbook.pdf.
Jeff
Thank you, I'll take a look at the document.
regards, Mamone
On Sun, Nov 22, 2020 at 5:18 PM Jeffrey Walton noloader@gmail.com wrote:
You might want to ping Steven Munroe for feedback. He's an IBM old-timer who usually helps with implementations and technical editing. He has amazing knowledge of the POWER chips. He also has a GitHub with some nice POWER libraries. He has been CC'd.
Munroe also helped with https://github.com/noloader/POWER8-crypto/blob/master/power8-crypto.pdf. We wrote it because IBM documentation sucks. As far as I know there is no IBM documentation (except a blog post that explains some of Andy Polyakov's OpenSSL code).
If you want to add information to the power8-crypto.pdf doc, then we can make you an author and collaborator for check-ins. As a collaborator, you won't have to waste time with patches and asking permission. Just edit the doc like a wiki page.
The power8-crypto.pdf is written in DocBook. The DocBook setup for Fedora and Ubuntu is in the document https://github.com/noloader/POWER8-crypto/blob/master/docbook.pdf.
Jeff
On Thu, Nov 12, 2020 at 07:45:14PM +0200, Maamoun TK wrote:
I'll start writing the paper once I get more details from IBM. Similar to the Intel documents, it will be academic and practical at the same time: I'll dive into the finite field equations to demonstrate how we get there, and I'll add a practical example to clarify why this method is preferable, in addition to its expected speedup. My intention is that other crypto libraries can take advantage of this document, or that it can be a starting point for further improvements to the algorithm, so I'm checking whether IBM would publish or approve such a document, as Intel does.
Hi Mamone,
What do you need from the IBM side? I may be able to help. We'd definitely like to support you and Niels in publishing your results.
_______________________________________________
nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
Hi George,
I'll start writing a white paper called "Optimizing Galois-Counter-Mode on PowerPC Architecture Processors". Once I finish the first draft, I'll send it to Niels for review.
What do you need from the IBM side? I may be able to help. We'd definitely like to support you and Niels in publishing your results.
I have a couple of questions: Should we send the paper to you once we have made sure everything is ready for publishing?
Can you participate as a supervisor to make decisions such as whether to mention the ARM implementation results of the research, or whether to compare with the Intel white papers that were used for the x86, ARM, and PowerPC GCM implementations in the OpenSSL library?
regards, Mamone
On Tue, Dec 01, 2020 at 07:55:05PM +0200, Maamoun TK wrote:
Hi George,
I'll start writing a white paper called "Optimizing Galois-Counter-Mode on PowerPC Architecture Processors". Once I finish the first draft, I'll send it to Niels for review.
What do you need from the IBM side? I may be able to help. We'd definitely like to support you and Niels in publishing your results.
I have a couple of questions: Should we send the paper to you once we have made sure everything is ready for publishing?
Yes, we'd love to review it. I may be able to get someone from IBM Research interested in participating as well to hopefully get deeper crypto and math perspectives.
Can you participate as a supervisor to make decisions such as whether to mention the ARM implementation results of the research, or whether to compare with the Intel white papers that were used for the x86, ARM, and PowerPC GCM implementations in the OpenSSL library?
Gladly.