I created merge requests with Poly1305 improvements for the arm64, powerpc64, and s390x architectures, using two-way interleaving. https://git.lysator.liu.se/nettle/nettle/-/merge_requests/38 https://git.lysator.liu.se/nettle/nettle/-/merge_requests/39 https://git.lysator.liu.se/nettle/nettle/-/merge_requests/41 The patches yield a 41.88% speedup for arm64, a 142.95% speedup for powerpc64, and a 382.65% speedup for s390x.
OpenSSL is still ahead in terms of performance since it uses 4-way interleaving, or maybe more! Increasing the number of interleaving ways beyond two has nothing to do with parallelism, since the execution units are already saturated with two ways on all three architectures. The reason for the performance improvement is that the number of times the reduction procedure executes is cut in half with 4-way interleaving, since the products of multiplying the state parts by the key can be combined before the reduction phase. Let me know if you are interested in doing that in Nettle!
It would be nice if the arm64 patch could be tested in big-endian mode, since I don't have access to any big-endian variant for testing.
regards, Mamone
Maamoun TK maamoun.tk@googlemail.com writes:
The patches yield a 41.88% speedup for arm64, a 142.95% speedup for powerpc64, and a 382.65% speedup for s390x.
OpenSSL is still ahead in terms of performance since it uses 4-way interleaving, or maybe more! Increasing the number of interleaving ways beyond two has nothing to do with parallelism, since the execution units are already saturated with two ways on all three architectures. The reason for the performance improvement is that the number of times the reduction procedure executes is cut in half with 4-way interleaving, since the products of multiplying the state parts by the key can be combined before the reduction phase. Let me know if you are interested in doing that in Nettle!
Interesting. I haven't paid much attention to the poly1305 implementation since it was added back in 2013. The C implementation doesn't try to use wider multiplication than 32x32 --> 64, which is poor for 64-bit platforms. Maybe we could use unsigned __int128 if we can write a configure test to check if it is available and likely to be efficient?
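A configure check could boil down to compiling (and, to judge efficiency, perhaps running) a tiny probe program. The following is only a sketch of the kind of test meant, not actual Nettle configure machinery:

#include <stdint.h>

/* Hypothetical probe: compiles only if the compiler provides
   unsigned __int128 with a full 64x64 -> 128 multiply. */
static unsigned __int128
umul128(uint64_t a, uint64_t b)
{
  return (unsigned __int128) a * b;
}

int
main(void)
{
  /* 2^63 * 16 = 2^67, so the high half must be 2^3. */
  unsigned __int128 p = umul128((uint64_t) 1 << 63, 16);
  return (uint64_t) (p >> 64) == 8 ? 0 : 1;
}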
For most efficient interleaving, I take it one should precompute some powers of the key, similar to how it's done in the recent gcm code?
It would be nice if the arm64 patch could be tested in big-endian mode, since I don't have access to any big-endian variant for testing.
Merged this one too on a branch for ci testing.
Regards, /Niels
On Wed, Jan 19, 2022 at 10:06 PM Niels Möller nisse@lysator.liu.se wrote:
Maamoun TK maamoun.tk@googlemail.com writes:
The patches yield a 41.88% speedup for arm64, a 142.95% speedup for powerpc64, and a 382.65% speedup for s390x.
OpenSSL is still ahead in terms of performance since it uses 4-way interleaving, or maybe more! Increasing the number of interleaving ways beyond two has nothing to do with parallelism, since the execution units are already saturated with two ways on all three architectures. The reason for the performance improvement is that the number of times the reduction procedure executes is cut in half with 4-way interleaving, since the products of multiplying the state parts by the key can be combined before the reduction phase. Let me know if you are interested in doing that in Nettle!
Interesting. I haven't paid much attention to the poly1305 implementation since it was added back in 2013. The C implementation doesn't try to use wider multiplication than 32x32 --> 64, which is poor for 64-bit platforms. Maybe we could use unsigned __int128 if we can write a configure test to check if it is available and likely to be efficient?
Wider multiplication would improve performance when using 64-bit general registers, but as is the case for the current SIMD implementations, radix 2^26 fits well there.
For most efficient interleaving, I take it one should precompute some powers of the key, similar to how it's done in the recent gcm code?
Since the block-iteration loop is moved inside the assembly implementation, computing one multiple of the key in the function prologue should be OK.
I forgot to mention that the reduction phase uses the tips from the Reduction section of https://cryptojedi.org/papers/neoncrypto-20120320.pdf for the arm64 and s390x implementations, while the carry chain h0 -> h1 -> h2 -> h3 -> h4 -> h0 -> h1 still manages to achieve slightly higher performance than the two independent carry paths on the powerpc64 arch.
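For reference, a scalar C sketch of the two-independent-chain carry idea in radix 26 (the function and limb names are made up, and the exact limb bounds are not re-derived here, so treat it as illustrative only; the real implementations do this on vectors):

#include <stdint.h>

/* h[0..4] hold unreduced radix-26 limbs (each well below 2^64).
   The carries are arranged as two chains that can run in parallel,
   instead of one long serial chain h0->h1->h2->h3->h4->h0->h1. */
static void
carry_2chain(uint64_t h[5])
{
  uint64_t c0, c1;

  c0 = h[0] >> 26; h[0] &= 0x3ffffff; h[1] += c0;      /* h0 -> h1 */
  c1 = h[3] >> 26; h[3] &= 0x3ffffff; h[4] += c1;      /* h3 -> h4 */

  c0 = h[1] >> 26; h[1] &= 0x3ffffff; h[2] += c0;      /* h1 -> h2 */
  c1 = h[4] >> 26; h[4] &= 0x3ffffff; h[0] += 5 * c1;  /* h4 -> h0, 2^130 = 5 mod p */

  c0 = h[2] >> 26; h[2] &= 0x3ffffff; h[3] += c0;      /* h2 -> h3 */
  c1 = h[0] >> 26; h[0] &= 0x3ffffff; h[1] += c1;      /* h0 -> h1 again */

  c0 = h[3] >> 26; h[3] &= 0x3ffffff; h[4] += c0;      /* h3 -> h4 again */
}

Each carry step moves value between adjacent limbs (or wraps h4 into h0 with a factor of 5), so the value mod 2^130 - 5 is preserved throughout.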
regards, Mamone
It would be nice if the arm64 patch could be tested in big-endian mode, since I don't have access to any big-endian variant for testing.
Merged this one too on a branch for ci testing.
Regards, /Niels
-- Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677. Internet email is subject to wholesale government surveillance.
Maamoun TK maamoun.tk@googlemail.com writes:
Wider multiplication would improve performance when using 64-bit general registers, but as is the case for the current SIMD implementations, radix 2^26 fits well there.
If multiply throughput is the bottleneck, it makes sense to do as much work as possible per multiply. So I don't think I understand the benefits of interleaving, can you explain?
Let's consider the 64-bit case, since that's less writing. B = 2^64 as usual. Then the state is
H = h_2 B^2 + h_1 B + h_0
(with h_2 rather small, depending on how far we normalize for each block; let's assume at most 3 bits, or maybe even h_2 <= 4).
R = r_1 B + r_0
By the spec, the high 4 bits of both r_0 and r_1, and the low 2 bits of r_1, are zero, which makes the multiplication R H (mod p) particularly nice.
We get
R H = r_0 h_0 + B (r_1 h_0 + r_0 h_1) + B^2 (r_1 h_1 + r_0 h_2) + B^3 r_1 h_2
But then B^2 = 5/4 (mod p), and hence B^2 r_1 = 5 r_1 / 4 (mod p), where the "/ 4" is just shifting out the two low zero bits. So let r_1' = 5 r_1 / 4,
R H = r_0 h_0 + r_1' h_1 + B (r_1 h_0 + r_0 h_1 + r_1' h_2 + B r_0 h_2)
These are 4 long multiplications (64x64 --> 128) and two short ones (64x64 --> 64) for the products involving h_2. (The 32-bit version would be 16 long multiplications and 4 short.)
From the zero high bits, we also get bounds on these terms,
f_0 = r_0 h_0 + r_1' h_1 < 2^124 + 5*2^122 = 9*2^122
f_1 = r_1 h_0 + r_0 h_1 + r_1' h_2 + B r_0 h_2 < 2^125 + 5*2^61 + 2^127
So these two chains can be added together as 128-bit quantities with no overflow, in any order; there's plenty of parallelism. E.g., power vmsumudm might be useful.
For final folding, we need to split f_1 into its top 62 bits and low 66 bits, multiply the top part by 5 (which fits in 64 bits, given the bound above), and add it into f_0, which still fits in 128 bits.
And then take the top 64 bits of f_0 and add them into the low part of f_1 (the result fits in about 66 bits).
The current C implementation uses radix 26, and 25 multiplies (32x32 --> 64) per block. And quite a lot of shifts. A radix 32 variant analogous to the above would need 16 long multiplies and 4 short. I'd expect that to be faster on most machines, but I'd have to try that out.
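To make that concrete, here is a rough, untested C sketch of one block in the radix-64 scheme sketched above, assuming an unsigned __int128 type; the struct and function names are made up, key clamping/setup and the final digest are omitted, and the message load is simplified to a little-endian host:

#include <stdint.h>
#include <string.h>

typedef unsigned __int128 u128;

struct poly_state {
  uint64_t h0, h1, h2;   /* state, h2 only a few bits */
  uint64_t r0, r1;       /* clamped key */
  uint64_t r1p;          /* precomputed r_1' = 5*(r1 >> 2) */
};

/* Process one 16-byte block; m128 is the 2^128 padding bit (1 for full blocks). */
static void
poly_block(struct poly_state *st, const uint8_t m[16], uint64_t m128)
{
  uint64_t t0, t1, t2, m0, m1;
  u128 f0, f1, c;

  memcpy(&m0, m, 8);       /* little-endian host assumed, for brevity */
  memcpy(&m1, m + 8, 8);

  /* t = h + m, a roughly 130-bit value in three words */
  c  = (u128) st->h0 + m0;            t0 = (uint64_t) c;
  c  = (c >> 64) + st->h1 + m1;       t1 = (uint64_t) c;
  t2 = (uint64_t) (c >> 64) + st->h2 + m128;

  /* f0 + B f1 = R * t (mod p), using B^2 = 5/4 (mod p) */
  f0 = (u128) st->r0 * t0 + (u128) st->r1p * t1;
  f1 = (u128) st->r1 * t0 + (u128) st->r0 * t1
       + st->r1p * t2 + ((u128) (st->r0 * t2) << 64);

  /* Fold: the bits of B f1 with weight >= 2^130 wrap around times 5 */
  f0 += (u128) (uint64_t) (f1 >> 66) * 5;
  st->h0 = (uint64_t) f0;
  c = (f0 >> 64) + (f1 & (((u128) 1 << 66) - 1));
  st->h1 = (uint64_t) c;
  st->h2 = (uint64_t) (c >> 64);   /* small again, a few bits */
}

The four long products are the (u128) multiplies feeding f0 and f1; the two products involving t2 stay within 64 bits, and the fold needs only shifts, masks, adds and one multiply by 5.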
In contrast, trying to use a similar scheme for multiplying by (r^2 (mod p)), as needed for an interleaved version, seems more expensive. There are several contributions to the cost:
* First, the accumulation of products by power of B needs to take carries into account, as the result can exceed 2^128, so one would need something closer to general schoolbook multiplication.
* Second, since r^2 (mod p) may exceed 2^128, we need three words rather than two, so three more short multiplications to add in.
* Third, we can't pre-divide key words by 4, since low bits are no longer guaranteed to be zero. This gives more expensive reduction, with more multiplies by 5.
The first two points make a smaller radix more attractive; if we need three words for both factors, we can distribute the bits to ensure that some of the most significant bits are zero.
Since the block-iteration loop is moved inside the assembly implementation, computing one multiple of the key in the function prologue should be OK.
For large messages, that's fine, but may add a significant cost for messages of just two blocks.
Regards, /Niels
nisse@lysator.liu.se (Niels Möller) writes:
The current C implementation uses radix 26, and 25 multiplies (32x32 --> 64) per block. And quite a lot of shifts. A radix 32 variant analogous to the above would need 16 long multiplies and 4 short. I'd expect that to be faster on most machines, but I'd have to try that out.
I've tried this out, see attached file. It has an #if 0/1 to choose between radix 64 (depending on the non-standard __int128 type for accumulated products) and radix 32 (portable C).
This is the speed I get for C implementations of poly1305_update on my x86_64 laptop:
* Radix 26: 1.2 GByte/s (old code)
* Radix 32: 1.3 GByte/s
* Radix 64: 2.2 GByte/s
It would be interesting with benchmarks on actual 32-bit hardware, 32-bit ARM likely being the most relevant arch.
For comparison, the current x86_64 asm version: 2.5 GByte/s.
If I understood correctly, the suggestion to use radix 26 in djb's original paper was motivated by a high-speed implementation using floating point arithmetic (possibly in combination with SIMD), where the product of two 26-bit integers can be represented exactly in an IEEE double (but it gets a bit subtle if we want to accumulate several products). I haven't really looked into implementing poly1305 with either floating point or SIMD.
To improve test coverage, I've also extended poly1305 tests with tests on random inputs, with results compared to a reference implementation based on gmp/mini-gmp. I intend to merge those testing changes soon. See https://gitlab.com/gnutls/nettle/-/commit/b48217c8058676c8cd2fd12cdeba457755....
Unfortunately, the http interface of the main git repo at Lysator is inaccessible at the moment due to an expired certificate; should be fixed in a day or two.
Regards, /Niels
On Sun, Jan 23, 2022 at 9:10 PM Niels Möller nisse@lysator.liu.se wrote:
nisse@lysator.liu.se (Niels Möller) writes:
The current C implementation uses radix 26, and 25 multiplies (32x32 --> 64) per block. And quite a lot of shifts. A radix 32 variant analogous to the above would need 16 long multiplies and 4 short. I'd expect that to be faster on most machines, but I'd have to try that out.
I've tried this out, see attached file. It has an #if 0/1 to choose between radix 64 (depending on the non-standard __int128 type for accumulated products) and radix 32 (portable C).
This is the speed I get for C implementations of poly1305_update on my x86_64 laptop:
Radix 26: 1.2 GByte/s (old code)
Radix 32: 1.3 GByte/s
Radix 64: 2.2 GByte/s
It would be interesting with benchmarks on actual 32-bit hardware, 32-bit ARM likely being the most relevant arch.
For comparison, the current x86_64 asm version: 2.5 GByte/s.
I made a performance test of this patch on the available architectures I have access to.
Arm64 (gcc117 gfarm):
- Radix 26: 0.65 GByte/s
- Radix 26 (2-way interleaved): 0.92 GByte/s
- Radix 32: 0.55 GByte/s
- Radix 64: 0.58 GByte/s

POWER9:
- Radix 26: 0.47 GByte/s
- Radix 26 (2-way interleaved): 1.15 GByte/s
- Radix 32: 0.52 GByte/s
- Radix 64: 0.58 GByte/s

Z15:
- Radix 26: 0.65 GByte/s
- Radix 26 (2-way interleaved): 3.17 GByte/s
- Radix 32: 0.82 GByte/s
- Radix 64: 1.22 GByte/s
Apparently, the higher-radix versions improve performance on x86_64, powerpc, and s390x, but this is not the case for the arm64 arch, where performance takes a slight hit.
I tried to compile the new code with the -m32 flag on x86_64 but I got "poly1305-internal.c:46:18: error: ‘__int128’ is not supported on this target". Unfortunately, I don't have access to 32-bit ARM either.
Also, I've disassembled the radix-64 update function, and on none of the architectures does the compiler make use of SIMD (including x86_64, where it doesn't use the XMM registers even though they are standard for this arch; I don't know whether gcc can do that when compiling C, but I'm aware MSVC takes advantage of that standardization for further optimization of compiled C code).
I'm trying to implement radix 64 using SIMD to see if we can get any performance boost; I'll post the results once I'm done.
regards, Mamone
If I understood correctly, the suggestion to use radix 26 in djb's original paper was motivated by a high-speed implementation using floating point arithmetic (possibly in combination with SIMD), where the product of two 26-bit integers can be represented exactly in an IEEE double (but it gets a bit subtle if we want to accumulate several products). I haven't really looked into implementing poly1305 with either floating point or SIMD.
To improve test coverage, I've also extended poly1305 tests with tests on random inputs, with results compared to a reference implementation based on gmp/mini-gmp. I intend to merge those testing changes soon. See
https://gitlab.com/gnutls/nettle/-/commit/b48217c8058676c8cd2fd12cdeba457755... .
Unfortunately, the http interface of the main git repo at Lysator is inaccessible at the moment due to an expired certificate; should be fixed in a day or two.
Regards, /Niels
-- Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677. Internet email is subject to wholesale government surveillance.
On Sun, Jan 23, 2022 at 4:41 PM Maamoun TK maamoun.tk@googlemail.com wrote:
On Sun, Jan 23, 2022 at 9:10 PM Niels Möller nisse@lysator.liu.se wrote:
nisse@lysator.liu.se (Niels Möller) writes:
The current C implementation uses radix 26, and 25 multiplies (32x32 --> 64) per block. And quite a lot of shifts. A radix 32 variant analogous to the above would need 16 long multiplies and 4 short. I'd expect that to be faster on most machines, but I'd have to try that out.
I've tried this out, see attached file. It has an #if 0/1 to choose between radix 64 (depending on the non-standard __int128 type for accumulated products) and radix 32 (portable C).
This is the speed I get for C implementations of poly1305_update on my x86_64 laptop:
Radix 26: 1.2 GByte/s (old code)
Radix 32: 1.3 GByte/s
Radix 64: 2.2 GByte/s
It would be interesting with benchmarks on actual 32-bit hardware, 32-bit ARM likely being the most relevant arch.
For comparison, the current x86_64 asm version: 2.5 GByte/s.
I made a performance test of this patch on the available architectures I have access to.
Arm64 (gcc117 gfarm):
- Radix 26: 0.65 GByte/s
- Radix 26 (2-way interleaved): 0.92 GByte/s
- Radix 32: 0.55 GByte/s
- Radix 64: 0.58 GByte/s
POWER9:
- Radix 26: 0.47 GByte/s
- Radix 26 (2-way interleaved): 1.15 GByte/s
- Radix 32: 0.52 GByte/s
- Radix 64: 0.58 GByte/s
Z15:
- Radix 26: 0.65 GByte/s
- Radix 26 (2-way interleaved): 3.17 GByte/s
- Radix 32: 0.82 GByte/s
- Radix 64: 1.22 GByte/s
Apparently, the higher-radix versions improve performance on x86_64, powerpc, and s390x, but this is not the case for the arm64 arch, where performance takes a slight hit.
That might be an artifact of the specific ARM processor microarchitecture or the memory subsystem of the ARM system, not inherent to the Arm AArch64 architecture and ISA.
- David
I tried to compile the new code with the -m32 flag on x86_64 but I got "poly1305-internal.c:46:18: error: ‘__int128’ is not supported on this target". Unfortunately, I don't have access to 32-bit ARM either.
Also, I've disassembled the radix-64 update function, and on none of the architectures does the compiler make use of SIMD (including x86_64, where it doesn't use the XMM registers even though they are standard for this arch; I don't know whether gcc can do that when compiling C, but I'm aware MSVC takes advantage of that standardization for further optimization of compiled C code).
I'm trying to implement radix 64 using SIMD to see if we can get any performance boost; I'll post the results once I'm done.
regards, Mamone
If I understood correctly, the suggestion to use radix 26 in djb's original paper was motivated by a high-speed implementation using floating point arithmetic (possibly in combination with SIMD), where the product of two 26-bit integers can be represented exactly in an IEEE double (but it gets a bit subtle if we want to accumulate several products). I haven't really looked into implementing poly1305 with either floating point or SIMD.
To improve test coverage, I've also extended poly1305 tests with tests on random inputs, with results compared to a reference implementation based on gmp/mini-gmp. I intend to merge those testing changes soon. See
https://gitlab.com/gnutls/nettle/-/commit/b48217c8058676c8cd2fd12cdeba457755... .
Unfortunately, the http interface of the main git repo at Lysator is inaccessible at the moment due to an expired certificate; should be fixed in a day or two.
Regards, /Niels
-- Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677. Internet email is subject to wholesale government surveillance.
On Mon, Jan 24, 2022 at 12:58 AM David Edelsohn dje.gcc@gmail.com wrote:
On Sun, Jan 23, 2022 at 4:41 PM Maamoun TK maamoun.tk@googlemail.com wrote:
On Sun, Jan 23, 2022 at 9:10 PM Niels Möller nisse@lysator.liu.se wrote:
nisse@lysator.liu.se (Niels Möller) writes:
The current C implementation uses radix 26, and 25 multiplies (32x32 --> 64) per block. And quite a lot of shifts. A radix 32 variant analogous to the above would need 16 long multiplies and 4 short. I'd expect that to be faster on most machines, but I'd have to try that out.
I've tried this out, see attached file. It has an #if 0/1 to choose between radix 64 (depending on the non-standard __int128 type for accumulated products) and radix 32 (portable C).
This is the speed I get for C implementations of poly1305_update on my x86_64 laptop:
Radix 26: 1.2 GByte/s (old code)
Radix 32: 1.3 GByte/s
Radix 64: 2.2 GByte/s
It would be interesting with benchmarks on actual 32-bit hardware, 32-bit ARM likely being the most relevant arch.
For comparison, the current x86_64 asm version: 2.5 GByte/s.
I made a performance test of this patch on the available architectures I have access to.
Arm64 (gcc117 gfarm):
- Radix 26: 0.65 GByte/s
- Radix 26 (2-way interleaved): 0.92 GByte/s
- Radix 32: 0.55 GByte/s
- Radix 64: 0.58 GByte/s
POWER9:
- Radix 26: 0.47 GByte/s
- Radix 26 (2-way interleaved): 1.15 GByte/s
- Radix 32: 0.52 GByte/s
- Radix 64: 0.58 GByte/s
Z15:
- Radix 26: 0.65 GByte/s
- Radix 26 (2-way interleaved): 3.17 GByte/s
- Radix 32: 0.82 GByte/s
- Radix 64: 1.22 GByte/s
Apparently, the higher-radix versions improve performance on x86_64, powerpc, and s390x, but this is not the case for the arm64 arch, where performance takes a slight hit.
That might be an artifact of the specific ARM processor microarchitecture or the memory subsystem of the ARM system, not inherent to the Arm AArch64 architecture and ISA.
Seems right. I've tested the enhanced versions of wider multiplication on other gfarm aarch64 instances and got different results.
On Mon, Jan 24, 2022 at 10:01 AM Niels Möller nisse@lysator.liu.se wrote:
Maamoun TK maamoun.tk@googlemail.com writes:
I made a performance test of this patch on the available architectures I have access to.
Arm64 (gcc117 gfarm):
- Radix 26: 0.65 GByte/s
- Radix 26 (2-way interleaved): 0.92 GByte/s
- Radix 32: 0.55 GByte/s
- Radix 64: 0.58 GByte/s
POWER9:
- Radix 26: 0.47 GByte/s
- Radix 26 (2-way interleaved): 1.15 GByte/s
- Radix 32: 0.52 GByte/s
- Radix 64: 0.58 GByte/s
Z15:
- Radix 26: 0.65 GByte/s
- Radix 26 (2-way interleaved): 3.17 GByte/s
- Radix 32: 0.82 GByte/s
- Radix 64: 1.22 GByte/s
Interesting. I'm a bit surprised the radix-64 doesn't perform better, in particular on arm64. (But I'm not yet familiar with arm64 multiply instructions).
It looks like wider multiplication achieves higher speed on a different aarch64 instance on gfarm. Here are the numbers on the gcc185 instance:
- Radix 26: 0.83 GByte/s
- Radix 26 (2-way interleaved): 0.70 GByte/s
- Radix 64 (latest version): 1.25 GByte/s
These numbers are a bit of a surprise too, since the 2-way interleaving is supposed to perform better than the old C version, as on the other architectures! Anyway, the benchmark numbers for powerpc and s390x were not taken from gfarm instances, so they are OK to rely on.
Numbers for 2-way interleaving are impressive, I'd like to understand how that works.
Vector 32-bit multiplication applies two multiply operations to the inputs and places the concatenated results in the destination vector. So the current simd/altivec implementations interleave two blocks horizontally over vector registers and execute both the multiplication and reduction phases on both blocks simultaneously. However, after each block iteration we would have to combine the two states by splitting the concatenated value and adding the two halves together; to avoid that overhead, one can multiply both state parts by r^2, except for the last two blocks, where the first part is multiplied by r^2 and the second by r. Let's consider a message of 4 blocks b0, b1, b2, b3; the hash computation follows the sequence

h = (h+b0) r^4 + b1 r^3 + b2 r^2 + b3 r

With the interleaved implementation this sequence is executed in two iterations.

First iteration: (h+b0) r^2 || b1 r^2
Second iteration: ((h+b0) r^2 + b2) r^2 || (b1 r^2 + b3) r
When getting out of the loop we combine the two state parts together so we get the same correct sequence of r powers for each block.
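As a toy illustration of that scheduling (ignoring the SIMD packing and the radix-26 representation, and using a small made-up modulus so plain 64-bit arithmetic is enough; all names here are invented), the two-lane evaluation gives the same result as the straightforward Horner loop:

#include <stdint.h>
#include <stdio.h>

#define P 1000003ULL   /* toy stand-in for 2^130 - 5; block values assumed < P */

static uint64_t mul(uint64_t a, uint64_t b) { return (a * b) % P; }

/* Sequential: h = (...((h+b0) r + b1) r + b2) r ... mod P */
static uint64_t
horner(uint64_t h, uint64_t r, const uint64_t *b, unsigned n)
{
  for (unsigned i = 0; i < n; i++)
    h = mul((h + b[i]) % P, r);
  return h;
}

/* Two lanes, combined only at the end; n must be even and >= 2. */
static uint64_t
horner_2way(uint64_t h, uint64_t r, const uint64_t *b, unsigned n)
{
  uint64_t r2 = mul(r, r);
  uint64_t acc0 = (h + b[0]) % P;   /* lane 0: blocks 0, 2, 4, ... */
  uint64_t acc1 = b[1] % P;         /* lane 1: blocks 1, 3, 5, ... */
  for (unsigned i = 2; i < n; i += 2)
    {
      acc0 = (mul(acc0, r2) + b[i]) % P;
      acc1 = (mul(acc1, r2) + b[i+1]) % P;
    }
  /* Last multiplies use r^2 for lane 0 and r for lane 1, then combine. */
  return (mul(acc0, r2) + mul(acc1, r)) % P;
}

int
main(void)
{
  uint64_t b[4] = { 11, 22, 33, 44 };
  uint64_t h = 7, r = 123457;
  printf("%llu %llu\n",
         (unsigned long long) horner(h, r, b, 4),
         (unsigned long long) horner_2way(h, r, b, 4));
  return 0;
}

Both functions print the same value, matching the sequence of r powers given above.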
Also, the two-independent-carry technique mentioned previously overlaps two carry chains with each other, including the long carry from h4 to h0, which offers an opportunity for a further boost.
It might be useful to derive the corresponding multiply throughput, i.e., the number of multiply operations (and with which multiply instruction) completed per cycle, as well as the total cycles per block.
I'm not sure I can depend on the gfarm machines for that purpose, as they seem to give me inconsistent performance results. I think I should look for alternatives to get more reliable numbers.
On Mon, Jan 24, 2022 at 10:56 PM Niels Möller nisse@lysator.liu.se wrote:
I've tried reworking the folding, to reduce latency. The idea is to let the most significant state word be close to a full word, rather than limited to <= 4 as in the previous version. When multiplying by r, split one of the multiplies to take out the low 2 bits. For the radix-64 version, that term is
B^2 t_2 * r0
Split t_2 as 4*hi + lo, then this can be reduced to
B^2 lo * r0 + hi * 5*r0
(Using the same old B^2 = 5/4 (mod p) in a slightly different way).
The 5*r0 fits in one word and can be precomputed, and then this multiplication goes in parallel with the other multiplies, with no multiply left in the final per-block folding. With this trick I get, on the same machine,
Radix 32: 1.65 GByte/s
Radix 64: 2.75 GByte/s, i.e., faster than current x86_64 asm version.
I haven't yet done a strict analysis of bounds on the state and temporaries, but I would expect that it works out with no possibility of overflow.
See attached file. To fit the precomputed 5*r0 in a nice way I had to rearrange the unions in struct poly1305_ctx a bit; I also attach the patch to do this. The size of the struct should be the same, so I think it can be done without any ABI bump.
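For illustration, an untested C sketch of the reworked per-block step with the precomputed s0 = 5*r0 and s1 = 5*(r1 >> 2); the names are made up and the bounds are not re-verified here. All the multiplies are independent of each other, and the final accumulation is just additions:

#include <stdint.h>
#include <string.h>

typedef unsigned __int128 u128;

/* Hypothetical context: h2 may now grow to roughly a full word;
   s0 = 5*r0 and s1 = 5*(r1 >> 2) are precomputed at key setup. */
struct poly_state_folded {
  uint64_t h0, h1, h2;
  uint64_t r0, r1, s0, s1;
};

static void
poly_block_folded(struct poly_state_folded *st, const uint8_t m[16], uint64_t m128)
{
  uint64_t t0, t1, t2, w2, m0, m1;
  u128 h, f, c;

  memcpy(&m0, m, 8);       /* little-endian host assumed, for brevity */
  memcpy(&m1, m + 8, 8);

  c  = (u128) st->h0 + m0;            t0 = (uint64_t) c;
  c  = (c >> 64) + st->h1 + m1;       t1 = (uint64_t) c;
  t2 = (uint64_t) (c >> 64) + st->h2 + m128;

  /* All multiplies independent: t2 is split as 4*(t2>>2) + (t2&3) */
  h  = (u128) st->r0 * t0 + (u128) st->s1 * t1 + (u128) st->s0 * (t2 >> 2);
  f  = (u128) st->r1 * t0 + (u128) st->r0 * t1 + (u128) st->s1 * t2;
  w2 = st->r0 * (t2 & 3);             /* short multiply, lands at weight 2^128 */

  /* Fold: new state = h + B*f + B^2*w2, no multiply on this path */
  st->h0 = (uint64_t) h;
  c = (h >> 64) + (uint64_t) f;
  st->h1 = (uint64_t) c;
  st->h2 = w2 + (uint64_t) (f >> 64) + (uint64_t) (c >> 64);
}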
Great! It performs better on all tested architectures. Apparently, AArch64 SIMD doesn't support 64x64 -> 128 vector multiplication, so I've implemented this version on powerpc using the power9-specific vmsumudm instruction. I got 0.62 GByte/s for the C version and 0.65 GByte/s for the assembly version; I'll attach the assembly implementation to this email.
regards, Mamone
Maamoun TK maamoun.tk@googlemail.com writes:
It looks like wider multiplication achieves higher speed on a different aarch64 instance on gfarm. Here are the numbers on the gcc185 instance:
- Radix 26: 0.83 GByte/s
- Radix 26 (2-way interleaved): 0.70 GByte/s
- Radix 64 (Latest version): 1.25 GByte/s
These numbers are a bit of a surprise too, since the 2-way interleaving is supposed to perform better than the old C version, as on the other architectures! Anyway, the benchmark numbers for powerpc and s390x were not taken from gfarm instances, so they are OK to rely on.
In the meantime, I pushed the latest C radix-32 version to a branch poly1305-radix32, and benchmarked on one of the 32-bit ARM boards that are part of the GMP test systems (the odxu4 system on https://gmplib.org/devel/testsystems, labeled Cortex-A15/A7). I got:
Radix 26 (old code): 326 MByte/s
Radix 32 (new code): 260 MByte/s

So radix-32 doesn't seem to be a good option there. I had a quick look at the generated assembly; besides the expected umull and umlal instructions doing the main multiply work, there's an awful lot of loads and stores to the stack, roughly half of the instructions. The compiler installed is gcc-5.4, rather old.
Vector 32-bit multiplication applies two multiply operations to the inputs and places the concatenated results in the destination vector. So the current simd/altivec implementations interleave two blocks horizontally over vector registers and execute both the multiplication and reduction phases on both blocks simultaneously. However, after each block iteration we would have to combine the two states by splitting the concatenated value and adding the two halves together; to avoid that overhead, one can multiply both state parts by r^2, except for the last two blocks, where the first part is multiplied by r^2 and the second by r. Let's consider a message of 4 blocks b0, b1, b2, b3; the hash computation follows the sequence h = (h+b0) r^4 + b1 r^3 + b2 r^2 + b3 r. With the interleaved implementation this sequence is executed in two iterations. First iteration: (h+b0) r^2 || b1 r^2. Second iteration: ((h+b0) r^2 + b2) r^2 || (b1 r^2 + b3) r.
When getting out of the loop we combine the two state parts together so we get the same correct sequence of r powers for each block.
Also, the two-independent-carry technique mentioned previously overlaps two carry chains with each other, including the long carry from h4 to h0, which offers an opportunity for a further boost.
Hmm, it seems that avoiding long carry chains is the main reason why radix 26 can be faster.
Great! It performs better on all tested architectures. Apparently, AArch64 SIMD doesn't support 64x64 -> 128 vector multiplication, so I've implemented this version on powerpc using the power9-specific vmsumudm instruction. I got 0.62 GByte/s for the C version and 0.65 GByte/s for the assembly version; I'll attach the assembly implementation to this email.
But you had 1.15 GByte/s for the 2-way interleaved version on this machine?
define(`FUNC_ALIGN', `5')
PROLOGUE(_nettle_poly1305_block)
    ld      H0, 32(CTX)
    ld      H1, 40(CTX)
    ld      H2, 48(CTX)
    ld      T0, 0(M)
    ld      T1, 8(M)

    addc    T0, T0, H0
    adde    T1, T1, H1
    adde    T2, M128, H2

    li      IDX, 16
    lxvd2x  VSR(R), 0, CTX
    lxvd2x  VSR(S), IDX, CTX

    li      RZ, 0
    vxor    ZERO, ZERO, ZERO
    vxor    F0, F0, F0
    vxor    F1, F1, F1
    vxor    TMP, TMP, TMP

    xxpermdi VSR(MU0), VSR(R), VSR(S), 0b01
    xxswapd VSR(MU1), VSR(R)

    mtvsrdd VSR(T), T0, T1
    mtvsrdd VSR(T10), 0, T2
    andi.   T2A, T2, 3
    mtvsrdd VSR(T11), 0, T2A
    srdi    T2A, T2, 2
    mtvsrdd VSR(T00), T2A, RZ
I don't get all of the setup, but perhaps it would be better to load input (T0, T1) and state (H0, H1, H2) directly into vector registers, and avoid move between regular registers and vectors.
For the R and S values, the key setup could store them in the right order so they don't have to be permuted after load.
    vmsumudm F0, T, MU0, F0
    vmsumudm F1, T, MU1, F1
    vmsumudm TMP, T11, MU1, TMP

    vmsumudm F0, T00, S, F0
    vmsumudm F1, T10, MU0, F1
This part is as neat as I had hoped! Is there some variant of the instructions that writes the result register without adding, to avoid the explicit clearing of F0 and F1? It may also be doable with one instruction less; the 5 instructions do 10 multiplies, but I think we use only 7, so the rest must somehow be zeroed or ignored.
    xxmrgld VSR(TMP), VSR(TMP), VSR(ZERO)
    li      IDX, 32
    xxswapd VSR(F0), VSR(F0)
    vadduqm F1, F1, TMP
    stxsdx  VSR(F0), IDX, CTX

    li      IDX, 40
    xxmrgld VSR(F0), VSR(ZERO), VSR(F0)
    vadduqm F1, F1, F0
    xxswapd VSR(F1), VSR(F1)
    stxvd2x VSR(F1), IDX, CTX
This looks a bit verbose, if what we need to do is just to add the high part of F0 to the low part of F1 (with carry into the high part of F1), and store the result?
Regards, /Niels
On Tue, Jan 25, 2022 at 10:24 PM Niels Möller nisse@lysator.liu.se wrote:
Maamoun TK maamoun.tk@googlemail.com writes:
It looks like wider multiplication achieves higher speed on a different aarch64 instance on gfarm. Here are the numbers on the gcc185 instance:
- Radix 26: 0.83 GByte/s
- Radix 26 (2-way interleaved): 0.70 GByte/s
- Radix 64 (Latest version): 1.25 GByte/s
These numbers are a bit of a surprise too, since the 2-way interleaving is supposed to perform better than the old C version, as on the other architectures! Anyway, the benchmark numbers for powerpc and s390x were not taken from gfarm instances, so they are OK to rely on.
In the meantime, I pushed the latest C radix-32 version to a branch poly1305-radix32, and benchmarked on one of the 32-bit ARM boards that are part of the GMP test systems (the odxu4 system on https://gmplib.org/devel/testsystems, labeled Cortex-A15/A7). I got:
Radix 26 (old code): 326 MByte/s
Radix 32 (new code): 260 MByte/s

So radix-32 doesn't seem to be a good option there. I had a quick look at the generated assembly; besides the expected umull and umlal instructions doing the main multiply work, there's an awful lot of loads and stores to the stack, roughly half of the instructions. The compiler installed is gcc-5.4, rather old.
Vector 32-bit multiplication applies two multiply operations to the inputs and places the concatenated results in the destination vector. So the current simd/altivec implementations interleave two blocks horizontally over vector registers and execute both the multiplication and reduction phases on both blocks simultaneously. However, after each block iteration we would have to combine the two states by splitting the concatenated value and adding the two halves together; to avoid that overhead, one can multiply both state parts by r^2, except for the last two blocks, where the first part is multiplied by r^2 and the second by r. Let's consider a message of 4 blocks b0, b1, b2, b3; the hash computation follows the sequence h = (h+b0) r^4 + b1 r^3 + b2 r^2 + b3 r. With the interleaved implementation this sequence is executed in two iterations. First iteration: (h+b0) r^2 || b1 r^2. Second iteration: ((h+b0) r^2 + b2) r^2 || (b1 r^2 + b3) r.

When getting out of the loop we combine the two state parts, so we get the same correct sequence of r powers for each block.

Also, the two-independent-carry technique mentioned previously overlaps two carry chains with each other, including the long carry from h4 to h0, which offers an opportunity for a further boost.
Hmm, it seems that avoiding long carry chains is the main reason why radix 26 can be faster.
Yes, with the sequential carry path the SIMD function would slightly beat the C function of radix 26.
Great! It performs better on all tested architectures. Apparently, AArch64 SIMD doesn't support 64x64 -> 128 vector multiplication, so I've implemented this version on powerpc using the power9-specific vmsumudm instruction. I got 0.62 GByte/s for the C version and 0.65 GByte/s for the assembly version; I'll attach the assembly implementation to this email.
But you had 1.15 GByte/s for the 2-way interleaved version on this machine?
Right, the 2-way interleaved version is more efficient than this one and supports POWER7+ processors.
define(`FUNC_ALIGN', `5')
PROLOGUE(_nettle_poly1305_block)
    ld      H0, 32(CTX)
    ld      H1, 40(CTX)
    ld      H2, 48(CTX)
    ld      T0, 0(M)
    ld      T1, 8(M)

    addc    T0, T0, H0
    adde    T1, T1, H1
    adde    T2, M128, H2

    li      IDX, 16
    lxvd2x  VSR(R), 0, CTX
    lxvd2x  VSR(S), IDX, CTX

    li      RZ, 0
    vxor    ZERO, ZERO, ZERO
    vxor    F0, F0, F0
    vxor    F1, F1, F1
    vxor    TMP, TMP, TMP

    xxpermdi VSR(MU0), VSR(R), VSR(S), 0b01
    xxswapd VSR(MU1), VSR(R)

    mtvsrdd VSR(T), T0, T1
    mtvsrdd VSR(T10), 0, T2
    andi.   T2A, T2, 3
    mtvsrdd VSR(T11), 0, T2A
    srdi    T2A, T2, 2
    mtvsrdd VSR(T00), T2A, RZ
I don't get all of the setup, but perhaps it would be better to load input (T0, T1) and state (H0, H1, H2) directly into vector registers, and avoid move between regular registers and vectors.
I was having difficulty using vector addition with carry, so I resorted to the general registers for that purpose. Also, the general AND and shift operations are easier to use than the vector ones, since the latter require setting up a vector register for the immediate value.
For the R and S values, the key setup could store them in the right order so they don't have to be permuted after load.
    vmsumudm F0, T, MU0, F0
    vmsumudm F1, T, MU1, F1
    vmsumudm TMP, T11, MU1, TMP

    vmsumudm F0, T00, S, F0
    vmsumudm F1, T10, MU0, F1
This part is as neat as I had hoped! Is there some variant of the instructions that writes the result register without adding, to avoid the explicit clearing of F0 and F1? It may also be doable with one instruction less; the 5 instructions do 10 multiplies, but I think we use only 7, so the rest must somehow be zeroed or ignored.
POWER10 adds 'vmuloud' and 'vmuleud' for single doubleword multiplies, which would fit well here. Now I realize that there is no need to clear the F0, F1, and TMP registers, since we can use the ZERO register as the fourth operand. However, the purpose of this implementation is to get an approximate measurement of the speedup compared to the C version, to make the case for adopting 2-way interleaving as the high-performance implementation.
    xxmrgld VSR(TMP), VSR(TMP), VSR(ZERO)
    li      IDX, 32
    xxswapd VSR(F0), VSR(F0)
    vadduqm F1, F1, TMP
    stxsdx  VSR(F0), IDX, CTX

    li      IDX, 40
    xxmrgld VSR(F0), VSR(ZERO), VSR(F0)
    vadduqm F1, F1, F0
    xxswapd VSR(F1), VSR(F1)
    stxvd2x VSR(F1), IDX, CTX
This looks a bit verbose, if what we need to do is just to add the high part of F0 to the low part of F1 (with carry into the high part of F1), and store the result?
I couldn't find a neat way to do that, so I stuck with the C version's approach, plus some vector-adjusting operations.
Regards, /Niels
-- Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677. Internet email is subject to wholesale government surveillance.
Maamoun TK maamoun.tk@googlemail.com writes:
I made a performance test of this patch on the available architectures I have access to.
Arm64 (gcc117 gfarm):
- Radix 26: 0.65 GByte/s
- Radix 26 (2-way interleaved): 0.92 GByte/s
- Radix 32: 0.55 GByte/s
- Radix 64: 0.58 GByte/s
POWER9:
- Radix 26: 0.47 GByte/s
- Radix 26 (2-way interleaved): 1.15 GByte/s
- Radix 32: 0.52 GByte/s
- Radix 64: 0.58 GByte/s
Z15:
- Radix 26: 0.65 GByte/s
- Radix 26 (2-way interleaved): 3.17 GByte/s
- Radix 32: 0.82 GByte/s
- Radix 64: 1.22 GByte/s
Interesting. I'm a bit surprised the radix-64 doesn't perform better, in particular on arm64. (But I'm not yet familiar with arm64 multiply instructions).
The numbers for 2-way interleaving are impressive; I'd like to understand how that works. It might be useful to derive the corresponding multiply throughput, i.e., the number of multiply operations (and with which multiply instruction) completed per cycle, as well as the total cycles per block.
It looks like the folding done per-block in the radix-64 code costs at least 5 or so cycles per block (since these operations are all dependent, and we also have the multiply by 5 in there, probably adding a few cycles more). Maybe at least the multiply can be postponed.
I tried to compile the new code with -m32 flag on x86_64 but I got "poly1305-internal.c:46:18: error: ‘__int128’ is not supported on this target".
That's expected, in two ways: I don't expect radix-64 to give any performance gain over radix-32 on any 32-bit archs. And I think __int128 is supported only on archs where it fits in two registers. If we start using __int128 we need a configure test for it, and then it actually makes things simpler, at least for this use case, if it stays unsupported on 32-bit archs where it shouldn't be used.
So to compile with -m32, the radix-64 code must be #if:ed out.
Also, I've disassembled the radix-64 update function, and on none of the architectures does the compiler make use of SIMD (including x86_64, where it doesn't use the XMM registers even though they are standard for this arch; I don't know whether gcc can do that when compiling C, but I'm aware MSVC takes advantage of that standardization for further optimization of compiled C code).
The radix-64 code really wants multiply instruction(s) for 64x64 --> 128, and I think that's not so common in SIMD instruction sets (but powerpc64 vmsumudm looks potentially useful?). Either as a single instruction, or as a pair of mulhigh/mullow instructions. And some not too complicated way to do a 128-bit add with proper carry propagation in the middle.
Arm32 neon does have 32x32 --> 64, which looks like a good fit for the radix-32 variant.
Regards, /Niels
nisse@lysator.liu.se (Niels Möller) writes:
This is the speed I get for C implementations of poly1305_update on my x86_64 laptop:
Radix 26: 1.2 GByte/s (old code)
Radix 32: 1.3 GByte/s
Radix 64: 2.2 GByte/s
It would be interesting with benchmarks on actual 32-bit hardware, 32-bit ARM likely being the most relevant arch.
For comparison, the current x86_64 asm version: 2.5 GByte/s.
I've tried reworking folding, to reduce latency. Idea is to let the most significant state word be close to a word, rather than limited to <= 4 as in the previous version. When multiplying by r, split one of the multiplies to take out the low 2 bits. For the radix 64 version, that term is
B^2 t_2 * r0
Split t_2 as 4*hi + lo, then this can be reduced to
B^2 lo * r0 + hi * 5*r0
(Using the same old B^2 = 5/4 (mod p) in a slightly different way).
The 5*r0 fits in one word and can be precomputed, and then this multiplication goes in parallel with the other multiplies, with no multiply left in the final per-block folding. With this trick I get, on the same machine,
Radix 32: 1.65 GByte/s
Radix 64: 2.75 GByte/s, i.e., faster than current x86_64 asm version.
I haven't yet done a strict analysis of bounds on the state and temporaries, but I would expect that it works out with no possibility of overflow.
See attached file. To fit the precomputed 5*r0 in a nice way I had to rearrange the unions in struct poly1305_ctx a bit, I also attach the patch to do this. Size of the struct should be the same, so I think it can be done without any abi bump.
Regards, /Niels
nisse@lysator.liu.se (Niels Möller) writes:
This is the speed I get for C implementations of poly1305_update on my x86_64 laptop:
Radix 26: 1.2 GByte/s (old code)
Radix 32: 1.3 GByte/s
Radix 64: 2.2 GByte/s
[...]
For comparison, the current x86_64 asm version: 2.5 GByte/s.
[...]
I've tried reworking folding, to reduce latency [...] With this trick I get on the same machine
Radix 32: 1.65 GByte/s
Radix 64: 2.75 GByte/s, i.e., faster than current x86_64 asm version.
And I've now tried the same method for the x86_64 implementation. See attached file + needed patch to asm.m4. This gives 2.9 GByte/s.
I'm not entirely sure the cycle numbers are accurate, with the clock frequency not being fixed. I think the machine runs benchmarks at 2.1 GHz, and then this corresponds to 11.5 cycles per block, 0.7 cycles per byte, 4 instructions per cycle, and 0.5 multiply instructions per cycle.
This laptop has an AMD zen2 processor, which should be capable of issuing four instructions per cycle and complete one multiply instruction per cycle (according to https://gmplib.org/~tege/x86-timing.pdf).
This seems to indicate that on this hardware, speed is not limited by multiplier throughput, instead, the bottleneck is instruction decoding/issuing, with max four instructions per cycle.
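For reference, the arithmetic behind those per-block figures (taking the 2.1 GHz estimate at face value): 2.9 GByte/s is about 2.9e9/16, roughly 180 million blocks per second, so 2.1e9 / 1.8e8 is roughly 11.5 cycles per block, or about 0.7 cycles per byte; at four instructions issued per cycle that corresponds to roughly 46 instructions retired per block.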
Regards, /Niels
C x86_64/poly1305-internal.asm
ifelse(` Copyright (C) 2013 Niels Möller
This file is part of GNU Nettle.
GNU Nettle is free software: you can redistribute it and/or modify it under the terms of either:
* the GNU Lesser General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version.
or
* the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
or both in parallel, as here.
GNU Nettle is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received copies of the GNU General Public License and the GNU Lesser General Public License along with this program. If not, see http://www.gnu.org/licenses/. ')
.file "poly1305-internal.asm"
C Registers mainly used by poly1305_block
define(`CTX', `%rdi') C First argument to all functions
define(`KEY', `%rsi')
define(`MASK',` %r8')

C _poly1305_set_key(struct poly1305_ctx *ctx, const uint8_t key[16])
    .text
    ALIGN(16)
PROLOGUE(_nettle_poly1305_set_key)
    W64_ENTRY(2,0)
    mov     $0x0ffffffc0fffffff, MASK
    mov     (KEY), %rax
    and     MASK, %rax
    and     $-4, MASK
    mov     %rax, P1305_R0 (CTX)
    imul    $5, %rax
    mov     %rax, P1305_S0 (CTX)
    mov     8(KEY), %rax
    and     MASK, %rax
    mov     %rax, P1305_R1 (CTX)
    shr     $2, %rax
    imul    $5, %rax
    mov     %rax, P1305_S1 (CTX)
    xor     XREG(%rax), XREG(%rax)
    mov     %rax, P1305_H0 (CTX)
    mov     %rax, P1305_H1 (CTX)
    mov     %rax, P1305_H2 (CTX)
    W64_EXIT(2,0)
    ret
undefine(`KEY')
undefine(`MASK')
EPILOGUE(_nettle_poly1305_set_key)
define(`T0', `%rcx')
define(`T1', `%rsi') C Overlaps message input pointer.
define(`T2', `%r8')
define(`H0', `%r9')
define(`H1', `%r10')
define(`F0', `%r11')
define(`F1', `%r12')

C Compute in parallel
C
C   {H1,H0} = R0 T0 + S1 T1 + S0 (T2 >> 2)
C   {F1,F0} = R1 T0 + R0 T1 + S1 T2
C   T = R0 * (T2 & 3)
C
C Then accumulate as
C
C     +--+--+--+
C     |T |H1|H0|
C     +--+--+--+
C   +    |F1|F0|
C   --+--+--+--+
C     |H2|H1|H0|
C     +--+--+--+
C _poly1305_block (struct poly1305_ctx *ctx, const uint8_t m[16], unsigned hi)
PROLOGUE(_nettle_poly1305_block)
    W64_ENTRY(3, 0)
    push    %r12
    mov     (%rsi), T0
    mov     8(%rsi), T1
    mov     XREG(%rdx), XREG(T2)    C Also zero extends

    add     P1305_H0 (CTX), T0
    adc     P1305_H1 (CTX), T1
    adc     P1305_H2 (CTX), T2

    mov     P1305_R1 (CTX), %rax
    mul     T0                      C R1 T0
    mov     %rax, F0
    mov     %rdx, F1

    mov     T0, %rax                C Last use of T0 input
    mov     P1305_R0 (CTX), T0
    mul     T0                      C R0*T0
    mov     %rax, H0
    mov     %rdx, H1

    mov     T1, %rax
    mul     T0                      C R0*T1
    add     %rax, F0
    adc     %rdx, F1

    mov     P1305_S1 (CTX), T0
    mov     T1, %rax                C Last use of T1 input
    mul     T0                      C S1*T1
    add     %rax, H0
    adc     %rdx, H1

    mov     T2, %rax
    mul     T0                      C S1*T2
    add     %rax, F0
    adc     %rdx, F1

    mov     $3, XREG(T1)
    and     T2, T1

    shr     $2, T2
    mov     P1305_S0 (CTX), %rax
    mul     T2                      C S0*(T2 >> 2)
    add     %rax, H0
    adc     %rdx, H1

    imul    P1305_R0 (CTX), T1      C R0*(T2 & 3)
    add     F0, H1
    adc     T1, F1

    mov     H0, P1305_H0 (CTX)
    mov     H1, P1305_H1 (CTX)
    mov     F1, P1305_H2 (CTX)
    pop     %r12
    W64_EXIT(3, 0)
    ret
EPILOGUE(_nettle_poly1305_block)
undefine(`T0')
undefine(`T1')
undefine(`T2')
undefine(`H0')
undefine(`H1')
undefine(`F0')
undefine(`F1')
C _poly1305_digest (struct poly1305_ctx *ctx, uint8_t *s)
define(`S', `%rsi')

define(`T0', `%rcx')
define(`T1', `%r8')
define(`H0', `%r9')
define(`H1', `%r10')
define(`F0', `%r11')
define(`F1', `%rdi') C Overlaps CTX
PROLOGUE(_nettle_poly1305_digest)
    W64_ENTRY(2, 0)

    mov     P1305_H0 (CTX), H0
    mov     P1305_H1 (CTX), H1
    mov     P1305_H2 (CTX), F0

    xor     XREG(%rax), XREG(%rax)
    mov     %rax, P1305_H0 (CTX)
    mov     %rax, P1305_H1 (CTX)
    mov     %rax, P1305_H2 (CTX)

    mov     $3, XREG(%rax)
    and     XREG(F0), XREG(%rax)
    shr     $2, F0
    imul    $5, F0
    add     F0, H0
    adc     $0, H1
    adc     $0, XREG(%rax)

    C Add 5, use result if >= 2^130
    mov     $5, T0
    xor     T1, T1
    add     H0, T0
    adc     H1, T1
    adc     $0, XREG(%rax)          C Use adc $-4 ?
    cmp     $4, XREG(%rax)
    cmovnc  T0, H0
    cmovnc  T1, H1

    add     H0, (S)
    adc     H1, 8(S)

    W64_EXIT(2, 0)
    ret
EPILOGUE(_nettle_poly1305_digest)
nisse@lysator.liu.se (Niels Möller) writes:
Radix 64: 2.75 GByte/s, i.e., faster than current x86_64 asm version.
And I've now tried the same method for the x86_64 implementation. See attached file + needed patch to asm.m4. This gives 2.9 GByte/s.
I'm not entirely sure the cycle numbers are accurate, with the clock frequency not being fixed. I think the machine runs benchmarks at 2.1 GHz, and then this corresponds to 11.5 cycles per block, 0.7 cycles per byte, 4 instructions per cycle, and 0.5 multiply instructions per cycle.
This laptop has an AMD zen2 processor, which should be capable of issuing four instructions per cycle and complete one multiply instruction per cycle (according to https://gmplib.org/~tege/x86-timing.pdf).
This seems to indicate that on this hardware, speed is not limited by multiplier throughput, instead, the bottleneck is instruction decoding/issuing, with max four instructions per cycle.
Benchmarked also on my other nearby x86_64 machine (intel broadwell processor). It's faster there too (from 1.4 GByte/s to 1.75). I'd expect it to be generally faster, and have pushed it to the master-updates branch.
I haven't looked that carefully at what the old code was doing, but I think the final folding for each block used a multiply instruction that then depends on the previous ones for that block, increasing the per block latency. With the new code, all multiplies done for a block are independent of each other.
Regards, /Niels
On Thu, Jan 27, 2022 at 11:28 PM Niels Möller nisse@lysator.liu.se wrote:
nisse@lysator.liu.se (Niels Möller) writes:
Radix 64: 2.75 GByte/s, i.e., faster than current x86_64 asm version.
And I've now tried the same method for the x86_64 implementation. See attached file + needed patch to asm.m4. This gives 2.9 GByte/s.
I'm not entirely sure the cycle numbers are accurate, with the clock frequency not being fixed. I think the machine runs benchmarks at 2.1 GHz, and then this corresponds to 11.5 cycles per block, 0.7 cycles per byte, 4 instructions per cycle, and 0.5 multiply instructions per cycle.
This laptop has an AMD zen2 processor, which should be capable of issuing four instructions per cycle and complete one multiply instruction per cycle (according to https://gmplib.org/~tege/x86-timing.pdf).
This seems to indicate that on this hardware, speed is not limited by multiplier throughput, instead, the bottleneck is instruction decoding/issuing, with max four instructions per cycle.
Benchmarked also on my other nearby x86_64 machine (intel broadwell processor). It's faster there too (from 1.4 GByte/s to 1.75). I'd expect it to be generally faster, and have pushed it to the master-updates branch.
I haven't looked that carefully at what the old code was doing, but I think the final folding for each block used a multiply instruction that then depends on the previous ones for that block, increasing the per block latency. With the new code, all multiplies done for a block are independent of each other.
Great! I believe this is the best we can get for processing one block. I'm trying to implement two-way interleaving using the AVX extension; the main instruction of interest here is 'vpmuludq', which does two multiply operations. The main concern is the shortage of XMM registers, as there are only 16 of them; I'm working around this by using memory operands for the key values in 'vpmuludq' and hoping the processor cache does its thing here. I'm expecting to complete the assembly implementation tomorrow.
regards, Mamone
Regards, /Niels
-- Niels Möller. PGP key CB4962D070D77D7FCB8BA36271D8F1FF368C6677. Internet email is subject to wholesale government surveillance.
Maamoun TK maamoun.tk@googlemail.com writes:
Great! I believe this is the best we can get for processing one block.
One may be able to squeeze out one or two cycles more using the mulx extension, which should make it possible to eliminate some of the move instructions (I don't think moves cost any execution unit resources, but they do consume decoding resources).
I'm trying to implement two-way interleaving using the AVX extension; the main instruction of interest here is 'vpmuludq', which does two multiply operations
My manual seems a bit confused about whether it's called pmuludq or vpmuludq. But you're thinking of the instruction that does two 32x32 --> 64 multiplies? It will be interesting to see how that works out! It does half the work compared to a 64x64 --> 128 multiply instruction, but accumulation/folding may get more efficient by using vector registers. (There also seems to be an AVX variant doing four 32x32 --> 64 multiplies, using 256-bit registers.)
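(For reference, the intrinsic-level semantics of that instruction, purely to illustrate what it computes rather than any particular poly1305 data layout:)

#include <stdint.h>
#include <immintrin.h>

/* pmuludq/vpmuludq multiplies the low 32 bits of each 64-bit lane of a and b,
   giving two independent 32x32 -> 64 products per instruction
   (the AVX2 form _mm256_mul_epu32 gives four). */
static inline __m128i
two_products(uint32_t a0, uint32_t a1, uint32_t b0, uint32_t b1)
{
  __m128i a = _mm_set_epi64x((long long) a1, (long long) a0);
  __m128i b = _mm_set_epi64x((long long) b1, (long long) b0);
  /* result lane 0 = (uint64_t) a0 * b0, lane 1 = (uint64_t) a1 * b1 */
  return _mm_mul_epu32(a, b);
}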
the main concern here is the shortage of XMM registers, as there are only 16 of them; I'm working around this by using memory operands for the key values in 'vpmuludq' and hoping the processor cache does its thing here.
Reading cached values from memory is usually cheap. So probably fine, as long as the values being modified are kept in registers.
I'm expecting to complete the assembly implementation tomorrow.
If my analysis of the single-block code is right, I'd expect it to be rather important to trim number of instructions per block.
Regards, /Niels