I've written some new ARM Neon assembly for salsa20. See https://gitlab.com/gnutls/nettle/-/commit/2ac58a1ce729a6cfe1d3703f4deb6da886..., when configured with --enable-arm-neon.
It interleaves the processing of two blocks, which gives a speedup of 50% -- 100% on the ARM cores where I've tested it. Before merging, I need to fix fat builds to use the new code on processors that support it.
To make it work also on big-endian ARM, I'd need some help. (I think the qemu-user package supports big-endian ARM, at least, it includes a program named qemu-armeb. But I'm missing a cross compiler and cross debugger).
I'd like to do the same for x86_64. And for chacha, it might give even greater speedup to interleave processing of three blocks, which may be possible since I think chacha needs fewer registers for temporaries.
For both x86_64 and ARM neon, the current code uses 128-bit wide registers. Processors with 256-bit wide simd registers (at least 16 of them) could do twice as many blocks at a time.
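To illustrate the interleaving idea described above, here is a minimal scalar sketch in C. It is not the NEON assembly from the commit: the names sketch_salsa20_core and sketch_salsa20_2core and their signatures are hypothetical, chosen only for this example. It shows that two blocks have no data dependencies on each other, so their quarter rounds can be issued back to back, which is what lets a CPU (or a hand-scheduled assembly loop) overlap the work.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Rotate a 32-bit word left by n bits, 0 < n < 32. */
#define ROTL32(x, n) (((x) << (n)) | ((x) >> (32 - (n))))

/* One Salsa20 quarter round on words a, b, c, d of state x. */
#define QROUND(x, a, b, c, d) do {        \
    x[b] ^= ROTL32(x[a] + x[d], 7);       \
    x[c] ^= ROTL32(x[b] + x[a], 9);       \
    x[d] ^= ROTL32(x[c] + x[b], 13);      \
    x[a] ^= ROTL32(x[d] + x[c], 18);      \
  } while (0)

/* Reference: one block at a time (hypothetical name, illustration only). */
static void
sketch_salsa20_core(uint32_t out[16], const uint32_t in[16], unsigned rounds)
{
  uint32_t x[16];
  unsigned i;

  assert(rounds % 2 == 0);
  memcpy(x, in, sizeof(x));
  for (i = 0; i < rounds; i += 2)
    {
      /* Column round. */
      QROUND(x, 0, 4, 8, 12);  QROUND(x, 5, 9, 13, 1);
      QROUND(x, 10, 14, 2, 6); QROUND(x, 15, 3, 7, 11);
      /* Row round. */
      QROUND(x, 0, 1, 2, 3);   QROUND(x, 5, 6, 7, 4);
      QROUND(x, 10, 11, 8, 9); QROUND(x, 15, 12, 13, 14);
    }
  for (i = 0; i < 16; i++)
    out[i] = x[i] + in[i];
}

/* Two independent blocks, with their quarter rounds interleaved
   (hypothetical name and signature, illustration only). */
static void
sketch_salsa20_2core(uint32_t out0[16], uint32_t out1[16],
                     const uint32_t in0[16], const uint32_t in1[16],
                     unsigned rounds)
{
  uint32_t x[16], y[16];
  unsigned i;

  assert(rounds % 2 == 0);
  memcpy(x, in0, sizeof(x));
  memcpy(y, in1, sizeof(y));
  for (i = 0; i < rounds; i += 2)
    {
      /* Column round, alternating between block x and block y. */
      QROUND(x, 0, 4, 8, 12);  QROUND(y, 0, 4, 8, 12);
      QROUND(x, 5, 9, 13, 1);  QROUND(y, 5, 9, 13, 1);
      QROUND(x, 10, 14, 2, 6); QROUND(y, 10, 14, 2, 6);
      QROUND(x, 15, 3, 7, 11); QROUND(y, 15, 3, 7, 11);
      /* Row round, alternating in the same way. */
      QROUND(x, 0, 1, 2, 3);     QROUND(y, 0, 1, 2, 3);
      QROUND(x, 5, 6, 7, 4);     QROUND(y, 5, 6, 7, 4);
      QROUND(x, 10, 11, 8, 9);   QROUND(y, 10, 11, 8, 9);
      QROUND(x, 15, 12, 13, 14); QROUND(y, 15, 12, 13, 14);
    }
  for (i = 0; i < 16; i++)
    {
      out0[i] = x[i] + in0[i];
      out1[i] = y[i] + in1[i];
    }
}
```

Calling sketch_salsa20_2core with two input states gives the same output words as two separate calls to sketch_salsa20_core. The sketch only shows that the two blocks are independent; any actual speedup depends on the target core and on how the instructions are scheduled.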
Regards, /Niels
I would like to help but I have no clue or experience with ARM NEON, sorry.
regards, Mamone
On Tue, Jul 7, 2020 at 5:46 PM Niels Möller nisse@lysator.liu.se wrote:
I've written some new ARM Neon assembly for salsa20. See
https://gitlab.com/gnutls/nettle/-/commit/2ac58a1ce729a6cfe1d3703f4deb6da886... , when configured with --enable-arm-neon.
It interleaves the processing of two blocks, which gives a speedup of 50% -- 100% on the ARM cores where I've tested it. Before merging, I need to fix fat builds to use the new code on processors that support it.
To make it work also on big-endian ARM, I'd need some help. (I think the qemu-user package supports big-endian ARM, at least, it includes a program named qemu-armeb. But I'm missing a cross compiler and cross debugger).
I'd like to do the same for x86_64. And for chacha, it might give even greater speedup to interleave processing of three blocks, which may be possible since I think chacha needs fewer registers for temporaries.
For both x86_64 and ARM neon, the current code uses 128-bit wide registers. Processors with 256-bit wide simd registers (at least 16 of them) could do twice as many blocks at a time.
Regards, /Niels
-- Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677. Internet email is subject to wholesale government surveillance.
nettle-bugs mailing list nettle-bugs@lists.lysator.liu.se http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
nisse@lysator.liu.se (Niels Möller) writes:
It interleaves the processing of two blocks, which gives a speedup of 50% -- 100% on the ARM cores where I've tested it. Before merging, I need to fix fat builds to use the new code on processors that support it.
I've added the fat build support, which needed a bit of reorganization, and merged to master. This will break support for big-endian ARM for now, since I'm not able to test that.
Regards, /Niels
Hello Niels,
sorry for the delay - I've been on vacation.
On Thu, Jul 09, 2020 at 04:05:21PM +0200, Niels Möller wrote:
This will break support for big-endian ARM for now, since I'm not able to test that.
We still have the ARM BE CI ready to go. Is it maybe time to get it activated on GitLab? I've put it in an MR for reference (https://git.lysator.liu.se/nettle/nettle/-/merge_requests/8) but can also submit via the list once we've decided where to put the container images for good. I'd still vote for GnuTLS's build-images, as the others. nettle could have its own clone of it as well, I guess.
I've run branch master-updates through it and it compiles and runs the testsuite fine:
https://gitlab.com/michaelweiser/nettle/-/jobs/648326607
master indeed fails:
https://gitlab.com/michaelweiser/nettle/-/jobs/648334928
libnettle: cpu features: arch:6,neon
libnettle: enabling armv6 code.
libnettle: enabling neon code.
Assert failed: testutils.c:831: MEMEQ(length, data, ciphertext->data)
qemu: uncaught target signal 6 (Aborted) - core dumped
Aborted (core dumped)
FAIL: chacha-poly1305
Is this about what you've expected? Then I'll look into it.
Any other branches I should try?
Michael Weiser michael.weiser@gmx.de writes:
sorry for the delay - I've been on vacation.
No problem. If you can test and debug arm big-endian, that's appreciated.
We still have the ARM BE CI ready to go. Is it maybe time to get it activated on GitLab? I've put it in an MR for reference (https://git.lysator.liu.se/nettle/nettle/-/merge_requests/8) but can also submit via the list once we've decided where to put the container images for good. I'd still vote for GnuTLS's build-images, as the others.
If the gnutls people are willing to host it, that would be nice. Do you think that can happen soon? Otherwise, I'd be happy to merge as is.
I think Nikos wrote a while back that he's less active in gnutls, so I'm not sure who we'd need to coordinate with. (And I haven't followed all the details in how you generate the buildroot images).
BTW, I've noticed that the debian qemu-user package does include qemu-armeb, but still no packaged armeb cross compiler, as far as I'm aware.
master indeed fails:
https://gitlab.com/michaelweiser/nettle/-/jobs/648334928
libnettle: cpu features: arch:6,neon
libnettle: enabling armv6 code.
libnettle: enabling neon code.
Assert failed: testutils.c:831: MEMEQ(length, data, ciphertext->data)
qemu: uncaught target signal 6 (Aborted) - core dumped
Aborted (core dumped)
FAIL: chacha-poly1305
Is this about what you've expected? Then I'll look into it.
I expect anything calling the new functions _chacha_3core and _salsa20_2core to fail. The easiest way to debug and fix is to run the test cases salsa20-test and chacha-test, which exercise the new functions through test_chacha_core and test_salsa20_core, which I added to the tests recently.
Those tests have the advantage that they set the input to 0,1,2,...,15 (except one counter word is set to 0xffffffff, to test carry propagation), so it should be fairly easy to follow the permutations at the top of the functions. At least, I've found that very helpful when debugging the most recent neon and x86 code.
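As a rough illustration of the kind of input described above, the following C sketch fills a 16-word state with 0,1,...,15, sets one counter word to 0xffffffff, and increments the 64-bit block counter so that the carry has to propagate into the next word. The helper names are hypothetical and are not the code in the nettle testsuite. Which word the real tests overwrite is an assumption here; the usage note uses index 8, since Salsa20 keeps its block counter in words 8 and 9.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: state[i] = i for all 16 words, then the low
   counter word (index counter_lo, an assumption of this sketch) is
   forced to all ones. */
static void
sketch_make_test_input(uint32_t state[16], unsigned counter_lo)
{
  unsigned i;

  assert(counter_lo < 15);
  for (i = 0; i < 16; i++)
    state[i] = i;
  state[counter_lo] = 0xffffffff;
}

/* Hypothetical helper: increment a 64-bit counter stored as two
   32-bit words, low word first, propagating the carry. */
static void
sketch_increment_counter(uint32_t state[16], unsigned counter_lo)
{
  assert(counter_lo < 15);
  state[counter_lo]++;
  if (state[counter_lo] == 0)
    state[counter_lo + 1]++;
}
```

With counter_lo set to 8, the low word wraps from 0xffffffff to 0 and the high word goes from 9 to 10. Code that forgets the carry leaves the high word at 9, so the error shows up in the second block of a multi-block core function.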
_chacha_3core interleaves three blocks with 4 separate state registers for each block, so big-endian fixes should be very similar to what you've done for _chacha_core (which I believe is still in working shape). _salsa20_2core, on the other hand, uses a somewhat different register allocation, with each register holding corresponding words from two input blocks.
Any other branches I should try?
The new code has been pushed to master, so that's the most relevant branch.
Regards, /Niels
nisse@lysator.liu.se (Niels Möller) writes:
I'd like to do the same for x86_64.
I've now tried the same interleaving for salsa20 on x86_64, and it gives a 25% speedup on my laptop. Pushed to a new branch, x86_64-salsa20-2core.
Regards, /Niels