Re: Optimizing salsa20

9 Jul 2020


I would like to help, but I have no experience with ARM NEON, sorry.
Regards,
Mamone
On Tue, Jul 7, 2020 at 5:46 PM Niels Möller nisse@lysator.liu.se wrote:
...
I've written some new ARM Neon assembly for salsa20. See
https://gitlab.com/gnutls/nettle/-/commit/2ac58a1ce729a6cfe1d3703f4deb6da886...
(enabled when configured with --enable-arm-neon).
It interleaves the processing of two blocks, which gives a speedup of
50% -- 100% on the ARM cores where I've tested it. Before merging, I
need to fix fat builds to use the new code on processors that support
it.
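
[Editor's sketch, not the code from the commit above.] The idea of
interleaving two blocks can be illustrated in portable C. The two
blocks are independent, so their quarter rounds can alternate and a
core with spare execution resources can overlap the two dependency
chains. The function names salsa20_core, salsa20_core_2way and
salsa20_2way_matches are hypothetical, and the real Neon code
interleaves at the instruction level on 128-bit registers rather than
on scalar words.

```c
#include <stdint.h>
#include <string.h>

#define ROTL32(x, n) (((x) << (n)) | ((x) >> (32 - (n))))

/* One Salsa20 quarter round on four 32-bit state words. */
#define QROUND(a, b, c, d) do { \
    b ^= ROTL32(a + d, 7);      \
    c ^= ROTL32(b + a, 9);      \
    d ^= ROTL32(c + b, 13);     \
    a ^= ROTL32(d + c, 18);     \
  } while (0)

/* Reference: one block, 20 rounds (10 double rounds). */
static void salsa20_core(uint32_t out[16], const uint32_t in[16])
{
  uint32_t x[16];
  int i;
  memcpy(x, in, sizeof x);
  for (i = 0; i < 20; i += 2) {
    /* Column round. */
    QROUND(x[0], x[4], x[8], x[12]);
    QROUND(x[5], x[9], x[13], x[1]);
    QROUND(x[10], x[14], x[2], x[6]);
    QROUND(x[15], x[3], x[7], x[11]);
    /* Row round. */
    QROUND(x[0], x[1], x[2], x[3]);
    QROUND(x[5], x[6], x[7], x[4]);
    QROUND(x[10], x[11], x[8], x[9]);
    QROUND(x[15], x[12], x[13], x[14]);
  }
  for (i = 0; i < 16; i++)
    out[i] = x[i] + in[i];
}

/* Two independent blocks, with their quarter rounds alternated. */
static void salsa20_core_2way(uint32_t out0[16], uint32_t out1[16],
                              const uint32_t in0[16], const uint32_t in1[16])
{
  uint32_t x[16], y[16];
  int i;
  memcpy(x, in0, sizeof x);
  memcpy(y, in1, sizeof y);
  for (i = 0; i < 20; i += 2) {
    /* Column round, block 0 and block 1 alternating. */
    QROUND(x[0], x[4], x[8], x[12]);   QROUND(y[0], y[4], y[8], y[12]);
    QROUND(x[5], x[9], x[13], x[1]);   QROUND(y[5], y[9], y[13], y[1]);
    QROUND(x[10], x[14], x[2], x[6]);  QROUND(y[10], y[14], y[2], y[6]);
    QROUND(x[15], x[3], x[7], x[11]);  QROUND(y[15], y[3], y[7], y[11]);
    /* Row round, block 0 and block 1 alternating. */
    QROUND(x[0], x[1], x[2], x[3]);    QROUND(y[0], y[1], y[2], y[3]);
    QROUND(x[5], x[6], x[7], x[4]);    QROUND(y[5], y[6], y[7], y[4]);
    QROUND(x[10], x[11], x[8], x[9]);  QROUND(y[10], y[11], y[8], y[9]);
    QROUND(x[15], x[12], x[13], x[14]); QROUND(y[15], y[12], y[13], y[14]);
  }
  for (i = 0; i < 16; i++) {
    out0[i] = x[i] + in0[i];
    out1[i] = y[i] + in1[i];
  }
}

/* Returns 1 if the 2-way version agrees with two single-block calls.
   The input words are arbitrary test values, not a real key setup. */
static int salsa20_2way_matches(void)
{
  uint32_t in0[16], in1[16], a0[16], a1[16], b0[16], b1[16];
  int i;
  for (i = 0; i < 16; i++)
    in0[i] = 0x9e3779b9u * (uint32_t)(i + 1);
  memcpy(in1, in0, sizeof in1);
  in1[8]++; /* next block: low word of the block counter incremented */
  salsa20_core(a0, in0);
  salsa20_core(a1, in1);
  salsa20_core_2way(b0, b1, in0, in1);
  return memcmp(a0, b0, sizeof a0) == 0 && memcmp(a1, b1, sizeof a1) == 0;
}
```

Calling salsa20_2way_matches() checks only that the interleaved
ordering computes the same result as two separate calls. It says
nothing about speed, which depends on the core and on how the
compiler or hand-written assembly schedules the instructions.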
To make it work on big-endian ARM as well, I'd need some help. (I think
the qemu-user package supports big-endian ARM; at least, it includes a
program named qemu-armeb. But I'm missing a cross compiler and cross
debugger).
I'd like to do the same for x86_64. And for chacha, it might give even
greater speedup to interleave processing of three blocks, which may be
possible since I think chacha needs fewer registers for temporaries.
For both x86_64 and ARM Neon, the current code uses 128-bit wide
registers. Processors with 256-bit wide SIMD registers (at least 16 of
them) could do twice as many blocks at a time.
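
[Editor's sketch.] The register arithmetic behind this can be written
out as follows. A salsa20 or chacha state is 16 words of 32 bits, that
is 512 bits, or four 128-bit registers per block. The sketch assumes a
layout in which a wider register holds the same 128-bit row of several
blocks side by side. That layout and the helper names are assumptions
for illustration, not taken from the existing code.

```c
enum {
  STATE_WORDS = 16, /* 32-bit words in one salsa20/chacha state */
  ROW_BITS = 128    /* one row of four 32-bit words */
};

/* Registers needed to hold the state of `interleave` register groups,
   at four rows per group. The count does not depend on register width
   under the assumed layout. */
static unsigned state_registers(unsigned interleave)
{
  return (STATE_WORDS * 32 / ROW_BITS) * interleave;
}

/* Blocks processed per pass: each register carries reg_bits / 128
   block lanes, and `interleave` register groups run side by side. */
static unsigned blocks_per_pass(unsigned reg_bits, unsigned interleave)
{
  return (reg_bits / ROW_BITS) * interleave;
}
```

Under these assumptions, two-way interleaving uses 8 state registers
in both cases and handles 2 blocks per pass with 128-bit registers,
or 4 blocks per pass with 256-bit registers. Temporaries come on top
of that, which is where the number of available registers matters.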
Regards,
/Niels
--
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.

nettle-bugs mailing list
nettle-bugs@lists.lysator.liu.se
http://lists.lysator.liu.se/mailman/listinfo/nettle-bugs
