Optimizing salsa20

7 Jul 2020


I've written some new ARM Neon assembly for salsa20; see
https://gitlab.com/gnutls/nettle/-/commit/2ac58a1ce729a6cfe1d3703f4deb6da886....
It is used when configured with --enable-arm-neon.
It interleaves the processing of two blocks, which gives a speedup of
50% to 100% on the ARM cores where I've tested it. Before merging, I
need to fix fat builds to use the new code on processors that support
it.
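
As a rough illustration of what interleaving two blocks means, here is
a portable C sketch. It is not the Neon assembly from the commit above,
and the names salsa20_core_1, salsa20_core_2, QR and QR2 are made up
for this example. The idea is only that the two blocks form independent
dependency chains, so each step of block x can be followed directly by
the same step of block y.

```c
#include <stdint.h>
#include <string.h>

#define ROTL32(v, n) (((v) << (n)) | ((v) >> (32 - (n))))

/* Salsa20 quarter round on one block x, words a, b, c, d. */
#define QR(x, a, b, c, d) do {              \
    x[b] ^= ROTL32(x[a] + x[d], 7);         \
    x[c] ^= ROTL32(x[b] + x[a], 9);         \
    x[d] ^= ROTL32(x[c] + x[b], 13);        \
    x[a] ^= ROTL32(x[d] + x[c], 18);        \
  } while (0)

/* Same quarter round on two independent blocks x and y, with the
   steps of the two blocks interleaved. */
#define QR2(x, y, a, b, c, d) do {                                     \
    x[b] ^= ROTL32(x[a] + x[d], 7);   y[b] ^= ROTL32(y[a] + y[d], 7);  \
    x[c] ^= ROTL32(x[b] + x[a], 9);   y[c] ^= ROTL32(y[b] + y[a], 9);  \
    x[d] ^= ROTL32(x[c] + x[b], 13);  y[d] ^= ROTL32(y[c] + y[b], 13); \
    x[a] ^= ROTL32(x[d] + x[c], 18);  y[a] ^= ROTL32(y[d] + y[c], 18); \
  } while (0)

/* Reference: 20-round Salsa20 core on a single block (hypothetical name). */
void
salsa20_core_1(uint32_t out[16], const uint32_t in[16])
{
  uint32_t x[16];
  int i;
  memcpy(x, in, sizeof x);
  for (i = 0; i < 20; i += 2)
    {
      /* Column round */
      QR(x, 0, 4, 8, 12);   QR(x, 5, 9, 13, 1);
      QR(x, 10, 14, 2, 6);  QR(x, 15, 3, 7, 11);
      /* Row round */
      QR(x, 0, 1, 2, 3);    QR(x, 5, 6, 7, 4);
      QR(x, 10, 11, 8, 9);  QR(x, 15, 12, 13, 14);
    }
  for (i = 0; i < 16; i++)
    out[i] = x[i] + in[i];
}

/* Two blocks at a time, interleaved (hypothetical name). */
void
salsa20_core_2(uint32_t out0[16], uint32_t out1[16],
               const uint32_t in0[16], const uint32_t in1[16])
{
  uint32_t x[16], y[16];
  int i;
  memcpy(x, in0, sizeof x);
  memcpy(y, in1, sizeof y);
  for (i = 0; i < 20; i += 2)
    {
      /* Column round, both blocks */
      QR2(x, y, 0, 4, 8, 12);   QR2(x, y, 5, 9, 13, 1);
      QR2(x, y, 10, 14, 2, 6);  QR2(x, y, 15, 3, 7, 11);
      /* Row round, both blocks */
      QR2(x, y, 0, 1, 2, 3);    QR2(x, y, 5, 6, 7, 4);
      QR2(x, y, 10, 11, 8, 9);  QR2(x, y, 15, 12, 13, 14);
    }
  for (i = 0; i < 16; i++)
    {
      out0[i] = x[i] + in0[i];
      out1[i] = y[i] + in1[i];
    }
}
```

Calling salsa20_core_2 on two input blocks that differ only in the
counter word should give the same output as two calls to
salsa20_core_1. Whether a compiler keeps this scalar interleaving is up
to its scheduler; in hand-written assembly the instruction order is
under direct control.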

To make it work on big-endian ARM as well, I'd need some help. (I think
the qemu-user package supports big-endian ARM; at least, it includes a
program named qemu-armeb. But I'm missing a cross compiler and cross
debugger.)

I'd like to do the same for x86_64. And for chacha, it might give an
even greater speedup to interleave processing of three blocks, which may
be possible since I think chacha needs fewer registers for temporaries.

For both x86_64 and ARM Neon, the current code uses 128-bit wide
registers. Processors with 256-bit wide SIMD registers (at least 16 of
them) could do twice as many blocks at a time.

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
