Re: Old ARM Neon code for salsa20 and chacha

28 Jan 2021


nisse@lysator.liu.se (Niels Möller) writes:
...
For processors that can issue two instructions per cycle, and with
shorter latency, scalar code (i.e., code using only the general purpose
32-bit registers) could get more or less the same throughput. The scalar
code also gets the advantage that there's a handy rotate instruction
(instead of the shift right + shift left + combine used in the Neon
code), but it has the disadvantage of register shortage, and will need a
bunch of load and store instructions to access the state.
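
To make the rotate point concrete, here is a minimal C sketch, not taken
from Nettle's sources. The names rotl32 and chacha_quarter_round are
hypothetical and chosen only for illustration. The rotate is spelled out
as the same three steps the Neon code needs (shift left, shift right,
combine); for scalar code, compilers can usually recognize this idiom and
emit a single rotate instruction, though that depends on the compiler and
target. The quarter round follows the ChaCha definition in RFC 8439.

```c
#include <stdint.h>

/* Rotate left by n bits, written as shift left + shift right + combine.
   Only valid for 0 < n < 32, which covers the ChaCha rotation counts. */
static inline uint32_t rotl32(uint32_t x, unsigned n)
{
  return (x << n) | (x >> (32 - n));
}

/* One ChaCha quarter round on four 32-bit state words. A scalar
   implementation has to keep sixteen such words live, hence the
   register pressure and the extra loads and stores. */
static void chacha_quarter_round(uint32_t *a, uint32_t *b,
                                 uint32_t *c, uint32_t *d)
{
  *a += *b; *d ^= *a; *d = rotl32(*d, 16);
  *c += *d; *b ^= *c; *b = rotl32(*b, 12);
  *a += *b; *d ^= *a; *d = rotl32(*d, 8);
  *c += *d; *b ^= *c; *b = rotl32(*b, 7);
}
```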

That doesn't quite explain why I saw a 45% speedup with Neon in 2013,
which has now disappeared. But maybe current gcc has good enough
instruction scheduling to produce code that can issue 2 instructions per
cycle on Cortex-A9 (which has quite limited out-of-order capabilities),
and gcc back then couldn't do that?

So what's next? Should the old code just be deleted?

With the new 2-way or 3-way functions, performance of the single-block
functions isn't that critical, so deletion may be ok even if it causes
some small regression on some processors (e.g., single-block chacha
getting 13% slower on the old Cortex-A9).

I've made a branch with deletion of this code, "delete-1-way-neon". Any
comments before I merge to master?

Regards,
/Niels
-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
