Old ARM Neon code for salsa20 and chacha (was: Re: Release of Nettle-3.7?)

13 Jan 2021


      nisse@lysator.liu.se (Niels Möller) writes:
...
I've done a benchmark run of nettle-3.6 on the GMP "nanot2" system, with
a Cortex-A9 processor. The installed compiler is gcc-5.4 (a few years
old).
I chose Cortex-A9 for this test in an attempt to reproduce my old numbers,
even if it's probably not that relevant today.
...
So no big differences, but the Neon code improves performance slightly
for chacha and salsa20r12, and degrades performance slightly for salsa20.
(The improvement for chacha actually seems significant, a 13% speedup for
the Neon code).
This is all about the old single-block functions. The Neon code for both
salsa20 and chacha uses instructions operating on four 32-bit entries at
a time. But most instructions depend on the result of the previous
instruction, and latency of Neon instructions is pretty high. According
to measurements by Torbjörn Granlund, we typically have a latency of at
*least* two cycles (the only observed case of single-cycle latency was
for veor on A53 and A55).
In addition, two shift operations, even if they are independent,
typically can't be issued in the same cycle, because they compete for a
single shift unit. So if we look at a single round (i.e., a quarter of a
qround) and annotate with latency numbers, i.e., the earliest cycle the
instruction can be started, and for simplicity assume that all
instructions but veor have a latency of 2 cycles, we get (this is for
salsa20):

  vadd.i32 q8, q0, q3        0  t = x0 + x3
  vshl.i32 q9, q8, #7        2  t <<<= 7
  vshr.u32 q8, q8, #25       3
  veor     q1, q1, q8        4  x1 ^= t
  veor     q1, q1, q9        5
  vadd.i32 q8, q0, q1        6  (next QROUND)

So that's 6 cycles, for the same work as 12 scalar (32-bit) operations
(rotation is a single operation if done on scalar registers). So at
best, we can expect to get two 32-bit operations done per cycle. For
SIMD, that's not great at all.
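
To make the operation count concrete, here is a small C sketch of the same step, written both ways. It is not code from Nettle; the function names and the array-of-four-lanes representation are made up for illustration. The first function mirrors the five Neon instructions above, applied lane by lane, and the second uses three scalar operations per lane (add, rotate, xor), which gives the 12 scalar operations mentioned above. The two agree because the bits produced by the left shift and the right shift never overlap, so xoring them in one at a time is the same as xoring in the rotated value.

```c
#include <stdint.h>

/* Rotate left by n bits; n must be in 1..31 so neither shift is by 32. */
uint32_t rotl32(uint32_t x, unsigned n)
{
  return (x << n) | (x >> (32 - n));
}

/* Neon-style: five operations, each applied to all four lanes. */
void step_shift_xor(uint32_t x1[4], const uint32_t x0[4], const uint32_t x3[4])
{
  uint32_t t[4], hi[4], lo[4];
  int i;
  for (i = 0; i < 4; i++) t[i] = x0[i] + x3[i];  /* vadd.i32 */
  for (i = 0; i < 4; i++) hi[i] = t[i] << 7;     /* vshl.i32 #7 */
  for (i = 0; i < 4; i++) lo[i] = t[i] >> 25;    /* vshr.u32 #25 */
  for (i = 0; i < 4; i++) x1[i] ^= lo[i];        /* veor */
  for (i = 0; i < 4; i++) x1[i] ^= hi[i];        /* veor */
}

/* Scalar-style: add, rotate, xor for each lane, 3 * 4 = 12 operations. */
void step_rotate(uint32_t x1[4], const uint32_t x0[4], const uint32_t x3[4])
{
  int i;
  for (i = 0; i < 4; i++)
    x1[i] ^= rotl32(x0[i] + x3[i], 7);
}
```

Calling both functions on copies of the same input leaves identical values in x1, so the only difference is how many instructions are spent and how they can be scheduled.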
For processors that can issue two instructions per cycle, and with
shorter latency, scalar code (i.e., code using only the general purpose
32-bit registers) could get more or less the same throughput. The scalar
code also gets the advantage that there's a handy rotate instruction
(instead of the shift right + shift left + combine used in the Neon
code), but it has the disadvantage of register shortage, and will need a
bunch of load and store instructions to access the state.
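
For comparison, a scalar Salsa20 quarter round can be sketched in C as below. This is only an illustration and not Nettle's implementation; the names salsa20_rotl and salsa20_quarterround are hypothetical. The rotate is written as the usual shift-and-or idiom, which compilers commonly turn into a single rotate instruction where one is available, but that depends on the compiler and target. The four values are dependent on each other in a chain, which is why the latency of each operation matters so much.

```c
#include <stdint.h>

/* Rotate left by n bits; n must be in 1..31 so neither shift is by 32. */
uint32_t salsa20_rotl(uint32_t x, unsigned n)
{
  return (x << n) | (x >> (32 - n));
}

/* One Salsa20 quarter round on four 32-bit words, updated in place.
   Each line is add + rotate + xor, and each depends on the one before. */
void salsa20_quarterround(uint32_t y[4])
{
  y[1] ^= salsa20_rotl(y[0] + y[3], 7);
  y[2] ^= salsa20_rotl(y[1] + y[0], 9);
  y[3] ^= salsa20_rotl(y[2] + y[1], 13);
  y[0] ^= salsa20_rotl(y[3] + y[2], 18);
}
```

A full scalar implementation has to apply this to 16 state words, which is more than the available general purpose registers on 32-bit ARM, hence the extra loads and stores.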
That doesn't quite explain why I saw a 45% speedup with Neon in 2013,
which has now disappeared. But maybe current gcc has good enough
instruction scheduling to produce code that can issue 2 instructions per
cycle on Cortex-A9 (which has quite limited out-of-order capabilities),
and gcc back then couldn't do that?
So what's next? Should the old code just be deleted?
With the new 2-way or 3-way functions, performance of the single-block
functions isn't that critical, so deletion may be ok even if it causes
a small regression on some processors (e.g., single-block chacha
getting 13% slower on the old Cortex-A9).
Regards,
/Niels
-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
