nisse@lysator.liu.se (Niels Möller) writes:
I've done a benchmark run of nettle-3.6 on the GMP "nanot2" system, with a Cortex-A9 processor. The installed compiler is gcc-5.4 (a few years old).
I chose Cortex-A9 for this test in an attempt to reproduce my old numbers, even though it's probably not that relevant today.
So no big differences, but the Neon code improves performance slightly for chacha and salsa20r12, and degrades performance slightly for salsa20.
(The improvement for chacha actually seems significant: a 13% speedup for the Neon code.)
This is all about the old single-block functions. The Neon code for both salsa20 and chacha uses instructions operating on four 32-bit entries at a time. But most instructions depend on the result of the previous instruction, and latency of Neon instructions is pretty high. According to measurements by Torbjörn Granlund, we typically have a latency of at *least* two cycles (the only observed case of single-cycle latency was for veor on A53 and A55).
In addition, two shift operations, even if they are independent, typically can't be issued in the same cycle, because they compete for a single shift unit. So if we look at a single round (i.e., a quarter of a QROUND) and annotate with latency numbers, i.e., the earliest cycle in which each instruction can start, and for simplicity assume that all instructions except veor have a latency of 2 cycles, we get (this is for salsa20):
  vadd.i32  q8, q0, q3     0   t = x0 + x1
  vshl.i32  q9, q8, #7     2   t <<<= 7
  vshr.u32  q8, q8, #25    3
  veor      q1, q1, q8     4   x1 ^= t
  veor      q1, q1, q9     5

  vadd.i32  q8, q0, q1     6   (next QROUND)
So that's 6 cycles for the same work as 12 scalar (32-bit) operations: add, rotate and xor on each of the four lanes (rotation is a single operation when done on scalar registers). So at best, we can expect to get two 32-bit operations done per cycle. For SIMD, that's not great at all.
For processors that can issue two instructions per cycle, and with shorter latency, scalar code (i.e., code using only the general purpose 32-bit registers) could get more or less the same throughput. The scalar code also gets the advantage that there's a handy rotate instruction (instead of the shift right + shift left + combine used in the Neon code), but it has the disadvantage of register shortage, and will need a bunch of load and store instructions to access the state.
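To make the rotate comparison concrete, here is a minimal C sketch of one such step, written both ways. It is only an illustration, not code from nettle: the function names and the operand names x, a and b are made up for the example. The second form mirrors, per 32-bit lane, what the Neon sequence does: since the two shifted halves have no bits in common, xoring them in one at a time gives the same result as a real rotate.

```c
#include <stdint.h>

/* Rotate left by k bits (0 < k < 32). Compilers typically turn this
   idiom into a single rotate instruction on scalar registers. */
static uint32_t
rotl32(uint32_t v, unsigned k)
{
  return (v << k) | (v >> (32 - k));
}

/* Scalar form of the step, x ^= (a + b) <<< 7: add, rotate, xor. */
static uint32_t
step_scalar(uint32_t x, uint32_t a, uint32_t b)
{
  return x ^ rotl32(a + b, 7);
}

/* The same step without a rotate, as done per lane in the Neon code:
   add, shift left, shift right, and two xors. */
static uint32_t
step_shift_xor(uint32_t x, uint32_t a, uint32_t b)
{
  uint32_t t = a + b;     /* vadd.i32 */
  uint32_t hi = t << 7;   /* vshl.i32 #7 */
  uint32_t lo = t >> 25;  /* vshr.u32 #25 */
  x ^= lo;                /* veor */
  x ^= hi;                /* veor */
  return x;
}
```

Both functions return the same value for any inputs; the scalar form just needs three operations per 32-bit word where the shift-based form needs five.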
That doesn't quite explain why I saw a 45% speedup with Neon in 2013, which has now disappeared. But maybe current gcc has good enough instruction scheduling to produce code that can issue 2 instructions per cycle on Cortex-A9 (which has quite limited out-of-order capabilities), and gcc back then couldn't do that?
So what's next? Should the old code just be deleted?
With the new 2-way or 3-way functions, performance of the single-block functions isn't that critical, so deletion may be ok even if it causes a small regression on some processors (e.g., single-block chacha getting 13% slower on the old Cortex-A9).
Regards,
/Niels