nisse@lysator.liu.se (Niels Möller) writes:
For processors that can issue two instructions per cycle, and with shorter latency, scalar code (i.e., code using only the general purpose 32-bit registers) could get more or less the same throughput. The scalar code also gets the advantage that there's a handy rotate instruction (instead of the shift right + shift left + combine used in the Neon code), but it has the disadvantage of register shortage, and will need a bunch of load and store instructions to access the state.
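To illustrate the rotate point, here is a minimal sketch in portable C. It is not the actual assembly from the tree. The names rotl32, struct vec4 and vec4_rotl are made up for this example, and the four-lane struct only stands in for a 128-bit Neon register. The scalar version is the usual idiom that compilers typically turn into a single rotate instruction. The vector version spells out the three steps (shift left, shift right, combine) that the Neon code needs.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar rotate left. Compilers typically recognise this idiom and
   emit a single rotate instruction where the target has one. */
static uint32_t
rotl32 (uint32_t x, unsigned n)
{
  assert (n > 0 && n < 32);
  return (x << n) | (x >> (32 - n));
}

/* Hypothetical stand-in for a 128-bit vector register holding
   four 32-bit lanes. */
struct vec4
{
  uint32_t lane[4];
};

/* Rotate every lane left by n, written as three separate passes to
   mirror the shift left + shift right + combine sequence. */
static struct vec4
vec4_rotl (struct vec4 v, unsigned n)
{
  struct vec4 left, right, out;
  unsigned i;

  assert (n > 0 && n < 32);

  for (i = 0; i < 4; i++)
    left.lane[i] = v.lane[i] << n;            /* shift left */
  for (i = 0; i < 4; i++)
    right.lane[i] = v.lane[i] >> (32 - n);    /* shift right */
  for (i = 0; i < 4; i++)
    out.lane[i] = left.lane[i] | right.lane[i]; /* combine */

  return out;
}
```

So the vector version handles four words at once, but spends three operations where the scalar code spends one per word.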
That doesn't quite explain why I saw a 45% speedup with Neon in 2013, a speedup which has now disappeared. But maybe current gcc schedules instructions well enough to produce code that can issue 2 instructions per cycle on Cortex-A9 (which has quite limited out-of-order capabilities), and gcc back then couldn't do that?
So what's next? Should the old code just be deleted?
With the new 2-way or 3-way functions, performance of the single-block functions isn't that critical, so deletion may be ok even if it causes a small regression on some processors (e.g., single-block chacha getting 13% slower on the old Cortex-A9).
I've made a branch with deletion of this code, "delete-1-way-neon". Any comments before I merge to master?
Regards, /Niels