Joachim Strömbergson <joachim@secworks.se> writes:
The big difference is that you update the variables in a QR twice during the QR processing, but the QR is more regular and can more easily be scheduled with fewer registers active in a given cycle.
Sounds like I have to look closer at the chacha spec to understand the details.
This is why I got a bit curious when you, Niels, stated: "And the particular change from 12 to 14 might add significant complexity to an optimized implementation with 4-way unrolling"
There was no deep thought behind that comment. It's just that if an assembly loop is unrolled 4 times, it simplifies the code if you can assume that the number of rounds you need is always divisible by 4.
Now, the current salsa20 implementations don't do that; _salsa20_core seems to support any even, non-zero number of rounds, for all of C, x86_64 and ARM NEON. And there's no obvious gain in doing more unrolling. It could possibly make more sense for chacha, if each round is shorter in terms of number of instructions, cycles, and dependencies.
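To make the divisibility point concrete, here is a rough C sketch. It is not nettle's code: the names permute_generic and permute_unrolled4 are made up for illustration, and the quarter round is the standard chacha one, where each of the four words is updated twice. The generic loop accepts any even, non-zero round count; the variant unrolled to four rounds per iteration only works when the count is a multiple of 4, so 12 is fine but 14 would need an extra tail case.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ROTL32(v, n) (((v) << (n)) | ((v) >> (32 - (n))))

/* Standard chacha quarter round: a, b, c and d are each updated twice. */
#define QR(a, b, c, d) do {                 \
    a += b; d ^= a; d = ROTL32(d, 16);      \
    c += d; b ^= c; b = ROTL32(b, 12);      \
    a += b; d ^= a; d = ROTL32(d, 8);       \
    c += d; b ^= c; b = ROTL32(b, 7);       \
  } while (0)

/* One column round followed by one diagonal round (two rounds total). */
static void
double_round(uint32_t x[16])
{
  QR(x[0], x[4], x[8],  x[12]);
  QR(x[1], x[5], x[9],  x[13]);
  QR(x[2], x[6], x[10], x[14]);
  QR(x[3], x[7], x[11], x[15]);

  QR(x[0], x[5], x[10], x[15]);
  QR(x[1], x[6], x[11], x[12]);
  QR(x[2], x[7], x[8],  x[13]);
  QR(x[3], x[4], x[9],  x[14]);
}

/* Hypothetical generic loop: any even, non-zero number of rounds. */
static void
permute_generic(uint32_t x[16], unsigned rounds)
{
  unsigned i;
  assert(rounds > 0 && rounds % 2 == 0);
  for (i = 0; i < rounds; i += 2)
    double_round(x);
}

/* Hypothetical 4-way unrolled loop: four rounds per iteration, so the
   round count must be a multiple of 4. Returns 0 and leaves x untouched
   if it is not (e.g. 14 rounds), 1 on success. */
static int
permute_unrolled4(uint32_t x[16], unsigned rounds)
{
  unsigned i;
  if (rounds == 0 || rounds % 4 != 0)
    return 0;
  for (i = 0; i < rounds; i += 4)
    {
      double_round(x);
      double_round(x);
    }
  return 1;
}
```

With 12 rounds both functions produce the same state. With 14 rounds the unrolled variant as written refuses; a real implementation would have to add a tail for the remaining two rounds, which is the extra complexity I had in mind.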
Regards, /Niels