Emil Velikov emil.l.velikov@gmail.com writes:
As you can see from the second patch, nettle performance is a little low wrt OpenSSL - ~55% for sha1, and ~65% for sha2.
I think Nettle's assembly code for sha1 is quite old, and hasn't been tuned for current processors with lots of parallelism.
If you analyze the data dependencies of sha1 carefully (I haven't looked into that for quite a while, though), I think the critical dependency path requires only 2-3 cycles per round, if instruction issue and execution can keep up with doing all the instructions not on the critical path in parallel.
If you look at the round function,
  C   e += a <<< 5 + f( b, c, d ) + k + w;
  C   b <<<= 30
what this means in practice is that we ought to identify which of the inputs a, b, c, d was updated in the previous round, and arrange the computation of the round function to use that input last. That minimizes the chain of dependent instructions from the use of that input until the new e is ready, and that chain is the critical path of the round.
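To make that concrete, here is a rough C sketch of the two orderings for a single round. It is only an illustration, not Nettle's actual code: the names rol32, f1, round_naive, round_reordered and check_orderings_agree are made up for this example, and it assumes the usual role rotation between rounds, in which the word produced by the previous round takes the role of a in the current one. Under that assumption, only the rotate and the final add depend on the freshly computed input.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 32-bit word left by n bits (0 < n < 32). */
static inline uint32_t rol32(uint32_t x, unsigned n)
{
  return (x << n) | (x >> (32 - n));
}

/* f for rounds 0..19: bitwise choice between c and d, selected by b. */
static inline uint32_t f1(uint32_t b, uint32_t c, uint32_t d)
{
  return d ^ (b & (c ^ d));
}

/* Straightforward ordering: the rotated a enters the sum first, so as
   written every later addition is chained behind it. Returns the new e. */
static inline uint32_t round_naive(uint32_t a, uint32_t b, uint32_t c,
                                   uint32_t d, uint32_t e,
                                   uint32_t k, uint32_t w)
{
  return rol32(a, 5) + f1(b, c, d) + k + w + e;
}

/* Reordered: everything that does not depend on a is summed first.
   Assumption (hypothetical for this sketch): a is the input updated by
   the previous round, so it is used last. Returns the new e. */
static inline uint32_t round_reordered(uint32_t a, uint32_t b, uint32_t c,
                                       uint32_t d, uint32_t e,
                                       uint32_t k, uint32_t w)
{
  uint32_t t = e + k + w;   /* independent of a, b, c, d */
  t += f1(b, c, d);         /* b, c, d are assumed to be ready early */
  return t + rol32(a, 5);   /* only the rotate and this add wait for a */
}

/* Both orderings must give the same value, since addition modulo 2^32
   is associative and commutative. */
static void check_orderings_agree(void)
{
  uint32_t a = 0x67452301, b = 0xEFCDAB89, c = 0x98BADCFE,
           d = 0x10325476, e = 0xC3D2E1F0;
  uint32_t k = 0x5A827999, w = 0x61626380;

  assert(round_naive(a, b, c, d, e, k, w)
         == round_reordered(a, b, c, d, e, k, w));
}
```

Note that a C compiler is free to reassociate unsigned additions, so the source-level ordering is only a hint there; the ordering is fully under our control in the assembly versions.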
If you want to play with that, I would start experimenting with the C implementation and see what improvements can be made there, and then use a similar organization for the assembly implementation(s).
Regards, /Niels