nisse@lysator.liu.se (Niels Möller) writes:
> I think the current indexing can be simplified a bit, and it would
> make sense to unroll some of the shorter loops inside the round loop.
I've now done some micro-optimizations along those lines, doubling the performance on x86_64. I now get:
  Algorithm    mode    Mbyte/s
  sha224       update    68.82
  sha256       update    68.83
  sha384       update   105.55
  sha512       update   105.56
  sha3_256     update    29.04
(Before, I had 12 Mbyte/s for sha3_256.)
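To illustrate the kind of unrolling and index simplification discussed above, here is a minimal sketch of the Keccak theta step written two ways. It is only an illustration, not the actual Nettle code: the function names (theta_rolled, theta_unrolled, theta_variants_agree) and the state layout A[x + 5*y] are assumptions made for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Rotate a 64-bit word left by n bits, 0 < n < 64. */
#define ROTL64(x, n) (((x) << (n)) | ((x) >> (64 - (n))))

/* Theta with generic loops and modular index arithmetic. */
void
theta_rolled(uint64_t A[25])
{
  uint64_t C[5], D[5];
  int x, y;

  for (x = 0; x < 5; x++)
    {
      C[x] = 0;
      for (y = 0; y < 5; y++)
        C[x] ^= A[x + 5*y];
    }
  for (x = 0; x < 5; x++)
    D[x] = C[(x + 4) % 5] ^ ROTL64(C[(x + 1) % 5], 1);
  for (y = 0; y < 5; y++)
    for (x = 0; x < 5; x++)
      A[x + 5*y] ^= D[x];
}

/* Same step with the short 5-iteration loops unrolled, so that all
   indices are constants and no modulo operation remains. */
void
theta_unrolled(uint64_t A[25])
{
  uint64_t C0 = A[0] ^ A[5] ^ A[10] ^ A[15] ^ A[20];
  uint64_t C1 = A[1] ^ A[6] ^ A[11] ^ A[16] ^ A[21];
  uint64_t C2 = A[2] ^ A[7] ^ A[12] ^ A[17] ^ A[22];
  uint64_t C3 = A[3] ^ A[8] ^ A[13] ^ A[18] ^ A[23];
  uint64_t C4 = A[4] ^ A[9] ^ A[14] ^ A[19] ^ A[24];

  uint64_t D0 = C4 ^ ROTL64(C1, 1);
  uint64_t D1 = C0 ^ ROTL64(C2, 1);
  uint64_t D2 = C1 ^ ROTL64(C3, 1);
  uint64_t D3 = C2 ^ ROTL64(C4, 1);
  uint64_t D4 = C3 ^ ROTL64(C0, 1);
  int i;

  /* One iteration per row of five lanes. */
  for (i = 0; i < 25; i += 5)
    {
      A[i]     ^= D0;
      A[i + 1] ^= D1;
      A[i + 2] ^= D2;
      A[i + 3] ^= D3;
      A[i + 4] ^= D4;
    }
}

/* Returns 1 if both variants give the same result on a sample state. */
int
theta_variants_agree(void)
{
  uint64_t a[25], b[25];
  uint64_t v = 0x0123456789abcdefULL;
  int i;

  /* Fill the state with arbitrary values from a simple 64-bit LCG. */
  for (i = 0; i < 25; i++)
    {
      v = v * 6364136223846793005ULL + 1442695040888963407ULL;
      a[i] = v;
    }
  memcpy(b, a, sizeof(a));
  theta_rolled(a);
  theta_unrolled(b);
  return memcmp(a, b, sizeof(a)) == 0;
}
```

Calling theta_variants_agree() from a small test program checks that the two variants compute the same thing. Whether the unrolled form is actually faster depends on the compiler and CPU, so it would need to be benchmarked.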
I'm not sure how to go about an assembly implementation. Some, but far from all, of the steps can make use of SSE2 SIMD instructions.
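As a rough sketch of one step that does map onto SSE2, the column parities used by theta can be computed two 64-bit lanes at a time with 128-bit XORs. This is an illustration using C intrinsics rather than assembly, and not Nettle code: the function names and the A[x + 5*y] layout are assumptions, and the code falls back to the scalar version when SSE2 is not available. Because each row has five lanes, the fifth column does not pair up and is handled with ordinary scalar code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Scalar reference: C[x] is the XOR of the five lanes in column x. */
void
column_parity_scalar(const uint64_t A[25], uint64_t C[5])
{
  int x;
  for (x = 0; x < 5; x++)
    C[x] = A[x] ^ A[x + 5] ^ A[x + 10] ^ A[x + 15] ^ A[x + 20];
}

/* Same computation, handling columns (0,1) and (2,3) as pairs in
   128-bit registers when SSE2 is available. */
void
column_parity_simd(const uint64_t A[25], uint64_t C[5])
{
#if defined(__SSE2__)
  int x, y;
  for (x = 0; x < 4; x += 2)
    {
      /* Unaligned loads, since rows of five lanes are not 16-byte
         aligned in general. */
      __m128i c = _mm_loadu_si128((const __m128i *) &A[x]);
      for (y = 1; y < 5; y++)
        c = _mm_xor_si128(c,
                          _mm_loadu_si128((const __m128i *) &A[x + 5*y]));
      _mm_storeu_si128((__m128i *) &C[x], c);
    }
  /* The odd fifth column is left to scalar code. */
  C[4] = A[4] ^ A[9] ^ A[14] ^ A[19] ^ A[24];
#else
  column_parity_scalar(A, C);
#endif
}

/* Returns 1 if both versions agree on a sample state. */
int
column_parity_variants_agree(void)
{
  uint64_t A[25], c1[5], c2[5];
  uint64_t v = 0xfedcba9876543210ULL;
  int i;

  /* Arbitrary test values from a simple 64-bit LCG. */
  for (i = 0; i < 25; i++)
    {
      v = v * 6364136223846793005ULL + 1442695040888963407ULL;
      A[i] = v;
    }
  column_parity_scalar(A, c1);
  column_parity_simd(A, c2);
  return memcmp(c1, c2, sizeof(c1)) == 0;
}
```

Steps that rotate each lane by a different amount fit less well, since the SSE2 64-bit shift instructions apply the same count to both lanes of a register.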
Regards,
/Niels