I've now tried writing some x86 code. Only the central sha1-compress function is in assembler, and I use m4 macros pretty heavily.
It doesn't quite work yet, but at least I get 118 MB/s, about the same speed as the C md5 code. That's a bit over 40% faster than the C sha1 code (81.9 MB/s in the benchmark quoted below), which is nice, but not as impressive as the speedup for the arcfour code.
The function is 1244 instructions after macro expansion, and it processes 64 bytes of input. That's about 19 instructions per byte, which is quite a lot of mangling.
I *almost* fit everything in registers. The problem is how to compute f3(x,y,z) = (x & y) | (z & (x | y)), where x, y and z are in registers, and the result should be stored in my *only* temporary register.
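To make the register pressure concrete, here is a small C sketch of two possible ways to evaluate f3 with a single temporary. These are illustrations only, not the code from the assembler file; the function names are made up, and the instruction mnemonics in the comments are just a guess at how each step might map to x86. Variant A uses the identity f3 = y ^ ((x ^ y) & (z ^ y)) and briefly modifies z before restoring it. Variant B relies on (x & y) and (z & (x ^ y)) never having a bit set in the same position, so the two terms can be added to the destination one at a time.

```c
#include <stdint.h>

/* f3 exactly as written above: bitwise majority of x, y and z. */
static uint32_t f3_ref(uint32_t x, uint32_t y, uint32_t z)
{
  return (x & y) | (z & (x | y));
}

/* Variant A (hypothetical): one temporary t, z is clobbered and then
   restored, so on return *z holds its original value again. */
static uint32_t f3_one_temp(uint32_t x, uint32_t y, uint32_t *z)
{
  uint32_t t;
  t = x;      /* movl x, t                          */
  t ^= y;     /* xorl y, t    t = x ^ y             */
  *z ^= y;    /* xorl y, z    z = z ^ y, for now    */
  t &= *z;    /* andl z, t    t = (x ^ y) & (z ^ y) */
  *z ^= y;    /* xorl y, z    z is restored         */
  t ^= y;     /* xorl y, t    t = f3(x, y, z)       */
  return t;
}

/* Variant B (hypothetical): add f3(x, y, z) to an accumulator in two
   steps. The two terms are bitwise disjoint, so + acts like |. */
static uint32_t add_f3(uint32_t acc, uint32_t x, uint32_t y, uint32_t z)
{
  uint32_t t;
  t = x;      /* movl x, t               */
  t &= y;     /* andl y, t    t = x & y  */
  acc += t;   /* addl t, acc             */
  t = x;      /* movl x, t               */
  t ^= y;     /* xorl y, t    t = x ^ y  */
  t &= z;     /* andl z, t               */
  acc += t;   /* addl t, acc             */
  return acc;
}
```

Variant B never leaves the complete value of f3 in the temporary, which is only acceptable because the round adds f3 to a state word anyway.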
I wonder how slow it is to use large immediate operands, like
addl $0x5A827999, %ebp
compared to a memory access via a register, like
addl 64(%esi), %ebp
One could shave off quite a few of them, with a minor change of the (internal) calling convention.
/ Niels Möller (sharpening the red pen)
Previous text:
2004-02-05 20:29: Subject: Nettle
I think it may be possible to do the sha1 compression function all in x86 registers. Five registers for the state, one for pointing to the input, and then one free temporary.
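As a rough illustration of that register budget, the following C sketch performs one round of the first group (the rounds using the constant 0x5A827999) while touching only the five state words, one input word and a single temporary. It is an assumption about how such a round could be laid out, not the actual assembler: the struct, the function names and the choice of f1 = d ^ (b & (c ^ d)) are all hypothetical. Instead of shuffling the state words after each round, the sketch updates e and b in place, so the roles of the five words rotate from round to round.

```c
#include <stdint.h>

/* Rotate left by n bits, 0 < n < 32. */
static uint32_t rol32(uint32_t v, unsigned n)
{
  return (v << n) | (v >> (32 - n));
}

/* The five state words; each would live in its own register. */
struct sha1_regs { uint32_t a, b, c, d, e; };

/* Textbook form of one round in the first group, with explicit
   shuffling of the state words. Used here only for comparison. */
static struct sha1_regs round_ref(struct sha1_regs s, uint32_t w)
{
  uint32_t f = (s.b & s.c) | (~s.b & s.d);
  uint32_t T = rol32(s.a, 5) + f + s.e + 0x5A827999u + w;
  struct sha1_regs r = { T, s.a, rol32(s.b, 30), s.c, s.d };
  return r;
}

/* Same round with one temporary t and no shuffling. Afterwards the
   new (a, b, c, d, e) are found in the old (e, a, b, c, d). */
static void round_f1(struct sha1_regs *s, uint32_t w)
{
  uint32_t t;
  t = s->c;
  t ^= s->d;
  t &= s->b;
  t ^= s->d;                /* t = f1(b, c, d)              */
  s->e += t;
  t = s->a;
  t = rol32(t, 5);
  s->e += t;                /* e += rol(a, 5)               */
  s->e += 0x5A827999u;      /* immediate operand            */
  s->e += w;                /* input word, read via pointer */
  s->b = rol32(s->b, 30);
}
```

With esp reserved as the stack pointer, x86 leaves seven general-purpose registers, which matches the count of five state words, one input pointer and one temporary.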
Benchmark for the C implementation of various hashes:
md2    (Update):   2.327 MB/s
md4    (Update): 171.846 MB/s
md5    (Update): 114.488 MB/s
sha1   (Update):  81.916 MB/s
sha256 (Update):  43.055 MB/s
/ Niels Möller (sharpening the red pen)