The problem is how to compute f3(x,y,z) = (x & y) | (z & (x | y)), where x, y and z are in registers, and the result should be stored in my *only* temporary register.
That may be tricky if x, y and z all have to be preserved. (If any one of them can be overwritten, it's easy.)
/ Leif Stensson, Lysator
Previous text:
2004-02-06 00:30: Subject: Nettle
Now I've tried writing some x86 code. I do only the central sha1-compress function in assembler. I use m4 macros pretty heavily.
It doesn't quite work yet, but at least I get 118 MB/s, almost exactly the same speed as for the C md5 code. That's a 40% speedup, nice, but not as impressive as the arcfour code.
The function is 1244 instructions after macro expansion, and it processes 64 bytes of input, which is quite a lot of mangling per byte.
I *almost* fit everything in registers. The problem is how to compute f3(x,y,z) = (x & y) | (z & (x | y)), where x, y and z are in registers, and the result should be stored in my *only* temporary register.
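A sketch of two possible ways around this, written as C that mimics two-operand x86 instructions. Neither is taken from the code under discussion, and the function names are made up for illustration. The first assumes the f3 value is only needed as a term in a sum, as in the SHA-1 round update: since x & y and z & (x ^ y) never have a bit in common, their OR equals their sum, so each term can be added to the accumulator separately through the single temporary. The second assumes it is acceptable to modify x briefly as long as it is restored, and uses f3(x,y,z) = y ^ ((x ^ y) & (z ^ y)).

```c
#include <stdint.h>

/* Reference definition, as given in the post. */
static uint32_t f3_ref(uint32_t x, uint32_t y, uint32_t z)
{
    return (x & y) | (z & (x | y));
}

/* Variant 1 (hypothetical name): add f3(x, y, z) to acc using one
   temporary t.  x, y and z are never written.  Relies on the two
   terms being bitwise disjoint, so OR and ADD give the same result. */
static uint32_t add_f3(uint32_t acc, uint32_t x, uint32_t y, uint32_t z)
{
    uint32_t t;
    t = x;      /* movl x, t                      */
    t &= y;     /* andl y, t    t = x & y         */
    acc += t;   /* addl t, acc                    */
    t = x;      /* movl x, t                      */
    t ^= y;     /* xorl y, t    t = x ^ y         */
    t &= z;     /* andl z, t    t = z & (x ^ y)   */
    acc += t;   /* addl t, acc                    */
    return acc;
}

/* Variant 2 (hypothetical name): leave f3(x, y, z) in the temporary.
   *x is modified for one step and then restored. */
static uint32_t f3_one_temp(uint32_t *x, uint32_t y, uint32_t z)
{
    uint32_t t;
    t = z;      /* movl z, t                          */
    t ^= y;     /* xorl y, t    t = z ^ y             */
    *x ^= y;    /* xorl y, x    x = x ^ y (temporary) */
    t &= *x;    /* andl x, t    t = (z ^ y) & (x ^ y) */
    *x ^= y;    /* xorl y, x    x restored            */
    t ^= y;     /* xorl y, t    t = f3(x, y, z)       */
    return t;
}
```

Variant 1 costs seven instructions but touches nothing except the temporary and the accumulator. Variant 2 costs six and produces the value itself, at the price of a short window during which x holds x ^ y.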
I wonder how slow it is to use large immediate operands, like
addl $0x5A827999, %ebp
compared to an access via a register, like
addl 64(%esi), %ebp
One could shave off quite a few of them, with a minor change of the (internal) calling convention.
/ Niels Möller (sharpening the red pen)