I think it may be possible to do the sha1 compression function all in x86 registers. Five registers for the state, one for pointing to the input, and then one free temporary.
Benchmark for the C implementation of various hashes:
md2 (Update): 2.327MB/s md4 (Update): 171.846MB/s md5 (Update): 114.488MB/s sha1 (Update): 81.916MB/s sha256 (Update): 43.055MB/s
/ Niels Möller (vässar rödpennan)
Previous text:
2004-02-05 17:43: Subject: Nettle
If you are in a good hacking mood, any speedup in SHA1 would be most welcome. Since it is used in bittorrent, quite a few gigabytes runs through it.
/ Martin Nilsson (saturator)
Isn't there a better metric available? CPU cycles per bit or something?
/ Martin Nilsson (saturator)
Previous text:
2004-02-05 20:29: Subject: Nettle
I think it may be possible to do the sha1 compression function all in x86 registers. Five registers for the state, one for pointing to the input, and then one free temporary.
Benchmark for the C implementation of various hashes:
md2 (Update): 2.327MB/s md4 (Update): 171.846MB/s md5 (Update): 114.488MB/s sha1 (Update): 81.916MB/s sha256 (Update): 43.055MB/s
/ Niels Möller (vässar rödpennan)
I think it's adequate for the things I've used the benchmarks for:
1. Comparing hash performance between machines.
2. Comparing two implementations of the same hash function, both running on the same machine.
3. Comparing two different hash functions, both running on the same machine.
How do you use benchmark figures?
/ Niels Möller (vässar rödpennan)
Previous text:
2004-02-05 20:34: Subject: Nettle
Isn't there a better metric available? CPU cycles per bit or something?
/ Martin Nilsson (saturator)
It is comparing between machines that is difficult. Typically you would like to have a good baseline benchmark to normalize the rest with, like the dB scale for signal strength. hash-speed/memxor-speed or so.
/ Martin Nilsson (saturator)
Previous text:
2004-02-05 21:21: Subject: Nettle
I think it's adequate for the things I've used the benchmarks for:
Comparing hash performance between machines.
Comparing two implementations of the same hash function, both
running on the same machine.
- Comparing two different hash functions, both running on the same
machine.
How do you use benchmark figures?
/ Niels Möller (vässar rödpennan)
Why? It's easy enough to divide the numbers directly.
Say, machine A has N MB/sec, and machine B M MB/sec. Thus, A is N/M times as fast as B.
"normalizing" would gain absolutely nothing. Actually, you would loose relevance in the numbers, since we are trying to optimize a specific task, normalizing to some 'standard' machine would make it way harder to know how fast the hashing really is.
/ Per Hedbor ()
Previous text:
2004-02-05 21:31: Subject: Nettle
It is comparing between machines that is difficult. Typically you would like to have a good baseline benchmark to normalize the rest with, like the dB scale for signal strength. hash-speed/memxor-speed or so.
/ Martin Nilsson (saturator)
Now I've tried writing some x86 code. I do only the central sha1-compress function in assembler. I use m4 macros pretty heavily.
It doesn't quite work yet, but at least I get 118 MB/s, almost exactly the same speed as for the C md5 code. That's a 40% speedup, nice, but not as impressive as the arcfour code.
The function is 1244 instructions after macro expansion, and it processes 64 bytes of input, which is quite a lot of mangling per byte.
I *almost* fit everything in registers. The problem is how to compute f3(x,y,z) = (x & y) | (z & (x | y)), where x, y and z are in registers, and the result should be stored in my *only* temporary register.
I wonder how slow is it to use large immediate operands, like
addl $0x5A827999, %ebp
compared to an access via a register, like
addl 64(%esi), %ebp
One could shave of quite a few of them, with a minor change of the (internal) calling convention.
/ Niels Möller (vässar rödpennan)
Previous text:
2004-02-05 20:29: Subject: Nettle
I think it may be possible to do the sha1 compression function all in x86 registers. Five registers for the state, one for pointing to the input, and then one free temporary.
Benchmark for the C implementation of various hashes:
md2 (Update): 2.327MB/s md4 (Update): 171.846MB/s md5 (Update): 114.488MB/s sha1 (Update): 81.916MB/s sha256 (Update): 43.055MB/s
/ Niels Möller (vässar rödpennan)
The problem is how to compute f3(x,y,z) = (x & y) | (z & (x | y)), where x, y and z are in registers, and the result should be stored in my *only* temporary register.
That may be tricky if x, y and z all have to be preserved. (If any one of them can be overwritten, it's easy.)
/ Leif Stensson, Lysator
Previous text:
2004-02-06 00:30: Subject: Nettle
Now I've tried writing some x86 code. I do only the central sha1-compress function in assembler. I use m4 macros pretty heavily.
It doesn't quite work yet, but at least I get 118 MB/s, almost exactly the same speed as for the C md5 code. That's a 40% speedup, nice, but not as impressive as the arcfour code.
The function is 1244 instructions after macro expansion, and it processes 64 bytes of input, which is quite a lot of mangling per byte.
I *almost* fit everything in registers. The problem is how to compute f3(x,y,z) = (x & y) | (z & (x | y)), where x, y and z are in registers, and the result should be stored in my *only* temporary register.
I wonder how slow is it to use large immediate operands, like
addl $0x5A827999, %ebp
compared to an access via a register, like
addl 64(%esi), %ebp
One could shave of quite a few of them, with a minor change of the (internal) calling convention.
/ Niels Möller (vässar rödpennan)
They must be preserved.
/ Niels Möller (vässar rödpennan)
Previous text:
2004-02-06 13:05: Subject: Nettle
The problem is how to compute f3(x,y,z) = (x & y) | (z & (x | y)), where x, y and z are in registers, and the result should be stored in my *only* temporary register.
That may be tricky if x, y and z all have to be preserved. (If any one of them can be overwritten, it's easy.)
/ Leif Stensson, Lysator
Since the Pentium is based on the Z80, can't you use the EXX instruction to swap in a new set of registers?
/ Marcus Comstedt (ACROSS) (Hail Ilpalazzo!)
Previous text:
2004-02-06 14:25: Subject: Nettle
I use all 32 bits, all of which must be preserved.
/ Niels Möller (vässar rödpennan)
pike-devel@lists.lysator.liu.se