Nettle

5 Feb 2004


Now I've tried writing some x86 code. I do only the central
sha1-compress function in assembler. I use m4 macros pretty heavily.
It doesn't quite work yet, but at least I get 118 MB/s, almost exactly
the same speed as for the C md5 code. That's a 40% speedup, nice, but
not as impressive as the arcfour code.

The function is 1244 instructions after macro expansion, and it
processes 64 bytes of input, which is quite a lot of mangling per
byte.

I *almost* fit everything in registers. The problem is how to compute
f3(x,y,z) = (x & y) | (z & (x | y)), where x, y and z are in
registers, and the result should be stored in my *only* temporary
register.
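
One possible way around the single-temporary limit, sketched in C rather than assembler: the two terms (x & y) and (z & (x ^ y)) never have a bit set in the same position, so their OR equals their sum, and f3 can be added into the target in two steps that each need only one scratch value. This is a standard identity, not necessarily what the assembler code ends up doing; the names f3_ref and add_f3_one_tmp are made up for this illustration.

```c
#include <stdint.h>

/* f3 exactly as written in the post. */
static uint32_t f3_ref(uint32_t x, uint32_t y, uint32_t z)
{
  return (x & y) | (z & (x | y));
}

/* Hypothetical helper: returns acc + f3(x, y, z) (mod 2^32) while
   using a single temporary t and leaving x, y and z untouched.
   Each statement corresponds to one two-operand x86 instruction. */
static uint32_t add_f3_one_tmp(uint32_t acc, uint32_t x, uint32_t y, uint32_t z)
{
  uint32_t t;

  t = x;      /* movl x, t */
  t &= y;     /* andl y, t    t = x & y */
  acc += t;   /* addl t, acc */

  t = x;      /* movl x, t */
  t ^= y;     /* xorl y, t    t = x ^ y */
  t &= z;     /* andl z, t    t = z & (x ^ y), disjoint from x & y */
  acc += t;   /* addl t, acc */

  return acc;
}
```

With x = 0xF0F0F0F0, y = 0xCCCCCCCC and z = 0xAAAAAAAA every combination of input bits occurs, and both functions give 0xE8E8E8E8. The price is one extra add per round compared to computing f3 in one piece.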

I wonder how slow it is to use large immediate operands, like
	addl	$0x5A827999, %ebp
compared to a memory operand addressed via a register, like
	addl	64(%esi), %ebp
One could shave off quite a few of them, with a minor change of the
(internal) calling convention.

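
To make the second alternative concrete, here is a rough C sketch of a data layout that would match the 64(%esi) operand: the 16 input words followed directly by the four SHA-1 round constants, so that the first constant sits at byte offset 64 from the block pointer. The struct and function names are hypothetical, and the layout is only an assumption about what the changed calling convention could look like. In C the compiler chooses the instruction encoding, so this shows the layout, not the speed difference.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: input block followed by the round constants
   0x5A827999, 0x6ED9EBA1, 0x8F1BBCDC, 0xCA62C1D6 (filled in by the
   caller), so that k[i] is at byte offset 64 + 4*i. */
struct sha1_block_with_k
{
  uint32_t w[16];
  uint32_t k[4];
};

/* Constant as an immediate, like addl $0x5A827999, %ebp. */
static uint32_t add_k_immediate(uint32_t e)
{
  return e + 0x5A827999u;
}

/* Constant fetched via the block pointer, like addl 64(%esi), %ebp. */
static uint32_t add_k_memory(const struct sha1_block_with_k *b, uint32_t e)
{
  return e + b->k[0];
}
```

Both functions return the same value when k[0] holds 0x5A827999; the difference is only where the constant comes from.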
/ Niels Möller (sharpening the red pen)
Previous text:
...
2004-02-05 20:29:
Subject: Nettle

I think it may be possible to do the sha1 compression function all in
x86 registers. Five registers for the state, one for pointing to the
input, and then one free temporary.

Benchmark for the C implementation of various hashes:
     md2 (Update): 2.327MB/s
     md4 (Update): 171.846MB/s
     md5 (Update): 114.488MB/s
    sha1 (Update): 81.916MB/s
  sha256 (Update): 43.055MB/s


/ Niels Möller (sharpening the red pen)
