Nettle

6 Feb 2004

      ...
The problem is how to compute f3(x,y,z) = (x & y) | (z & (x | y)),
where x, y and z are in registers, and the result should be stored
in my *only* temporary register.
That may be tricky if x, y and z all have to be preserved. (If any one
of them can be overwritten, it's easy.)
/ Leif Stensson, Lysator
Previous text:
...
2004-02-06 00:30:
Subject: Nettle

Now I've tried writing some x86 code. I do only the central
sha1-compress function in assembler. I use m4 macros pretty heavily.
It doesn't quite work yet, but at least I get 118 MB/s, almost exactly
the same speed as for the C md5 code. That's a 40% speedup, nice, but
not as impressive as the arcfour code.
The function is 1244 instructions after macro expansion, and it
processes 64 bytes of input, which is quite a lot of mangling per
byte.
I *almost* fit everything in registers. The problem is how to compute
f3(x,y,z) = (x & y) | (z & (x | y)), where x, y and z are in
registers, and the result should be stored in my *only* temporary
register.
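
For reference, here are two possible ways around the single-temporary limit, modelled in C with one `tmp` variable standing in for the one free register. This is only a sketch, not necessarily what the assembler code ends up doing. The function names are made up for illustration, and the instruction names in the comments only suggest how each statement might map to a two-operand x86 instruction. The first variant assumes the value of f3 is only needed as a term added to an accumulator, so it never has to exist in one register. The second assumes z may be modified temporarily as long as it is restored afterwards.

```c
#include <assert.h>
#include <stdint.h>

/* Reference definition, exactly as stated in the text. */
static uint32_t f3_ref(uint32_t x, uint32_t y, uint32_t z)
{
    return (x & y) | (z & (x | y));
}

/* Variant A (assumes f3 is only ever added to an accumulator).
   (x & y) and (z & (x ^ y)) never have the same bit set, so their
   OR equals their sum and the two halves can be added separately. */
static uint32_t f3_accumulate(uint32_t acc, uint32_t x, uint32_t y, uint32_t z)
{
    uint32_t tmp;
    tmp = x;      /* movl */
    tmp &= y;     /* andl: tmp = x & y */
    acc += tmp;   /* addl */
    tmp = x;      /* movl */
    tmp ^= y;     /* xorl: tmp = x ^ y */
    tmp &= z;     /* andl: tmp = z & (x ^ y) */
    acc += tmp;   /* addl */
    return acc;
}

/* Variant B (assumes z may be changed temporarily).
   Uses f3(x,y,z) = y ^ ((x ^ y) & (z ^ y)); z is XORed with y on
   entry and XORed again on exit, so its old value comes back. */
static uint32_t f3_restore(uint32_t x, uint32_t y, uint32_t *z)
{
    uint32_t tmp;
    *z ^= y;      /* xorl: z = z ^ y */
    tmp = x;      /* movl */
    tmp ^= y;     /* xorl: tmp = x ^ y */
    tmp &= *z;    /* andl: tmp = (x ^ y) & (z ^ y) */
    tmp ^= y;     /* xorl: tmp = f3(x, y, z) */
    *z ^= y;      /* xorl: z restored */
    return tmp;
}

/* Compare both variants against the reference on a few bit patterns. */
static void f3_self_check(void)
{
    static const uint32_t v[] = { 0, 0xFFFFFFFFu, 0x5A827999u, 0x0F0F0F0Fu };
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            for (int k = 0; k < 4; k++) {
                uint32_t z = v[k];
                uint32_t want = f3_ref(v[i], v[j], v[k]);
                assert(f3_accumulate(7u, v[i], v[j], z) == 7u + want);
                assert(f3_restore(v[i], v[j], &z) == want);
                assert(z == v[k]);
            }
}
```

Variant A costs seven operations but leaves x, y and z untouched throughout. Variant B costs six and leaves the result in the temporary, at the price of z holding a different value in the middle of the sequence.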
I wonder how slow it is to use large immediate operands, like
addl	$0x5A827999, %ebp
compared to an access via a register, like
addl	64(%esi), %ebp
One could shave off quite a few of them, with a minor change of the
(internal) calling convention.
/ Niels Möller (sharpening the red pen)
