Re: GCM vs SHA1

22 Sep 2013


      nisse@lysator.liu.se (Niels Möller) writes:
...
sha1 needs 80 rounds to process 64 input bytes. Each round needs some 15
instructions, and with sufficient independence for reasonable instruction
level parallelism. So that's roughly 20 instructions per byte. Nettle's
current x86_64 code seems to get down to 7.7 cycles/byte on the machine
I have here, with some room for further optimization. openssl gets it
down a bit further, to 6 cycles/byte.
[...]
...
I think my attempts at assembly implementation, which haven't made much
progress, suffer from memxor overhead. [...] Around 9-10 cycles/byte
(benchmarking the top-level gcm_update). I think I'd need to
reimplement the gcm_hash function, inlining the xoring of the input
data.
I just checked in a rewrite. Down to 7.5 cycles/byte on the above Intel
machine. I have a per-block iteration which isn't completely unrolled,
but without any subroutine calls and only two simple subloops running 7
iterations each. I get it to 252 instructions executed, or almost 16
instructions per byte. So 7.5 cycles means I get two instructions
executed per cycle, which is the best possible on this cpu.
For some reason, the current loop is slower on my AMD machine, at 8.4
cycles per byte. *If* scheduling could be improved to get the maximum of
3 instructions per cycle, I'd get down to 5.5 cycles/byte or so.
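
For reference, the arithmetic behind these estimates can be sketched as
follows. The constant and function names are made up for illustration; the
figures are the ones quoted in this thread, and the 16-byte block size is
that of GCM. It is only a back-of-envelope check, not a benchmark.

```python
# Back-of-envelope throughput arithmetic using the figures quoted in this thread.
# All names here are illustrative, not taken from Nettle.

GCM_BLOCK_BYTES = 16      # GHASH works on 128-bit blocks
INSTR_PER_BLOCK = 252     # executed instructions per block, as reported above


def cycles_per_byte(instr_per_byte, instr_per_cycle):
    """Cycles per byte if the CPU sustains the given instructions per cycle."""
    return instr_per_byte / instr_per_cycle


# 252 instructions over 16 bytes: "almost 16 instructions per byte".
instr_per_byte = INSTR_PER_BLOCK / GCM_BLOCK_BYTES

# Instructions per cycle implied by the measured 7.5 cycles/byte: roughly two.
ipc_measured = instr_per_byte / 7.5

# Hypothetical best case at 3 instructions per cycle: "5.5 cycles/byte or so".
best_case = cycles_per_byte(instr_per_byte, 3)

# The sha1 estimate from the quoted text: 80 rounds of about 15 instructions
# per 64-byte block, i.e. "roughly 20 instructions per byte".
sha1_instr_per_byte = 80 * 15 / 64

print(instr_per_byte, round(ipc_measured, 2), best_case, sha1_instr_per_byte)
```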
The file in question is
https://git.lysator.liu.se/nettle/nettle/blobs/master/x86_64/gcm-hash8.asm.
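
For readers who don't read x86_64 assembly, here is a rough sketch of what
"inlining the xoring of the input data" means for a GHASH-style update. It is
not a transcription of gcm-hash8.asm: it uses the plain bit-by-bit field
multiplication from the GCM specification rather than the table-driven method,
and the names gf_mul and ghash_update are hypothetical. The point is only that
the input block is xored into the state inside the per-block loop, with no
separate memxor call.

```python
# Illustrative GHASH-style update; names are hypothetical, not Nettle's API.

R = 0xE1 << 120  # reduction constant for GCM's bit-reflected GF(2^128)


def gf_mul(x, y):
    """Multiply two field elements (128-bit ints in GCM bit order), bit by bit."""
    z, v = 0, y
    for i in range(128):
        if (x >> (127 - i)) & 1:   # bit i of x, most significant bit first
            z ^= v
        # Multiply v by the field's "x": shift right, reduce if a bit fell off.
        v = (v >> 1) ^ R if v & 1 else v >> 1
    return z


def ghash_update(state, h, data):
    """Absorb whole 16-byte blocks; the xor with the input is inlined here."""
    assert len(data) % 16 == 0
    for off in range(0, len(data), 16):
        block = int.from_bytes(data[off:off + 16], "big")
        state = gf_mul(state ^ block, h)
    return state
```

In an assembly version the same idea would mean loading the input block and
xoring it into the state registers directly, instead of calling a generic
memxor on a buffer first.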
Regards,
/Niels
-- 
Niels Möller. PGP-encrypted email is preferred. Keyid C0B98E26.
Internet email is subject to wholesale government surveillance.
