nisse@lysator.liu.se (Niels Möller) writes:
Nikos noted (off list) that Nettle's gcm hashing is slower than sha1. Which seems contrary to what's expected.
If people expect that gcm is faster than ha1, I'm curious how they reaon.
sha1 needs 80 rounds to process 64 input bytes. Each round needs some 15 instruction, and with sufficient independence for reasonable instruction level parallelism. So that's roughly 20 instructions per byte. Nettle's current x86_64 code seems to get down to 7.7 cycles/byte on the machine I have here, with some room for further optimization. openssl gets it down a bit further, to 6 cycles/byte.
gcm hashing is one gf2 multiplication, which needs 16 instructions. Dependencies doesn't look too bad. It seems posible to get this routine alone to run at about 6 cycles/byte. On top of that is the xoring the input data, which should be less than a cycle/byte, a bit depending on the alignment of the callers' data buffer.
I think my attemps at assembly implementation, which haven't made much progress, suffer from memxor overhead. I have so far only done the gf2 multiplication in assembly, but that gives little, if any, gain over the code generated by gcc. Around 9-10 cycles/byte (benchmarking the top-level gcm_update). I think I'd need to reimplement the gcm_hash function, inlining the xoring of the input data.
But I doubt it's possible to get gcm down to 6 cycles/byte. And then it still wouldn't be faster than a well optimized sha1. Machines with a fast shld (two-register shift) could have a couple of instructino and maybe a cycle or so per byte.
And then there's specialized instruction, like pclmul, which should be helpful. I haven't yet looked into that.
Regards, /Niels