nisse@lysator.liu.se (Niels Möller) writes:
sha1 needs 80 rounds to process 64 input bytes. Each round needs some 15 instructions, with sufficient independence for reasonable instruction-level parallelism. So that's roughly 20 instructions per byte. Nettle's current x86_64 code seems to get down to 7.7 cycles/byte on the machine I have here, with some room for further optimization. openssl gets it down a bit further, to 6 cycles/byte.
[...]
I think my attempts at an assembly implementation, which haven't made much progress, suffer from memxor overhead. [...] Around 9-10 cycles/byte (benchmarking the top-level gcm_update). I think I'd need to reimplement the gcm_hash function, inlining the xoring of the input data.
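To make the inlining idea concrete, here is a minimal Python sketch. It is not Nettle's gcm_hash; it uses the plain bit-by-bit GF(2^128) multiply from the GCM specification, and the `memxor` helper and both function names are stand-ins chosen for illustration. It only shows that folding the xor into the per-block loop computes the same hash as a separate xor pass; it says nothing about speed. It assumes the input length is a multiple of 16 bytes.

```python
# Illustrative sketch only: NOT Nettle's code. All names here are hypothetical.

R = 0xE1 << 120  # reduction constant for GCM's bit-reflected field


def gf_mul(x, y):
    """Multiply two 128-bit field elements (bit 0 is the most significant bit)."""
    z, v = 0, y
    for i in range(128):
        if (x >> (127 - i)) & 1:
            z ^= v
        v = (v >> 1) ^ R if v & 1 else v >> 1
    return z


def memxor(dst, src):
    """Stand-in helper: xor src into dst byte by byte, as a separate pass."""
    for i in range(len(src)):
        dst[i] ^= src[i]


def ghash_memxor(h, data):
    """Per block: a separate xor pass over the state buffer, then the multiply."""
    state = bytearray(16)
    for off in range(0, len(data), 16):
        memxor(state, data[off:off + 16])
        product = gf_mul(int.from_bytes(state, "big"), h)
        state = bytearray(product.to_bytes(16, "big"))
    return bytes(state)


def ghash_inline(h, data):
    """Per block: the xor is folded into loading the block, with no extra pass."""
    state = 0
    for off in range(0, len(data), 16):
        block = int.from_bytes(data[off:off + 16], "big")
        state = gf_mul(state ^ block, h)
    return state.to_bytes(16, "big")
```

Both functions return the same 16-byte value for the same key and input; the second simply never touches the data in a separate pass.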
I just checked in a rewrite. Down to 7.5 cycles/byte on the above Intel machine. I have a per-block iteration which isn't completely unrolled, but without any subroutine calls and only two simple subloops running 7 iterations each. It comes to 252 instructions executed per 16-byte block, or almost 16 instructions per byte. So 7.5 cycles/byte means I get about two instructions executed per cycle, which is the best possible on this cpu.
For some reason, the current loop is slower on my AMD machine, at 8.4 cycles per byte. *If* scheduling could be improved to get the maximum of 3 instructions per cycle, I'd get down to 5.5 cycles/byte or so.
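For anyone who wants to redo the arithmetic, a small Python sketch follows. The function names are mine, and the only inputs are the figures quoted above (252 instructions per 16-byte block, 7.5 and 8.4 cycles/byte, and a hoped-for 3 instructions per cycle).

```python
# Back-of-the-envelope throughput arithmetic; helper names are illustrative.

BLOCK_BYTES = 16  # GCM hashes the input in 16-byte blocks


def instructions_per_byte(instructions_per_block, block_bytes=BLOCK_BYTES):
    """Instructions executed per input byte."""
    return instructions_per_block / block_bytes


def instructions_per_cycle(instr_per_byte, cycles_per_byte):
    """Achieved instructions per cycle at a measured cycles/byte figure."""
    return instr_per_byte / cycles_per_byte


def cycles_per_byte_at(instr_per_byte, ipc):
    """Cycles/byte one would see if the loop sustained the given IPC."""
    return instr_per_byte / ipc


ipb = instructions_per_byte(252)
print("instructions/byte:", ipb)
print("IPC at 7.5 cycles/byte:", round(instructions_per_cycle(ipb, 7.5), 2))
print("IPC at 8.4 cycles/byte:", round(instructions_per_cycle(ipb, 8.4), 2))
print("cycles/byte at 3 IPC:", cycles_per_byte_at(ipb, 3))
```

With these inputs the loop runs at roughly 2.1 instructions per cycle on the Intel machine and a bit under 1.9 on the AMD one, and a sustained 3 instructions per cycle would give 5.25 cycles/byte, in the same ballpark as the estimate above.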
The file in question is https://git.lysator.liu.se/nettle/nettle/blobs/master/x86_64/gcm-hash8.asm.
Regards, /Niels