On 02/06/2011 12:08 AM, Niels Möller wrote:
The unoptimized GF(2^128) multiply function really is awfully slow. On x86_64, gmac takes 830 cycles/byte! For comparison, sha1, sha256 and sha512 take 8, 18 and 12 cycles/byte respectively, so the current code is two orders of magnitude slower than hmac-sha1. It remains to be seen how much table space and/or assembly hacking is needed to get reasonable performance.
There is a special instruction for carry-less multiplication on new Intel and AMD CPUs:
http://software.intel.com/en-us/articles/intel-carry-less-multiplication-ins...
http://en.wikipedia.org/wiki/CLMUL_instruction_set
Unfortunately I don't have anything close to those CPUs...
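For anyone else without such hardware, here is a rough portable sketch of what a carry-less multiply computes: a 64x64 -> 128 bit product where partial products are combined with XOR instead of addition. The names clmul64 and struct clmul128 are made up for illustration and are not part of nettle or any other library. A full GF(2^128) multiply for GCM would also need the reduction modulo the field polynomial and the GCM bit ordering, which are not shown here.

```c
#include <stdint.h>

/* Hypothetical 128-bit result type (illustration only). */
struct clmul128 { uint64_t lo, hi; };

/* Carry-less multiply of two 64-bit polynomials over GF(2).
   Bit i of each operand is the coefficient of x^i. For every set
   bit i of b, a shifted left by i is XORed into the 128-bit result. */
static struct clmul128
clmul64(uint64_t a, uint64_t b)
{
  struct clmul128 r = { 0, 0 };
  unsigned i;

  for (i = 0; i < 64; i++)
    if ((b >> i) & 1)
      {
        r.lo ^= a << i;
        /* Shifting by 64 is undefined in C, so skip i == 0,
           where nothing spills into the high word anyway. */
        if (i > 0)
          r.hi ^= a >> (64 - i);
      }
  return r;
}
```

The loop runs 64 iterations per product, which gives some idea of why a bit-at-a-time software multiply is so slow compared to a single instruction.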
regards, Nikos