Nikos Mavrogiannopoulos nmav@gnutls.org writes:
On 02/06/2011 12:08 AM, Niels Möller wrote:
It remains to see how much table space and/or assembly hacking is needed to get reasonable performance.
There is a special instruction for that on new intel and AMD CPUs... http://software.intel.com/en-us/articles/intel-carry-less-multiplication-ins... http://en.wikipedia.org/wiki/CLMUL_instruction_set
Interesting. I haven't played with any such special instructions (even if it ought to make a bit of difference also for aes).
Anyway, I've been hacking a bit on the C-implementation over the day, and the galois hashing (gmac) is now 18 times(!) faster. Summary of changes:
Original unoptimized code:
Algorithm mode Mbyte/s cycles/byte cycles/block gmac auth 1.49 829.83 13277.27
Optimized rshift, rewritten to use word-sized operations:
Algorithm mode Mbyte/s cycles/byte cycles/block gmac auth 4.62 268.14 4290.23
Optimized gf_mul, rewritten to use separate byte and bit loops:
Algorithm mode Mbyte/s cycles/byte cycles/block gmac auth 5.46 227.18 3634.90
Moved reduction into shift function:
Algorithm mode Mbyte/s cycles/byte cycles/block gmac auth 6.79 182.69 2923.02
Introduced 4-bit tables:
Algorithm mode Mbyte/s cycles/byte cycles/block gmac auth 27.14 45.68 730.82
Remaining tricks:
* Try 8-bit tables (which increases storage need a lot, the modest 4-bit tables need only 256 additional bytes per key, for gf_mul, and a 32-byte constant table for gf_shift. Extending to 8 bits makes both tables 16 times larger).
* See if it makes sense to write any assembler for the hashing function.
* Various smaller microoptimizations, like a public memxor-variant for when areas are known to be word-aligned. Or inline that xor:ing.
Regards, /Niels