nisse@lysator.liu.se (Niels Möller) writes:
I've written a first version of a gcm_hash for x86_64, using the pclmulqdq (carryless mul) instructions. With only a single block at a time, no interleaving, this gives 4.3 GByte/s.
I've added proper config and fat setup and merged this. It could surely be improved further, but it's already much faster than the C version on processors that support these instructions.
I'm considering reorganizing the internal gcm functions. I think I'd like to have
void _nettle_ghash_set_key (struct gcm_key *gcm, const union nettle_block16 *key);
which sets the key (typically, the key is the all-zero block encrypted using AES).
void _nettle_ghash_update (const struct gcm_key *key, union nettle_block16 *x, size_t length, const uint8_t *data);
where the input is complete blocks (padding done in the calling C code). Not sure if length should be block count or byte count.
void _nettle_ghash_digest (union nettle_block16 *digest, const union nettle_block16 *x);
which xors the final state into the digest block. The main point of this function is that the implementation can choose its internal byte order, eliminating byteswaps at the start and end of the update function.
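To make the intended semantics of the proposed split concrete, here is a minimal, self-contained sketch of GHASH over complete blocks, using the bit-at-a-time GF(2^128) multiplication from NIST SP 800-38D. The block16 type and the function names gf128_mul, ghash_update, ghash_digest are illustrative stand-ins, not Nettle's actual internals; the real table-driven and pclmulqdq code is of course much faster.

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative stand-in for union nettle_block16: bytes 0..7 in hi,
   bytes 8..15 in lo, both interpreted big-endian. */
typedef struct { uint64_t hi, lo; } block16;

/* GF(2^128) multiplication with GHASH's bit ordering and reduction
   polynomial x^128 + x^7 + x^2 + x + 1 (R = 0xe1 || 0^120). */
static block16
gf128_mul (block16 x, block16 y)
{
  block16 z = { 0, 0 };
  block16 v = x;
  int i;
  for (i = 0; i < 128; i++)
    {
      /* Bit i of y, most significant bit of byte 0 first. */
      uint64_t bit = (i < 64) ? (y.hi >> (63 - i)) & 1
                              : (y.lo >> (127 - i)) & 1;
      if (bit)
        { z.hi ^= v.hi; z.lo ^= v.lo; }
      /* Multiply v by x (a right shift in this bit order), reducing
         mod the polynomial when a bit falls off the end. */
      uint64_t carry = v.lo & 1;
      v.lo = (v.lo >> 1) | (v.hi << 63);
      v.hi >>= 1;
      if (carry)
        v.hi ^= 0xe100000000000000ULL;
    }
  return z;
}

/* Complete blocks only; any padding is the caller's job, as proposed. */
static void
ghash_update (block16 h, block16 *x, size_t blocks, const block16 *data)
{
  size_t i;
  for (i = 0; i < blocks; i++)
    {
      x->hi ^= data[i].hi;
      x->lo ^= data[i].lo;
      *x = gf128_mul (*x, h);
    }
}

/* Xor the final state into the digest block. */
static void
ghash_digest (block16 *digest, const block16 *x)
{
  digest->hi ^= x->hi;
  digest->lo ^= x->lo;
}
```

Since digest is a plain xor at the end, an implementation is free to keep x (and the key tables) in whatever internal byte order suits it, converting only once in digest; that's the motivation for splitting _nettle_ghash_digest out of the update function.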
Would perhaps be good to also delete the code for GCM_TABLE_BITS != 8, which isn't enabled and hasn't been tested in years.
Regards, /Niels