I've done some further updates.
* I've introduced a specialized function gcm_gf_add, used instead of memxor when blocks are aligned; it avoids memxor's looping overhead and, if it is inlined (which I think it should be), the call overhead as well. Current performance on x86_64 is 28.5 cycles per byte with 4-bit tables (the current default) and 8.5 cycles per byte with 8-bit tables, close to a factor of two improvement. (A rough sketch of gcm_gf_add and the block union follows this list.)
* I've introduced a union gcm_block, which is used internally to ensure that the gf elements have the right alignment. Tested on sparc32 and sparc64, which are big-endian and pickier about alignment.
* I've split out the message-independent state into a separate struct gcm_key, which needs to be passed as an argument to all gcm functions.
* I've added a struct gcm_aes_ctx and related functions. This is an all-in-one context, holding the cipher context, the hashing subkey, and the per-message state. (A sketch of how gcm_key and gcm_aes_ctx might fit together also follows the list.)
* I've added support for IVs of arbitrary length, and added the rest of the test cases from http://www.cryptobarn.com/papers/gcm-spec.pdf. (The IV handling is sketched below.)
* I've simplified the configuration of the internal multiplication routines a bit, and rewritten the table generation to use just shifts and adds (as suggested in http://www.cryptobarn.com/papers/gcm-spec.pdf). This means that when tables are used, there's no need to keep the table-free bitwise multiplication function around. (The shift-and-add table setup is sketched below as well.)
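
To make the first two points concrete, here is a minimal sketch of the block union and the specialized add. The field names and the word type are illustrative; the actual code may differ in details (e.g. it may use unsigned long words to cover 32-bit platforms):

  #include <stdint.h>

  #define GCM_BLOCK_SIZE 16

  /* Overlaying a byte view and a word view guarantees word alignment
     for the gf elements. */
  union gcm_block
  {
    uint8_t b[GCM_BLOCK_SIZE];
    uint64_t w[GCM_BLOCK_SIZE / 8];
  };

  /* Addition in GF(2^128) is plain xor.  Working on whole words, with
     no loop, avoids memxor's per-byte work; if the compiler inlines
     it, the call overhead goes away too. */
  static inline void
  gcm_gf_add (union gcm_block *r,
              const union gcm_block *x, const union gcm_block *y)
  {
    r->w[0] = x->w[0] ^ y->w[0];
    r->w[1] = x->w[1] ^ y->w[1];
  }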
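
The struct split could look something like this. Only gcm_key, gcm_aes_ctx and gcm_aes_auth are names mentioned above; the remaining names, the field layout, and the table size here are just illustrative, following the usual naming conventions:

  #include "aes.h"   /* for struct aes_ctx */

  /* Message-independent state: the hashing subkey h, expanded into a
     multiplication table.  Computed once per key, shareable between
     messages. */
  struct gcm_key
  {
    union gcm_block h[1 << 4];   /* size depends on table configuration */
  };

  /* Per-message state. */
  struct gcm_ctx
  {
    union gcm_block iv;          /* initial counter block */
    union gcm_block ctr;         /* current counter block */
    union gcm_block x;           /* hashing state */
    uint64_t auth_size;
    uint64_t data_size;
  };

  /* All-in-one context: cipher, subkey and message state together. */
  struct gcm_aes_ctx
  {
    struct gcm_key key;
    struct gcm_ctx gcm;
    struct aes_ctx cipher;
  };

  void gcm_aes_set_key (struct gcm_aes_ctx *ctx,
                        unsigned length, const uint8_t *key);
  void gcm_aes_set_iv (struct gcm_aes_ctx *ctx,
                       unsigned length, const uint8_t *iv);
  void gcm_aes_auth (struct gcm_aes_ctx *ctx,
                     unsigned length, const uint8_t *data);
  void gcm_aes_encrypt (struct gcm_aes_ctx *ctx,
                        unsigned length, uint8_t *dst, const uint8_t *src);
  void gcm_aes_digest (struct gcm_aes_ctx *ctx,
                       unsigned length, uint8_t *digest);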
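
IV handling follows the spec: a 96-bit IV is used directly with the counter appended, while any other length is run through GHASH. The gcm_hash and gcm_hash_sizes helpers below are placeholders for whatever the internal hashing routines end up being called, and gcm_set_iv itself is only a sketch:

  #include <string.h>

  /* Placeholder declarations for the internal hashing helpers. */
  static void gcm_hash (const struct gcm_key *key, union gcm_block *x,
                        unsigned length, const uint8_t *data);
  static void gcm_hash_sizes (const struct gcm_key *key, union gcm_block *x,
                              uint64_t auth_size, uint64_t data_size);

  static void
  gcm_set_iv (struct gcm_ctx *ctx, const struct gcm_key *key,
              unsigned length, const uint8_t *iv)
  {
    if (length == 12)   /* 96-bit IV, the common case */
      {
        memcpy (ctx->iv.b, iv, 12);
        memset (ctx->iv.b + 12, 0, 3);
        ctx->iv.b[GCM_BLOCK_SIZE - 1] = 1;
      }
    else
      {
        /* Per the spec: GHASH the padded IV, then a final block
           containing the IV length (the helper takes octet counts). */
        memset (ctx->iv.b, 0, GCM_BLOCK_SIZE);
        gcm_hash (key, &ctx->iv, length, iv);
        gcm_hash_sizes (key, &ctx->iv, 0, length);
      }

    ctx->ctr = ctx->iv;
    memset (ctx->x.b, 0, GCM_BLOCK_SIZE);
    ctx->auth_size = ctx->data_size = 0;
  }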
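
And here is the shift-and-add table setup for the 4-bit case, reusing gcm_gf_add and union gcm_block from the first sketch; gcm_gf_shift and gcm_init_table are again just illustrative names:

  /* Multiply by x: a one-bit shift towards higher bit numbers (GCM's
     bit order), folding in the polynomial when a bit falls off. */
  static void
  gcm_gf_shift (union gcm_block *r, const union gcm_block *x)
  {
    unsigned i;
    unsigned carry = 0;

    for (i = 0; i < GCM_BLOCK_SIZE; i++)
      {
        unsigned b = x->b[i];
        r->b[i] = (b >> 1) | (carry << 7);
        carry = b & 1;
      }
    if (carry)
      r->b[0] ^= 0xe1;
  }

  /* 16-entry table for 4-bit multiplication by h: entry 8 is h itself
     (bit 0 is the most significant bit in GCM), the other powers of
     two are produced by shifting, and the rest by adding.  No bitwise
     multiplication routine is needed. */
  static void
  gcm_init_table (union gcm_block *table, const union gcm_block *h)
  {
    unsigned i, j;

    memset (table[0].b, 0, GCM_BLOCK_SIZE);
    table[8] = *h;

    for (i = 4; i > 0; i >>= 1)
      gcm_gf_shift (&table[i], &table[2*i]);

    for (i = 2; i < 16; i <<= 1)
      for (j = 1; j < i; j++)
        gcm_gf_add (&table[i+j], &table[i], &table[j]);
  }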
I think the code is stabilizing a bit now.
One naming question: Should gcm_aes_auth be renamed to gcm_aes_update, for consistency with the other hash and MAC functions? I'm tempted to do that.
Regards, /Niels