I have now read the UMAC spec (RFC 4418) a bit more carefully. I haven't yet read Niko's code (or any other code, for that matter). Some thoughts:
o I don't like the way endian conversion is done in the spec. I'd prefer to think about the various functions as operating on arrays of 32-bit words, and implementation should use integer types of the right size to get correct alignment etc.
o The "NH" function looks like a candidate for for assembly implementation. I don't know if there's anything else in the algorithm which really is performance critical? (And here we get a contradiction to point (1), it may be best for performance to have the NH function get the unaligned byte array as input, do be able to use assembly tricks when reading it into integers. Anyway, we should really avoid byte arrays in the internal interfaces between L1/L2/L3).
o *Maybe* optimization of the L2 and L3 hashes will be important. Profiling is needed, I guess, and they should be optimized *after* L1/NH.
o Since I have been work with side-channel silence recently, it seems natural to try to make the POLY function silent, On the other hand, I'm not sure what the threats are. If the MAC is applied to a secret message, we may leak some information about the message, I guess?
o I think we ought to handle large messages correctly, which means we need the POLY function also over the 128-bit prime. Performance is not terribly important, at least not initially.
o I'm not sure exactly how the building blocks fit together, but we should strive for pipelining. When we have the first message block M_1, apply L1 to that block, then apply L2 and L3 to the output as soon as possible. And for the larger tag lengths, also try to make that looping inside the loop processing the sequence of message blocks, so we can discard M_1 before starting to work on the next block M_2.
Regards, /Niels