nisse@lysator.liu.se (Niels Möller) writes:
To get to understand the algorithm better, I'm writing a toy implementation. I'll let you know when I have a poly128-routine which passes the tests.
Done now. Available at http://www.lysator.liu.se/~nisse/misc/umac-0.1.tar.gz. The code is not seriously optimized, and it supports only 32-bit tags. The functions in poly128.c might be useful.
I'm a bit worried about testing; the test vectors in RFC 4418 seem inadequate for getting confidence in an implementation. I include the README file (an extended version of the notes I mailed earlier) below.
Regards, /Niels
This is a "toy implementation" of UMAC, which I wrote to understand the algorithm, and try to figure out the proper interfaces for implementing it in Nettle. The current code only does the smallest tag size, 32 bits.
Some lessons learned:
* I was initially confused about endianness in the spec. It turns out the data being hashed is read into 32-bit integers using little-endian byte order. *Everything* else uses big-endian conventions.
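The split can be illustrated with a quick check (Python here just for illustration):

```python
import struct

data = bytes([0x01, 0x02, 0x03, 0x04])

# Message data is split into 32-bit words using LITTLE-endian byte
# order...
le_word = struct.unpack('<I', data)[0]   # 0x04030201

# ...while everything else in the spec (lengths, counters, the tag
# itself) uses BIG-endian conventions.
be_word = struct.unpack('>I', data)[0]   # 0x01020304

print(hex(le_word), hex(be_word))
```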
* Testing is poor. There are a lot of corner cases in the algorithm, primarily in the layer-two hashing. First, there are several conditions depending on the message length (no POLY, only POLY64, or a combination of POLY64 and POLY128), and then there are the "marker" words in the POLY algorithm, which are very unlikely to occur for random input and hence a bit tricky to get properly tested.
So to get confidence in any UMAC implementation, a set of known test vectors providing complete code coverage would be highly desirable, but I haven't found any.
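To make the marker corner case concrete, here is a sketch of the POLY routine following the pseudocode in RFC 4418 (section 5.3), operating on a list of already-parsed integer words (function and variable names are my own):

```python
# Largest primes below 2^36, 2^64 and 2^128, per RFC 4418.
PRIME = {36: 2**36 - 5, 64: 2**64 - 59, 128: 2**128 - 159}

def poly(wordbits, maxwordrange, k, words):
    p = PRIME[wordbits]
    offset = 2**wordbits - p
    marker = p - 1
    y = 1
    for m in words:
        if m >= maxwordrange:
            # The rare corner case: an out-of-range word is replaced
            # by the in-range pair (marker, m - offset).  Random test
            # inputs almost never exercise this branch.
            y = (k * y + marker) % p
            y = (k * y + (m - offset)) % p
        else:
            y = (k * y + m) % p
    return y
```

For POLY64 as used in the layer-two hash, maxwordrange is 2^64 - 2^32, so a random 64-bit word triggers the marker branch with probability about 2^-32.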
* The nh function is expected to be the main workhorse (but I haven't profiled anything), so it's a candidate for assembly implementation. To avoid reading the data (and dealing with unaligned access and endianness) more than once, it would make sense to have an nh function which takes the tag length as an argument and does all the needed processing of a single block. An alternative is to copy the input block into an aligned uint32_t array before calling nh, but I think it's better not to make any extra copy of the input, and just process it on the fly as it is read.
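For reference, the NH inner loop (RFC 4418, section 5.1) looks roughly like this, sketched over already-parsed 32-bit words; a real implementation would parse the bytes little-endian on the fly, and the layer-one hash additionally folds in the message bit length, which I omit here:

```python
M32 = 0xffffffff
M64 = 0xffffffffffffffff

def nh(key_words, msg_words):
    # NH over 32-bit words: additions are mod 2^32, each product is a
    # full 64-bit multiply, and the sum is mod 2^64.  len(msg_words)
    # must be a multiple of 8.
    y = 0
    for j in range(0, len(msg_words), 8):
        for i in range(4):
            a = (msg_words[j + i] + key_words[j + i]) & M32
            b = (msg_words[j + i + 4] + key_words[j + i + 4]) & M32
            y = (y + a * b) & M64
    return y
```

The pairing of word i with word i+4 is what makes a SIMD or assembly version attractive: four independent multiply-accumulate chains per 8-word group.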
* Optimization of the layer two hash, in particular the poly64 and poly128 operations, may be important. Profiling will tell.
* Since I have been working with side-channel silence recently, it seems natural to try to make the POLY function silent. On the other hand, I'm not sure what the threats are. If the MAC is applied to a secret message, we may leak some information about the message, I guess. Is that a problem worth bothering about?
* One can expect the POLY128 operations, used for large messages, to be roughly four times slower than POLY64 (quadratic multiplication). Benchmarking is needed to see if it makes a noticeable difference; layer-two hashing may still be a small fraction of the work done in layer one.
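The factor of four is just the schoolbook decomposition: one 128x128-bit multiply splits into four 64x64-bit partial products (plus shifts and adds), as this small check demonstrates:

```python
B = 2**64

def mul128(a, b):
    # Split each 128-bit operand into high and low 64-bit halves.
    a1, a0 = divmod(a, B)
    b1, b0 = divmod(b, B)
    # Four 64x64 -> 128-bit partial products, recombined.
    return (a1 * b1) * B * B + (a1 * b0 + a0 * b1) * B + a0 * b0

x = 0x0123456789abcdef_fedcba9876543210
y = 0x0f1e2d3c4b5a6978_8796a5b4c3d2e1f0
assert mul128(x, y) == x * y
```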
About the proper interface (primarily for Nettle).
* It's natural to provide the nonce at the time the digest (or "tag") is extracted, since (i) that's when it is needed, and (ii) for other MACs, that's the only function which is called precisely once per message. (The current code doesn't do that.)
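A hypothetical interface sketch of what I mean (the class and method names are made up for illustration, the internals are stubbed out, and this is not the eventual Nettle API):

```python
class UMAC32:
    # Interface sketch only: no real UMAC computation happens here.
    def __init__(self, key):
        self.key = key          # subkey setup would happen here
        self.buffer = b''

    def update(self, data):
        # May be called any number of times per message.
        self.buffer += data

    def digest(self, nonce):
        # The nonce is supplied here, at tag-extraction time: the one
        # call made exactly once per message, matching how other MAC
        # interfaces work.
        tag = (self.key, nonce, self.buffer)  # stub, not a real tag
        self.buffer = b''                     # reset for next message
        return tag
```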
* It's desirable to keep the underlying AES cipher separate from the rest of the UMAC code, and provide it only for key setup. But that doesn't work, since nonce handling also needs the block cipher. I think it would be possible to still use some abstract KDF to provide the key, but that seems a bit pointless if we hard-code use of AES anyway.
* I'm leaning towards having separate types and functions for UMAC-AES-32, UMAC-AES-64, UMAC-AES-96 and UMAC-AES-128, which would all use common primitives for layer 1, 2 and 3 hashing. The primitives should be independent of AES. It would then be easy to define UMAC with other ciphers and KDFs if ever needed (and as far as I understand, a *block* cipher shouldn't be strictly necessary, since it's essentially used in CTR mode anyway, e.g., UMAC-SALSA20 should be possible).
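The reason a block cipher isn't strictly necessary can be seen from the shape of the subkey derivation: it's just counter-mode keystream generation. A sketch, parameterized over any 16-byte PRF (the index/counter block layout follows my reading of the KDF pseudocode in RFC 4418 section 3.2.1, so treat it as an assumption; the toy PRF stand-in is hashlib-based, NOT AES):

```python
import hashlib
from math import ceil

BLOCKLEN = 16  # AES block size in the real algorithm

def kdf(prf, index, numbytes):
    # Counter-mode subkey generation: each block enciphers
    # index || counter, both big-endian.  Any keyed 16-byte PRF
    # would do here, which is why a *block* cipher is not strictly
    # required.
    out = b''
    for i in range(1, ceil(numbytes / BLOCKLEN) + 1):
        block = index.to_bytes(BLOCKLEN - 8, 'big') + i.to_bytes(8, 'big')
        out += prf(block)
    return out[:numbytes]

def toy_prf(block):
    # Stand-in PRF for demonstration only (NOT AES).
    return hashlib.sha256(b'toy key' + block).digest()[:BLOCKLEN]
```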
Or we can have a common UMAC-AES type, with the digest size as an argument to key setup. For smaller tag sizes, we then allocate space for more subkeys than needed. This is a drawback shared with, e.g., the current AES code in Nettle. There's also some additional non-key state for larger tag sizes (running variables for the polynomials), but that's a small issue.
* Speaking of storage micro-optimizations, it might make sense to have poly64 and poly128 share storage, or to combine them into a "layer-two hash" abstraction.
* To improve performance for small messages, it would make some sense to compute the subkeys lazily, and not compute keys which aren't going to be used.
* In some ways it would be nice and proper to separate the subkeys, which are constant while messages are being processed, from the per-message state (this applies also to the HMAC code). One potential use would be to have a single key and several threads using it to process messages. I'm also not sure if that is worth bothering about. If we go that way, it more or less rules out lazy computation of subkeys.