I'm rewriting the cast128 key schedule to get rid of false warnings, avoid lots of conditions, and separate the rotation and the mask subkeys.
Then I noticed a portability problem with the rotation macros,
#define ROTL32(n,x) (((x)<<(n)) | ((x)>>(32-(n))))
For n == 0, this will work on most machines, but it's not portable: x >> 32 gives undefined behaviour according to the C spec when x is a 32-bit type, since the shift count must be smaller than the width of the operand. (On typical hardware, the result of x >> 32 will be either x or 0, and the rotation macro gives the intended result in either case.)
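To make the n == 0 case concrete, here is a minimal sketch of the right-shift count the macro above ends up using. The helper names old_right_count and count_is_defined are made up for this illustration and are not part of nettle.

```c
#include <stdint.h>

/* Hypothetical helper: the right-shift count that
   ((x)>>(32-(n))) uses for a given rotation count n. */
static unsigned
old_right_count(unsigned n)
{
  return 32 - n;
}

/* Shifting a 32-bit value is only defined for counts 0..31. */
static int
count_is_defined(unsigned count)
{
  return count < 32;
}
```

For n in 1..31 the count stays in range; only n == 0 produces the out-of-range count 32.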
In most of nettle, there's no problem, because rotation counts are constant and non-zero.
cast128 is an exception, with key-dependent rotation counts, which can well be zero (don't know if that's exercised by the test suite, though).
A fix is to redefine the macro as
#define ROTL32(n,x) (((x)<<(n)) | ((x)>>((-(n))&31)))
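A small sketch of how this behaves with a variable count, assuming n is an unsigned value in the range 0..31 and x is a uint32_t. The wrapper rotl32_var is a hypothetical name used only for illustration. The point is that (-n) & 31 equals 32 - n for n in 1..31, and 0 for n == 0, so both shift counts always stay in 0..31.

```c
#include <stdint.h>

/* Portable variant: the right-shift count (-(n)) & 31 is always 0..31. */
#define ROTL32(n,x) (((x)<<(n)) | ((x)>>((-(n))&31)))

/* Hypothetical wrapper, for illustration only. Taking n as unsigned
   keeps the negation well defined; callers must pass n in 0..31. */
static uint32_t
rotl32_var(unsigned n, uint32_t x)
{
  return ROTL32(n, x);
}
```

With n == 0 both shifts are by zero, so the result is simply x | x, which is x.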
It should make no difference when n is constant, but for cast128, this portability fix makes the code almost 20% slower. Apparently gcc doesn't recognize this as a rotate. I just filed a bug report at
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=57157
Is there any other trick I'm missing, which is portable C but which doesn't slow it down when compiled with gcc?
Regards, /Niels