I think I will leave the curve25519 and eddsa code for now, even though there are several important optimizations left to do (see the just updated http://www.lysator.liu.se/~nisse/nettle/plan.html).
I think it's about time to do fat binaries. To make progress, I think it's best to start with something simple, relying on __attribute__((constructor)) and/or __attribute__((ifunc(...))).
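To make the constructor variant concrete, here is a minimal sketch, assuming GCC or a compatible compiler. All names (fat_init, fat_memxor, memxor_vec, memxor_generic, memxor_sse2_standin) are made up for illustration and are not existing nettle symbols; the "sse2" function is a plain C stand-in for what would be an assembly routine. The function pointer starts out pointing at the generic code, so calls made before the constructor has run still work.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical function type and names, for illustration only. */
typedef void *memxor_func(void *dst, const void *src, size_t n);

/* Portable byte-at-a-time version; always safe to call. */
static void *
memxor_generic(void *dst, const void *src, size_t n)
{
  uint8_t *d = dst;
  const uint8_t *s = src;
  size_t i;

  for (i = 0; i < n; i++)
    d[i] ^= s[i];
  return dst;
}

/* Stand-in for an sse2 assembly routine. In a real fat build this
   would be a separate assembly file; here it just reuses the C code. */
static void *
memxor_sse2_standin(void *dst, const void *src, size_t n)
{
  return memxor_generic(dst, src, n);
}

/* Defaults to the generic code, so the entry point is usable even if
   it is called before fat_init has run. */
static memxor_func *memxor_vec = memxor_generic;

/* Runs at load time and selects an implementation based on cpu features. */
static void __attribute__((constructor))
fat_init(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
  __builtin_cpu_init();
  if (__builtin_cpu_supports("sse2"))
    memxor_vec = memxor_sse2_standin;
#endif
}

/* Public entry point: one indirect call through the selected pointer. */
void *
fat_memxor(void *dst, const void *src, size_t n)
{
  return memxor_vec(dst, src, n);
}
```

The ifunc variant would move the selection into a resolver function that the dynamic linker calls, which avoids the extra indirect call but is less portable.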
For the case of memxor (where on x86_64, the fat binary mechanism needs to select between sse2 and non-sse2 code), I'm also considering some reorganization:
* Use smaller assembly routines doing one case each, and let the main entry point always be C code which can sort out the different cases and handle bytes at the beginning and end of the buffer.
* Fix the cases where the current code reads a few bytes outside of input buffers (but luckily without crossing word boundaries, iirc).
* Add some internal entry points, for cases where alignment is known by the caller.
I think some additional overhead is acceptable for the cases of small badly aligned buffers, if one can gain cleaner or more efficient handling of the other cases.
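A rough sketch of the reorganization outlined above, under some loud assumptions: the names (memxor_sketch, memxor_aligned_words, memxor_bytes, word_t) are hypothetical, the word loop is plain C standing in for a small assembly routine, and the case where source and destination have different alignment simply falls back to the byte loop, which a real implementation would handle with word reads and shifts. The C entry point handles the bytes at the beginning and end of the buffer, and never reads outside of the input.

```c
#include <stddef.h>
#include <stdint.h>

/* GCC-specific: may_alias lets us access byte buffers as words
   without violating the strict aliasing rules. */
typedef unsigned long __attribute__((may_alias)) word_t;

#define ALIGN_OFFSET(p) ((uintptr_t) (p) % sizeof(word_t))

/* Internal entry point (hypothetical): both pointers are word aligned
   and n counts words. Callers that already know the alignment could
   call this directly; in a fat build it could be an assembly routine. */
static void
memxor_aligned_words(word_t *dst, const word_t *src, size_t n)
{
  size_t i;

  for (i = 0; i < n; i++)
    dst[i] ^= src[i];
}

/* Byte loop, used for the unaligned head and tail, and as the
   fallback for small or differently aligned buffers. */
static void
memxor_bytes(uint8_t *dst, const uint8_t *src, size_t n)
{
  size_t i;

  for (i = 0; i < n; i++)
    dst[i] ^= src[i];
}

/* Main entry point in C: sorts out the cases and touches only
   bytes inside [src, src + n) and [dst, dst + n). */
void *
memxor_sketch(void *dst_in, const void *src_in, size_t n)
{
  uint8_t *dst = dst_in;
  const uint8_t *src = src_in;

  if (n >= 2 * sizeof(word_t) && ALIGN_OFFSET(dst) == ALIGN_OFFSET(src))
    {
      /* Bytes needed to reach the next word boundary (0 if aligned). */
      size_t head = (sizeof(word_t) - ALIGN_OFFSET(dst)) % sizeof(word_t);
      size_t words;

      memxor_bytes(dst, src, head);
      dst += head; src += head; n -= head;

      words = n / sizeof(word_t);
      memxor_aligned_words((word_t *) dst, (const word_t *) src, words);
      dst += words * sizeof(word_t);
      src += words * sizeof(word_t);
      n -= words * sizeof(word_t);
    }

  /* Tail bytes, or the whole buffer in the fallback case. */
  memxor_bytes(dst, src, n);
  return dst_in;
}
```

In this sketch the small and badly aligned buffers pay for the extra test and function calls, which is the kind of overhead I'd consider acceptable.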
Regards, /Niels