I've done some hacking to implement the skein hash function (one of the sha3 candidates, see https://en.wikipedia.org/wiki/Skein_%28hash_function%29). Development is done on the "skein" branch of the repo.
At a low level it has some similarities to salsa20 and chacha: An input block is mangled in a perfectly invertible way using only add, xor and rotation, and the result is then combined with the original input (by addition in salsa20 and chacha, by xor in skein), which destroys invertibility. And then this primitive is used to construct a hash function.
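To make the structure concrete, here is a minimal sketch in C, not the code on the "skein" branch. The add/rotate/xor step has the shape of the threefish MIX operation, but the two-word state, the four rounds, the rotation counts and all function names (mix, unmix, toy_permute, toy_unpermute, toy_compress) are made up for illustration. The feed-forward uses xor, as skein does; salsa20 and chacha use word-wise addition there instead.

```c
#include <assert.h>
#include <stdint.h>

/* Rotation count n must satisfy 0 < n < 64. */
#define ROTL64(x, n) (((x) << (n)) | ((x) >> (64 - (n))))
#define ROTR64(x, n) (((x) >> (n)) | ((x) << (64 - (n))))

/* One add, one rotate, one xor on a pair of 64-bit words. */
static void
mix (uint64_t *x0, uint64_t *x1, unsigned r)
{
  *x0 += *x1;
  *x1 = ROTL64 (*x1, r) ^ *x0;
}

/* Exact inverse of mix: undo the xor and rotate, then the add. */
static void
unmix (uint64_t *x0, uint64_t *x1, unsigned r)
{
  *x1 = ROTR64 (*x1 ^ *x0, r);
  *x0 -= *x1;
}

/* Illustrative rotation counts, not a real threefish schedule. */
static const unsigned toy_rot[4] = { 14, 52, 23, 5 };

/* Toy invertible permutation: mix, then swap the two words. */
static void
toy_permute (uint64_t s[2])
{
  unsigned i;
  for (i = 0; i < 4; i++)
    {
      uint64_t t;
      mix (&s[0], &s[1], toy_rot[i]);
      t = s[0]; s[0] = s[1]; s[1] = t;
    }
}

/* Inverse permutation: same rounds backwards, swap before unmix. */
static void
toy_unpermute (uint64_t s[2])
{
  unsigned i;
  for (i = 4; i-- > 0; )
    {
      uint64_t t = s[0]; s[0] = s[1]; s[1] = t;
      unmix (&s[0], &s[1], toy_rot[i]);
    }
}

/* Feed-forward: out = permute(in) xor in. Without knowing in, the
   permutation can no longer simply be run backwards from out. */
static void
toy_compress (const uint64_t in[2], uint64_t out[2])
{
  uint64_t s[2] = { in[0], in[1] };
  toy_permute (s);
  out[0] = s[0] ^ in[0];
  out[1] = s[1] ^ in[1];
}

/* Sanity check that the permutation really is invertible. */
static void
toy_check_roundtrip (void)
{
  uint64_t s[2] = { 0x0123456789abcdefULL, 0xfedcba9876543210ULL };
  toy_permute (s);
  toy_unpermute (s);
  assert (s[0] == 0x0123456789abcdefULL);
  assert (s[1] == 0xfedcba9876543210ULL);
}
```

Everything up to toy_permute can be undone step by step; only the last two lines of toy_compress make the function one-way.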
I've implemented skein256, both for portable C and x86_64 assembly. The current implementation is written for small footprint in both code and data; it could perhaps be made a little faster with more unrolling.
On x86_64, it's slightly slower than sha1, slightly faster than sha512, and considerably faster than sha256 and sha3 (and these relations are likely to be similar on all platforms with a reasonable number of 64-bit registers). Performance is a bit behind the numbers reported in the skein paper, http://www.skein-hash.info/sites/default/files/skein1.3.pdf; not sure if the difference is due to unrolling, different benchmarking machines, or additional implementation tricks. (On my current x86_64 machines, benchmarking using nettle-benchmark is not very accurate; numbers vary quite a bit from one run to the next.)
And it's a pity there's no easy way to rotate different pieces of an xmm register by different counts; that makes it hard to take advantage of the parts of the skein/threefish round which do fit SIMD. So the design is a lot less SIMD-friendly than chacha.
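A rough illustration of the rotation problem, as a sketch using SSE2 intrinsics rather than anything from the actual assembly: the 64-bit shift instructions apply one count to both lanes of an xmm register, so rotating the two lanes by different counts takes two full rotates plus a select. The helper names (rotl64x2, rotl64x2_mixed) and the counts 14 and 16 are only examples.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Rotate both 64-bit lanes left by the same count n, 0 < n < 64.
   The shift count is shared by the two lanes. */
static __m128i
rotl64x2 (__m128i x, int n)
{
  return _mm_or_si128 (_mm_sll_epi64 (x, _mm_cvtsi32_si128 (n)),
                       _mm_srl_epi64 (x, _mm_cvtsi32_si128 (64 - n)));
}

/* Rotate lane 0 by r0 and lane 1 by r1: rotate everything twice,
   then keep the low lane of one result and the high lane of the
   other. Half of the rotate work is thrown away. */
static __m128i
rotl64x2_mixed (__m128i x, int r0, int r1)
{
  const __m128i low_mask = _mm_set_epi64x (0, -1); /* ones in lane 0 */
  __m128i a = rotl64x2 (x, r0);
  __m128i b = rotl64x2 (x, r1);
  return _mm_or_si128 (_mm_and_si128 (low_mask, a),
                       _mm_andnot_si128 (low_mask, b));
}

/* Compare against plain scalar rotates. */
static void
check_rotl64x2_mixed (void)
{
  uint64_t in[2] = { 0x0123456789abcdefULL, 0xfedcba9876543210ULL };
  uint64_t out[2];
  __m128i v = _mm_loadu_si128 ((const __m128i *) in);
  _mm_storeu_si128 ((__m128i *) out, rotl64x2_mixed (v, 14, 16));
  assert (out[0] == ((in[0] << 14) | (in[0] >> 50)));
  assert (out[1] == ((in[1] << 16) | (in[1] >> 48)));
}
```

In chacha, every word in a vector is rotated by the same count, so a single rotl64x2-style rotate (with 32-bit lanes) does useful work in all lanes.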
skein512 is in progress; there it's not possible to fit all inputs in registers (unless one can make good use of SIMD registers in some way), and I haven't yet figured out a good way to organize it for performance.
Skein can be used in more ways than as a plain hash function, e.g., as a mac or a key derivation function. I don't plan to implement such additional skein features until there's some clear use case for them.
Regards, /Niels