The code is taken from https://github.com/floodyberry/chacha-opt (released by the author as public-domain-or-MIT, so I guess it is ok to borrow).
On x86/sse2 and x86_64 it is 80 to 100% faster than the current code.
Passes the regression tests on linux/debian/stretch x86 and x86_64; benchmarks were run with a patched nettle-3.4.1 (due to the ABI break in 3.5). *Not* tested on win{32,64}. This matters for win64 in particular, since its calling convention differs from the SysV one (different argument registers, and xmm6-xmm15 are callee-saved).
chacha-opt also contains x86{,_64}-{ssse3,avx{,2},xop} optimized code, but I don't have the hardware to test it (and there are differences in structure/argument layout that need to be corrected and tested).
This is WIP; I will add armv6 and arm/neon a bit later.
P.S. After that I will probably take a look at poly1305 and likely try to borrow license-compatible arm asm from somewhere (the current nettle code is painfully slow). gcrypt is somewhat faster than nettle and is LGPLv2.1+. CRYPTOGAMS definitely has the fastest crypto, but it is BSD-3-clause-or-GPLv2+; AFAIK that is compatible with the LGPL, but I am not sure whether it is acceptable for inclusion in nettle.
P.P.S. The previously posted arm neon gcm patch breaks x86_64 compilation; I will post a trivial fix later.