Martin Storsjö martin@martin.st writes:
--- a/x86_64/sha3-permute.asm
+++ b/x86_64/sha3-permute.asm
BTW, this file really needs a rewrite. It runs much slower than the C version on some (or all?) AMD processors, probably because the movq/movd instructions that move data between general registers and xmm registers have a large latency penalty. One would need either to move data via memory (maybe with a separate permute/rotate pass working with general registers and memory), or to squeeze (almost) all state into the xmm registers, a bit like the ARM NEON sha3 code I wrote the other week.
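To make the first option a bit more concrete, here is a rough C sketch of a separate rotate/permute pass (the rho and pi steps of Keccak-f[1600]) that touches only memory and general-purpose registers. It is an illustration only, not code from the patch or from the existing assembly; the names rotl64 and sha3_rho_pi are made up for this example, and it assumes lane (x, y) is stored at index x + 5*y.

```c
#include <stdint.h>

/* Rotate a 64-bit lane left by n bits (0 <= n < 64). */
static uint64_t
rotl64 (uint64_t x, unsigned n)
{
  n &= 63;
  return (x << n) | (x >> ((64 - n) & 63));
}

/* Hypothetical helper: combined rho (rotate) and pi (permute) steps,
   done in place on the 25 lanes in memory. Each lane is loaded into a
   general register, rotated there, and stored to its new position, so
   nothing passes between xmm and general registers.
   Assumes lane (x, y) lives at state[x + 5*y]. */
static void
sha3_rho_pi (uint64_t state[25])
{
  /* Rotation count applied to the lane moved in step i. */
  static const unsigned char rot[24] = {
     1,  3,  6, 10, 15, 21, 28, 36, 45, 55,  2, 14,
    27, 41, 56,  8, 25, 43, 62, 18, 39, 61, 20, 44
  };
  /* Destination index of the lane moved in step i. The 24 non-zero
     lanes form a single cycle; lane 0 stays where it is. */
  static const unsigned char dst[24] = {
    10,  7, 11, 17, 18,  3,  5, 16,  8, 21, 24,  4,
    15, 23, 19, 13, 12,  2, 20, 14, 22,  9,  6,  1
  };
  uint64_t t = state[1];
  unsigned i;

  for (i = 0; i < 24; i++)
    {
      uint64_t next = state[dst[i]];   /* save the lane being overwritten */
      state[dst[i]] = rotl64 (t, rot[i]);
      t = next;
    }
}
```

Whether a pass like this actually beats the current code on AMD would have to be measured; the sketch only shows the data flow.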
Regards, /Niels