nisse@lysator.liu.se (Niels Möller) writes:
> Radix 64: 2.75 GByte/s, i.e., faster than current x86_64 asm version.
And I've now tried the same method for the x86_64 implementation. See the attached file, plus the needed patch to asm.m4. This gives 2.9 GByte/s.
I'm not entirely sure the cycle numbers are accurate, since the clock frequency isn't fixed. I think the machine runs benchmarks at 2.1 GHz, in which case this corresponds to 11.5 cycles per block, 0.7 cycles per byte, 4 instructions per cycle, and 0.5 multiply instructions per cycle.
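For anyone who wants to redo the arithmetic, here is a rough sketch. It assumes 1 GByte = 10^9 bytes and 16-byte blocks (the block size is not stated above, but it is what 11.5 cycles per block and 0.7 cycles per byte imply together). The per-block instruction and multiply counts are derived backwards from the quoted per-cycle rates, not counted from the code.

```python
# Figures quoted in the message.
throughput_bytes_per_s = 2.9e9  # 2.9 GByte/s, assuming 1 GByte = 10**9 bytes
clock_hz = 2.1e9                # 2.1 GHz, the estimated benchmark clock
block_bytes = 16                # assumption, implied by 11.5 / 0.7

cycles_per_byte = clock_hz / throughput_bytes_per_s
cycles_per_block = cycles_per_byte * block_bytes

# Working backwards from the quoted per-cycle rates.
instructions_per_block = 4 * cycles_per_block
multiplies_per_block = 0.5 * cycles_per_block

print(f"{cycles_per_byte:.2f} cycles/byte")
print(f"{cycles_per_block:.1f} cycles/block")
print(f"{instructions_per_block:.0f} instructions/block (approx.)")
print(f"{multiplies_per_block:.1f} multiplies/block (approx.)")
```

With these inputs the result is roughly 0.72 cycles per byte and 11.6 cycles per block, in line with the rounded figures above. If the real clock during the benchmark is higher than 2.1 GHz, all the per-cycle numbers drop proportionally.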
This laptop has an AMD Zen 2 processor, which should be capable of issuing four instructions per cycle and completing one multiply instruction per cycle (according to https://gmplib.org/~tege/x86-timing.pdf).
This seems to indicate that on this hardware, speed is not limited by multiplier throughput; instead, the bottleneck is instruction decoding/issuing, at a maximum of four instructions per cycle.
I also benchmarked on my other nearby x86_64 machine (Intel Broadwell processor). It's faster there too (from 1.4 GByte/s to 1.75 GByte/s). I'd expect it to be generally faster, and have pushed it to the master-updates branch.
I haven't looked that carefully at what the old code was doing, but I think the final folding for each block used a multiply instruction that depended on the previous ones for that block, increasing the per-block latency. With the new code, all multiplies done for a block are independent of each other.
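To illustrate the difference between a dependent folding multiply and independent multiplies, here is a minimal sketch. It is not the assembly code discussed above. The modulus 2^130 - 5, the two-limb radix-2^64 layout, the requirement that r1 be a multiple of 4, and both function names are assumptions chosen for illustration. Python integers stand in for machine words, so carry handling is left out.

```python
import random

P = (1 << 130) - 5  # assumed modulus, for illustration only


def mul_dependent_fold(h0, h1, r0, r1):
    """Full product first, then fold the high part back in.

    The multiply by 5 needs the high part of the product, so it cannot
    start until the earlier multiplies have finished.
    """
    h = h0 | (h1 << 64)
    r = r0 | (r1 << 64)
    t = h * r
    lo = t & ((1 << 130) - 1)
    hi = t >> 130
    return (lo + 5 * hi) % P  # uses 2**130 == 5 (mod P)


def mul_independent(h0, h1, r0, r1):
    """Same result, but every product takes only input limbs.

    Requires r1 % 4 == 0, so that h1*r1*2**128 == h1*(r1/4)*2**130,
    which is congruent to h1 * (5*(r1/4)) mod P.
    """
    s1 = 5 * (r1 >> 2)  # depends only on r, so it can be precomputed
    p00 = h0 * r0
    p11 = h1 * s1
    p01 = h0 * r1
    p10 = h1 * r0
    # None of the four products above uses the result of another.
    return (p00 + p11 + ((p01 + p10) << 64)) % P


# Quick self-check on random inputs (r1 forced to a multiple of 4).
random.seed(1)
for _ in range(1000):
    h0, h1, r0 = (random.getrandbits(64) for _ in range(3))
    r1 = random.getrandbits(62) << 2
    assert mul_independent(h0, h1, r0, r1) == mul_dependent_fold(h0, h1, r0, r1)
```

In the second version the fold constant is absorbed into a precomputed value, so the processor is free to issue all four multiplies back to back, which is the kind of independence described above.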
Regards, /Niels