Niels Möller <nisse@lysator.liu.se> writes:
> It could likely be sped up further by processing 2, 3 or 4 blocks in parallel.
> I've given 2 blocks in parallel a try, but it's not quite working yet. My work-in-progress code is below.
I've got it into working shape now, at least for little-endian. See https://git.lysator.liu.se/nettle/nettle/-/blob/ppc-chacha-2core/powerpc64/p...
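To make "2 blocks in parallel" concrete, here is a minimal portable C sketch. It is not the code on the ppc-chacha-2core branch; it only illustrates the data flow. The function names chacha_core_1way and chacha_core_2way are hypothetical, and the sketch assumes a 64-bit block counter held in state words 12 and 13. The two blocks share key and nonce and differ only in the counter, so their quarter rounds are independent and can be interleaved (or, in a vector implementation, placed in separate lanes).

```c
#include <stdint.h>
#include <string.h>

#define ROTL32(x, n) (((x) << (n)) | ((x) >> (32 - (n))))

/* One ChaCha quarter round on words a, b, c, d of state s. */
#define QROUND(s, a, b, c, d) do {                          \
    s[a] += s[b]; s[d] ^= s[a]; s[d] = ROTL32(s[d], 16);    \
    s[c] += s[d]; s[b] ^= s[c]; s[b] = ROTL32(s[b], 12);    \
    s[a] += s[b]; s[d] ^= s[a]; s[d] = ROTL32(s[d], 8);     \
    s[c] += s[d]; s[b] ^= s[c]; s[b] = ROTL32(s[b], 7);     \
  } while (0)

/* Reference: one block at a time. */
void
chacha_core_1way(uint32_t out[16], const uint32_t in[16], unsigned rounds)
{
  uint32_t x[16];
  unsigned i;

  memcpy(x, in, sizeof(x));
  for (i = 0; i < rounds; i += 2)
    {
      /* Column round. */
      QROUND(x, 0, 4,  8, 12); QROUND(x, 1, 5,  9, 13);
      QROUND(x, 2, 6, 10, 14); QROUND(x, 3, 7, 11, 15);
      /* Diagonal round. */
      QROUND(x, 0, 5, 10, 15); QROUND(x, 1, 6, 11, 12);
      QROUND(x, 2, 7,  8, 13); QROUND(x, 3, 4,  9, 14);
    }
  for (i = 0; i < 16; i++)
    out[i] = x[i] + in[i];
}

/* Sketch: two consecutive blocks, with their quarter rounds interleaved.
   Block 0 uses the counter in `in`, block 1 uses counter + 1.
   Assumption: 64-bit counter in words 12 (low) and 13 (high). */
void
chacha_core_2way(uint32_t out0[16], uint32_t out1[16],
                 const uint32_t in[16], unsigned rounds)
{
  uint32_t in1[16], a[16], b[16];
  unsigned i;

  memcpy(in1, in, sizeof(in1));
  in1[12] += 1;
  if (in1[12] == 0)
    in1[13] += 1;               /* Carry into the high counter word. */

  memcpy(a, in, sizeof(a));
  memcpy(b, in1, sizeof(b));

  for (i = 0; i < rounds; i += 2)
    {
      /* Column round, alternating between the two independent states. */
      QROUND(a, 0, 4,  8, 12); QROUND(b, 0, 4,  8, 12);
      QROUND(a, 1, 5,  9, 13); QROUND(b, 1, 5,  9, 13);
      QROUND(a, 2, 6, 10, 14); QROUND(b, 2, 6, 10, 14);
      QROUND(a, 3, 7, 11, 15); QROUND(b, 3, 7, 11, 15);
      /* Diagonal round. */
      QROUND(a, 0, 5, 10, 15); QROUND(b, 0, 5, 10, 15);
      QROUND(a, 1, 6, 11, 12); QROUND(b, 1, 6, 11, 12);
      QROUND(a, 2, 7,  8, 13); QROUND(b, 2, 7,  8, 13);
      QROUND(a, 3, 4,  9, 14); QROUND(b, 3, 4,  9, 14);
    }
  for (i = 0; i < 16; i++)
    {
      out0[i] = a[i] + in[i];
      out1[i] = b[i] + in1[i];
    }
}
```

In plain C the interleaving is only a hint to the compiler; any actual gain comes from keeping both states in registers and filling pipeline slots that a single dependency chain would leave idle, which is why it has to be measured.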
Next steps:
1. Fix it to also work on big-endian.
2. Wire it up for fat builds.
3. Try whether 4-way gives an additional speedup.
Benchmarking is appreciated. Compare the master branch to the ppc-chacha-2core branch, configured with --enable-power-altivec, and run ./examples/nettle-benchmark chacha.
Regards, /Niels