On Wed, Nov 25, 2020 at 3:22 AM Niels Möller nisse@lysator.liu.se wrote:
Maamoun TK maamoun.tk@googlemail.com writes:
On POWER9 I got the following benchmark result:
./configure:
  chacha encrypt 308.58
  chacha decrypt 325.87
./configure --enable-power-altivec "master branch":
  chacha encrypt 342.15
  chacha decrypt 356.24
./configure --enable-power-altivec "ppc-chacha-2core":
  chacha encrypt 648.97
  chacha decrypt 648.00
It's gotten better with every further optimization on the core, great work.
Nice. So almost a factor 2 speedup from doing 2 blocks in parallel. I wonder if one can get close to another factor of two by going to 4 blocks. I hope to get the time to try that out, it should be fairly easy. (And if that does work out fine, maybe the code to do only 2 blocks could be removed).
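To make the "N blocks in parallel" idea concrete, here is a rough sketch in Python rather than POWER assembly. Each state word is held as a list of lanes, one lane per block, so every add/xor/rotate step is applied to all blocks at once, the way a vector instruction would do it. The function names are hypothetical, and this is only an illustration of the data layout, not the code in the ppc-chacha-2core branch.

```python
MASK32 = 0xFFFFFFFF


def rotl32(x, n):
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) | (x >> (32 - n))) & MASK32


def quarter_round(a, b, c, d):
    """Scalar ChaCha quarter round on four 32-bit words (one block)."""
    a = (a + b) & MASK32
    d = rotl32(d ^ a, 16)
    c = (c + d) & MASK32
    b = rotl32(b ^ c, 12)
    a = (a + b) & MASK32
    d = rotl32(d ^ a, 8)
    c = (c + d) & MASK32
    b = rotl32(b ^ c, 7)
    return a, b, c, d


def quarter_round_lanes(a, b, c, d):
    """Same quarter round, but a, b, c, d are lists with one lane per block.

    Each list comprehension stands in for a single vector instruction
    that operates on the corresponding word of every block at once.
    """
    a = [(x + y) & MASK32 for x, y in zip(a, b)]
    d = [rotl32(x ^ y, 16) for x, y in zip(d, a)]
    c = [(x + y) & MASK32 for x, y in zip(c, d)]
    b = [rotl32(x ^ y, 12) for x, y in zip(b, c)]
    a = [(x + y) & MASK32 for x, y in zip(a, b)]
    d = [rotl32(x ^ y, 8) for x, y in zip(d, a)]
    c = [(x + y) & MASK32 for x, y in zip(c, d)]
    b = [rotl32(x ^ y, 7) for x, y in zip(b, c)]
    return a, b, c, d
```

With 2 lanes this corresponds to the 2-block layout, with 4 lanes to the 4-block one. The number of operations per quarter round stays the same; only the amount of data covered by each operation grows.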
Botan and Crypto++ use 4x blocks. They usually hit about the same benchmark numbers.
For Crypto++ on GCC112, mixed message sizes:
* ChaCha20: 1200 MB/s, 2.9 cpb
* ChaCha8: 2370 MB/s, 1.5 cpb
On an antique PowerMac G5:
* ChaCha20: 400 MB/s, 4.9 cpb
* ChaCha8: 725 MB/s, 2.6 cpb
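For anyone comparing these figures against numbers reported only in MB/s or only in cpb, the two are linked through the clock frequency: cpb = clock (Hz) / throughput (bytes/s). The sketch below is only a convenience for that conversion. The helper names are hypothetical, and it assumes 1 MB = 10^6 bytes, which may not match how a given benchmark defines MB. The clock speeds of the machines are not stated above, so the second helper works backwards from a (MB/s, cpb) pair to the clock that pair implies.

```python
def cycles_per_byte(mb_per_s, clock_hz, bytes_per_mb=1_000_000):
    """Convert a throughput in MB/s to cycles per byte at a given clock.

    bytes_per_mb is an assumption; pass 1_048_576 if the benchmark
    reports MiB/s.
    """
    return clock_hz / (mb_per_s * bytes_per_mb)


def implied_clock_hz(mb_per_s, cpb, bytes_per_mb=1_000_000):
    """Clock frequency implied by a (throughput, cpb) pair."""
    return mb_per_s * bytes_per_mb * cpb


# (label, MB/s, cpb) pairs quoted in this message
results = [
    ("GCC112 ChaCha20", 1200, 2.9),
    ("GCC112 ChaCha8", 2370, 1.5),
    ("PowerMac G5 ChaCha20", 400, 4.9),
    ("PowerMac G5 ChaCha8", 725, 2.6),
]

for label, mbs, cpb in results:
    ghz = implied_clock_hz(mbs, cpb) / 1e9
    print(f"{label}: implied clock about {ghz:.2f} GHz")
```

Because the cpb values are rounded to one decimal, the implied clocks for the same machine will only agree approximately.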
Bernstein's results are at https://bench.cr.yp.to/results-stream.html. He's showing 9 cpb on a 2006 IBM PowerPC. His implementation has a lot of opportunities for improvement. Also see https://cr.yp.to/streamciphers/timings/estreambench/submissions/salsa20/chac....
Jeff