Maamoun TK <maamoun.tk@googlemail.com> writes:
> Great work. The implementation looks fine. I like the idea of using -16 instead of 16 as the rotation count: vspltisw is limited to immediates in the range -16 to 15, and vrlw uses only the low-order 5 bits of the count, which are the same for both -16 and 16.
I picked up that trick from Torbjörn Granlund's code.
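To make the -16 trick concrete, here is a minimal scalar sketch in C that models a single 32-bit lane. The function names splat_imm5 and rotl32_low5 are hypothetical helpers made up for illustration; they are not part of nettle and only mimic the behaviour of vspltisw and vrlw as described above.

```c
#include <stdint.h>

/* Hypothetical model of one vspltisw lane: the 5-bit signed immediate
   (-16..15) is sign-extended to a 32-bit word. */
uint32_t splat_imm5(int imm)
{
  return (uint32_t)(int32_t)imm;
}

/* Hypothetical model of one vrlw lane: rotate left by the low-order
   5 bits of the count word; all higher bits are ignored. */
uint32_t rotl32_low5(uint32_t x, uint32_t count)
{
  unsigned n = count & 31;
  return n ? (x << n) | (x >> (32 - n)) : x;
}
```

With these, splat_imm5(-16) yields 0xfffffff0, whose low-order 5 bits are 10000 in binary, i.e. 16, so rotl32_low5(x, splat_imm5(-16)) gives the same result as a rotation by 16, even though 16 itself is not a valid vspltisw immediate.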
> BTW, this implementation should work as is in big-endian mode without any hassle, because lxvw4x/stxvw4x are endianness-aware when loading/storing word values.
I've pushed it to a branch ppc-chacha-core. But it fails on big-endian powerpc64, see https://gitlab.com/gnutls/nettle/-/jobs/758348866.
And it looks like the error message from the first failing chacha test is truncated, which makes me suspect some error in function prologue or register usage, resulting in some invalid state when the function returns.
Compared to your assembly code, I don't set FUNC_ALIGN. Is that a problem?
Regards,
/Niels