Gustavo Serra Scalet gustavo.scalet@eldorado.org.br writes:
I coded a high performance sha256 algorithm for ppc64le:
https://git.lysator.liu.se/gut/nettle/commit/a8facb03a69787a93c91b426f32a0be...
Cool!
Tests were performed on different files, comparing the output against the original C implementation in Ubuntu 16.10's libnettle.so.6, using the following code: https://gist.github.com/gut/9622d4535a9e3f9ea3b0ded2762d4b28
You could also use the nettle-hash program.
I'm not familiar with ppc assembly, but here are some comments.
Are you using some special sha instructions (e.g., vshasigmaw), or only general simd instructions? Are they always available, or do we need some compile time and/or run time check?
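If a run time check turns out to be needed, one way to do it on Linux is to read the hardware capability bits from the auxiliary vector. The following is only a sketch under assumptions: it assumes a Linux target, and it redefines PPC_FEATURE2_VEC_CRYPTO locally (the value is the one from the Linux powerpc asm/cputable.h) so that it also compiles on other architectures. The function names has_vec_crypto and query_hwcap2 are made up for illustration and are not part of nettle.

```c
#if defined(__linux__)
#include <sys/auxv.h>   /* getauxval(), AT_HWCAP2 */
#endif

/* Bit for the vector crypto category (vshasigmaw and friends), as defined
   in the Linux powerpc <asm/cputable.h>. Defined here as a fallback so the
   sketch builds when that header is not available. */
#ifndef PPC_FEATURE2_VEC_CRYPTO
#define PPC_FEATURE2_VEC_CRYPTO 0x02000000
#endif

/* Returns 1 if the given AT_HWCAP2 word advertises vector crypto. */
int has_vec_crypto(unsigned long hwcap2)
{
  return (hwcap2 & PPC_FEATURE2_VEC_CRYPTO) != 0;
}

/* Reads AT_HWCAP2 on Linux/ppc64; returns 0 (no features) elsewhere,
   since the bits have a different meaning on other architectures. */
unsigned long query_hwcap2(void)
{
#if defined(__linux__) && defined(__powerpc64__) && defined(AT_HWCAP2)
  return getauxval(AT_HWCAP2);
#else
  return 0;
#endif
}
```

A caller would then select the assembly routine only when has_vec_crypto(query_hwcap2()) is nonzero, and fall back to the C implementation otherwise.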
In machine.m4, the aliases like
define(<r15>, <15>)
don't seem very helpful. If the assembly convention is that plain numbers are used to identify registers, we can stick to that for non-symbolic references, and then define more meaningful symbolic names on top of that. Also, I think it's good practice to use upper case for all m4 defines. E.g.,
define(<STATE>, 3)
SAVE_NVOLATILE and RESTORE_NVOLATILE look like a bit of overkill for a single assembly function, but I guess they make sense if you plan to write more ppc assembly.
For LOAD_H_VEC, what alignment would you need in order to avoid the unaligned load instructions? We could consider forcing larger alignment for struct sha256_ctx. Does it matter for performance?
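To make the alignment idea concrete, here is a rough C11 sketch of forcing 16-byte alignment on the state words. The struct demo_sha256_ctx is a made-up stand-in, not nettle's actual struct sha256_ctx, and the choice of 16 bytes is an assumption based on the size of one 128-bit vector register; is_aligned is a hypothetical helper.

```c
#include <stdalign.h>
#include <stdint.h>

/* Hypothetical context layout, for illustration only. */
struct demo_sha256_ctx {
  alignas(16) uint32_t state[8]; /* each half (4 words) starts on a 16-byte boundary */
  uint64_t count;                /* number of blocks processed */
  uint8_t block[64];             /* buffered input */
  unsigned index;                /* bytes used in block */
};

/* Returns 1 if p is a multiple of alignment, which must be a power of two. */
int is_aligned(const void *p, uintptr_t alignment)
{
  return ((uintptr_t)p & (alignment - 1)) == 0;
}
```

The alignas on the first member also raises the alignment of the whole struct, so static and automatic instances get it for free; heap-allocated contexts would still depend on what the allocator guarantees.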
UPDATE_SHA_STATE looks surprisingly complicated. I guess that's due to alignment issues, and that the representation in registers is some permutation of the words as they appear in memory?
Comments on the first uses of DEQUE are a bit confusing,
C Load a-h registers from the memory pointed by state
DEQUE(a, b, c, d)
DEQUE(e, f, g, h)
That's not actually a load from memory, right, but rather some permutation of the data?
You unroll the compression function completely: 880 instructions just for the expansion of the ROUND macros. Are op-codes 32 bits, so that this is 3.5 KB of code (plus the non-ROUND instructions)? This isn't terribly large, but unless you win significant performance from complete unrolling, I'd recommend unrolling only 8 or 16 rounds; that is likely enough to make loop overhead very small, and you use less of the instruction cache. (For comparison, the x86_64 version also ends up at 3.5 KB, with 16-fold unrolling.)
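To illustrate what partial unrolling means here, the following plain C sketch runs the 64 rounds as a loop of 8 iterations with 8 rounds each. It is not the ppc64le code and makes no claim about its performance; the function name sha256_compress_sketch is made up, and the ROUND macro below is a C stand-in that only shares its name with the macro in the patch. Rotating the argument order between rounds replaces the usual shuffling of a..h, so after 8 rounds the variables are back in their original roles and the loop body can simply repeat.

```c
#include <stdint.h>

static const uint32_t K[64] = {
  0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
  0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
  0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
  0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
  0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
  0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
  0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
  0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2
};

#define ROTR(x, n)   (((x) >> (n)) | ((x) << (32 - (n))))
#define CH(x, y, z)  (((x) & (y)) ^ (~(x) & (z)))
#define MAJ(x, y, z) (((x) & (y)) ^ ((x) & (z)) ^ ((y) & (z)))
#define BSIG0(x) (ROTR(x, 2) ^ ROTR(x, 13) ^ ROTR(x, 22))
#define BSIG1(x) (ROTR(x, 6) ^ ROTR(x, 11) ^ ROTR(x, 25))
#define SSIG0(x) (ROTR(x, 7) ^ ROTR(x, 18) ^ ((x) >> 3))
#define SSIG1(x) (ROTR(x, 17) ^ ROTR(x, 19) ^ ((x) >> 10))

/* One round. Only d and h are written; the caller rotates the argument
   order for the next round instead of moving values between variables. */
#define ROUND(a, b, c, d, e, f, g, h, n) do {                     \
    uint32_t t1 = h + BSIG1(e) + CH(e, f, g) + K[(n)] + w[(n)];   \
    uint32_t t2 = BSIG0(a) + MAJ(a, b, c);                        \
    d += t1;                                                      \
    h = t1 + t2;                                                  \
  } while (0)

/* Compresses one 64-byte block into state, unrolling 8 rounds per iteration. */
void sha256_compress_sketch(uint32_t state[8], const uint8_t block[64])
{
  uint32_t w[64];
  int i;

  /* Message schedule: 16 big-endian input words, then 48 expanded words. */
  for (i = 0; i < 16; i++)
    w[i] = ((uint32_t)block[4*i] << 24) | ((uint32_t)block[4*i + 1] << 16)
         | ((uint32_t)block[4*i + 2] << 8) | (uint32_t)block[4*i + 3];
  for (i = 16; i < 64; i++)
    w[i] = SSIG1(w[i - 2]) + w[i - 7] + SSIG0(w[i - 15]) + w[i - 16];

  uint32_t a = state[0], b = state[1], c = state[2], d = state[3];
  uint32_t e = state[4], f = state[5], g = state[6], h = state[7];

  for (i = 0; i < 64; i += 8) {
    ROUND(a, b, c, d, e, f, g, h, i + 0);
    ROUND(h, a, b, c, d, e, f, g, i + 1);
    ROUND(g, h, a, b, c, d, e, f, i + 2);
    ROUND(f, g, h, a, b, c, d, e, i + 3);
    ROUND(e, f, g, h, a, b, c, d, i + 4);
    ROUND(d, e, f, g, h, a, b, c, i + 5);
    ROUND(c, d, e, f, g, h, a, b, i + 6);
    ROUND(b, c, d, e, f, g, h, a, i + 7);
  }

  state[0] += a; state[1] += b; state[2] += c; state[3] += d;
  state[4] += e; state[5] += f; state[6] += g; state[7] += h;
}
```

With 8 rounds per iteration, the loop counter and branch are paid once per 8 rounds rather than once per round, while the code footprint stays at roughly an eighth of the fully unrolled version.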
And please add proper copyright headers. Are you the only author of this code?
Regards,
/Niels