Review points adjusted. See individual reviews on this separated branch:
https://git.lysator.liu.se/gut/nettle/commits/sha256-review
Points to be defined: 1) Fill copyright owner: https://git.lysator.liu.se/gut/nettle/commit/17e36f4e91d491ecc5a10ff1b567137...
2) keep or tweak --enable-ppc-vector: https://git.lysator.liu.se/gut/nettle/commit/d2371e2c4a469d6e9a3ffd9951662a4...
If I'm missing something, please advise
-----Original Message----- From: Gustavo Serra Scalet Sent: quarta-feira, 15 de março de 2017 09:27 To: 'Niels Möller' nisse@lysator.liu.se Cc: nettle-bugs@lists.lysator.liu.se Subject: RE: Up to 7x faster sha256 for ppc64le: Please review
-----Original Message----- From: Niels Möller [mailto:nisse@lysator.liu.se] Sent: terça-feira, 14 de março de 2017 17:54 To: Gustavo Serra Scalet gustavo.scalet@eldorado.org.br Cc: nettle-bugs@lists.lysator.liu.se Subject: Re: Up to 7x faster sha256 for ppc64le: Please review
I'm not familiar with ppc assembly, but some comments.
Are you using some special sha instructions (e.g., vshasigmaw), or only general simd instructions? Are they always available, or do we need some compile time and/or run time check?
The machine needs to be at least a POWER8 machine. As all previous PowerPC systems were big endian only, the endianness would be enough, but I guess a compile time check that would be more specific is __POWER8_VECTOR__ == 1 (available on gcc >= 5 and clang >= 3.7). Should I add that with a compile-flag like --x86-aesni ?
In machine.m4, the aliases like
define(<r15>, <15>)
doesn't seem very helpful. If the assembly convention is that plain numbers are used to identify registers, we can stick to that for non- symbolic references, and then define more meaningful symbolic names on top of that.
The problem is that it's not very readable as you can't tell registers and constants apart. E.g: $ grep lvx -m 1 sha256-compress.s lvx 9, 0, 3 $ objdump -d sha256-compress.o | grep lvx -m 1 50: ce 18 20 7d lvx v9,0,r3
The disassembler does differs the 3 from r3 for readability (but not the 0, which is a constant) even if the plain assembly is not indicating that. And I think that's a good thing.
And don't think like "lvx always has a constant as the second argument, you just need to know it" because it's only if it's zero: $ grep lvx sha256-compress.s | grep -m 1 1, lvx 29, 1, 0 $ objdump -d sha256-compress.o | grep lvx | grep -m 1 r1 11b8: ce 00 a1 7f lvx v29,r1,r0
So it can be confusing... and that happens for a bunch of instructions, therefore I'd like to leave it as it's even if the macros seem that dumb. Is that ok?
Also, I think it's good practice to use upper case for all m4 defines. E.g.,
define(<STATE>, 3)
Ok, I can manage to do that.
SAVE_NVOLATILE and RESTORE_NVOLATILE look a bit overkill for a single assembly function, but I guess they make sense if you plan more ppc assembly.
Sha512 will need different set of non-volatile registers so it'll be handy to have that macro.
For LOAD_H_VEC, what alignment would you need to not use load unaligned instructions? We could consider forcing larger alignment for struct sha256_ctx. Does it matter for performance?
16-bytes alignment. I didn't notice a significant change in performance but it made the development a little bit trickier...
The current treat-any-alignment-type version is: lvx $1, 0, state C load unaligned addi t0, state,16 lvsr t1, 0, t0 lvx $5, 0, t0 C load unaligned vperm $1, $5, $1, t1 C a = {a,b,c,d} addi t0, t0, 16 lvx aux, 0, t0 C load unaligned vperm $5, aux, $5, t1 C e = {e,f,g,h}
and if it were 16-bytes aligned, it would simply be: lvx $1, 0, state C a = {a,b,c,d} addi t0, state,16 lvx $5, 0, t0 C e = {e,f,g,h}
and we did it on all places (input, state and k) where vectors were being read/written from/to memory.
UPDATE_SHA_STATE looks surprisingly complicated. I guess it's alignment issues and that representation in registers is some
permutation of the
words as they appear in memory?
Mainly it's due to alignment. Also our vector implementation also spread one element at a vector on compression algorithm: +-------+----+----+----+----+ | | Vector Words: | | Name | #1 | #2 | #3 | #4 | +-------+----+----+----+----+ | Va | a | b | c | d | +-------+----+----+----+----+
But not when memory is read/written: +-------+----+----+----+----+ | | Vector Words: | | Name | #1 | #2 | #3 | #4 | +-------+----+----+----+----+ | Va | a | - | - | - | | Vb | b | - | - | - | | Vc | c | - | - | - | | Vd | d | - | - | - | +-------+----+----+----+----+
So there is some merging/unpacking happening after LOAD_H_VEC and on UPDATE_SHA_STATE (I should probably document this. Thanks for pointing it out)
Comments on the first uses of DEQUE are a bit confusing,
C Load a-h registers from the memory pointed by state DEQUE(a, b, c, d) DEQUE(e, f, g, h)
It's not any load from memory, right, but rather some permutation of the data?
Yes, I'll change DEQUE to UNPACK. I think it'll make more sense.
You unroll the compression function completely, 880 instructions just for the expansion of the ROUND macros. Are op-codes 32 bits, so that this is 3.5 KB code size (+ non-ROUND instructions)?
Every ppc64le instruction used is 4-bytes long.
This isn't terribly large, but unless you win significant performance from complete unrolling, I'd recommend unrolling only 8 or 16 rounds; that is likely enough to make loop overhead very small, and you use less of the instruction cache. (For comparison, the x86_64 versions also ends up at 3.5 KB, with 16 time unrolling).
I rolled from 16th to 64th round and I noticed no significant performance drop despite having more branches and branch misses. The instruction count of this new algorithm I verified by "objdump -d sha256-compress.o | wc -l" and it decreased from 1161 to 477.
I have nothing against this change. I'll send it along the other changes
And please add proper copyright headers. Are you the only author of this code?
My development group is involved on this task. I'll verify what to add as the copyright holder.
Thanks for the quick reply.