Hi,
I coded a high performance sha256 algorithm for ppc64le:
https://git.lysator.liu.se/gut/nettle/commit/a8facb03a69787a93c91b426f32a0be...
Tests were performed with different files and comparing it against the original C implementation by using the Ubuntu's 16.10 libnettle.so.6 by using the following code: https://gist.github.com/gut/9622d4535a9e3f9ea3b0ded2762d4b28
The results I got by using an ubuntu-16.10 iso as an input (around 700mb):
$ time ./sha256-test ~/ubuntu-16.10-server-ppc64el.iso d14bdb413ea6cdc8d9354fcbc37a834b7de0c23f992deb0c6764d0fd5d65408e
real 0m14.607s user 0m13.988s sys 0m0.616s $ export LD_LIBRARY_PATH=~/nettle # link with user and not system lib $ time ./sha256-test ~/ubuntu-16.10-server-ppc64el.iso d14bdb413ea6cdc8d9354fcbc37a834b7de0c23f992deb0c6764d0fd5d65408e
real 0m2.242s user 0m2.052s sys 0m0.188s
I also plan on doing the sha512 if I'm heading to the right direction.
Regards, Gustavo Serra Scalet
-----Original Message----- From: Nikos Mavrogiannopoulos [mailto:n.mavrogiannopoulos@gmail.com] Sent: terça-feira, 28 de fevereiro de 2017 06:07 To: Niels Möller nisse@lysator.liu.se Cc: Gustavo Serra Scalet gustavo.scalet@eldorado.org.br; nettle- bugs@lists.lysator.liu.se Subject: Re: Is the gitlab being used or not?
On Tue, Feb 28, 2017 at 7:48 AM, Niels Möller nisse@lysator.liu.se wrote:
Gustavo Serra Scalet gustavo.scalet@eldorado.org.br writes:
I noticed there is no merging/update activities on the gitlab
website:
https://git.lysator.liu.se/nettle/nettle/merge_requests?scope=all&sta te=all
I wonder if it's used.
It's used mainly for the plain git hosting. I rarely use the more advance features, and almost never access the web interface.
Would it be possible to disable then the merge request option in the gitlab interface? There are already 5 patches there.
Most likely you need to go to -> Edit project and disable the issue tracker and the merge commits for everyone (at least that's the process in the new gitlab).
The reason for asking is that I'm willing to integrate a ppc64le-specific implementation using SIMD registers for the SHA-2 algorithm so I would create an Issue to discuss it and then submitting the code as a Merge Request. But only if that's the way for upstreaming code on nettle. If not, please advise if this list can be used.
Discussion should take place on this list. You can mail patches here, or create a merge request on gitlab. But in the latter case, please send a note here too, otherwise it might be overlooked.
I attach a patch to document that process. I use CONTRIBUTION.md show it shows prominently on the gitlab interfaces.
regards, Nikos
Gustavo Serra Scalet gustavo.scalet@eldorado.org.br writes:
I coded a high performance sha256 algorithm for ppc64le:
https://git.lysator.liu.se/gut/nettle/commit/a8facb03a69787a93c91b426f32a0be...
Cool!
Tests were performed with different files and comparing it against the original C implementation by using the Ubuntu's 16.10 libnettle.so.6 by using the following code: https://gist.github.com/gut/9622d4535a9e3f9ea3b0ded2762d4b28
You could also use the nettle-hash program.
I'm not familiar with ppc assembly, but some comments.
Are you using some special sha instructions (e.g., vshasigmaw), or only general simd instructions? Are they always available, or do we need some compile time and/or run time check?
In machine.m4, the aliases like
define(<r15>, <15>)
doesn't seem very helpful. If the assembly convention is that plain numbers are used to identify registers, we can stick to that for non-symbolic references, and then define more meaningful symbolic names on top of that. Also, I think it's good practice to use upper case for all m4 defines. E.g.,
define(<STATE>, 3)
SAVE_NVOLATILE and RESTORE_NVOLATILE look a bit overkill for a single assembly function, but I guess they make sense if you plan more ppc assembly.
For LOAD_H_VEC, what alignment would you need to not use load unaligned instructions? We could consider forcing larger alignment for struct sha256_ctx. Does it matter for performance?
UPDATE_SHA_STATE looks surprisingly complicated. I guess it's alignment issues and that representation in registers is some permutation of the words as they appear in memory?
Comments on the first uses of DEQUE are a bit confusing,
C Load a-h registers from the memory pointed by state DEQUE(a, b, c, d) DEQUE(e, f, g, h)
It's not any load from memory, right, but rather some permutation of the data?
You unroll the compression function completely, 880 instructions just for the expansion of the ROUND macros. Are op-codes 32 bits, so that this is 3.5 KB code size (+ non-ROUND instructions)? This isn't terribly large, but unless you win significant performance from complete unrolling, I'd recommend unrolling only 8 or 16 rounds; that is likely enough to make loop overhead very small, and you use less of the instruction cache. (For comparison, the x86_64 versions also ends up at 3.5 KB, with 16 time unrolling).
And please add proper copyright headers. Are you the only author of this code?
Regards, /Niels
-----Original Message----- From: Niels Möller [mailto:nisse@lysator.liu.se] Sent: terça-feira, 14 de março de 2017 17:54 To: Gustavo Serra Scalet gustavo.scalet@eldorado.org.br Cc: nettle-bugs@lists.lysator.liu.se Subject: Re: Up to 7x faster sha256 for ppc64le: Please review
I'm not familiar with ppc assembly, but some comments.
Are you using some special sha instructions (e.g., vshasigmaw), or only general simd instructions? Are they always available, or do we need some compile time and/or run time check?
The machine needs to be at least a POWER8 machine. As all previous PowerPC systems were big endian only, the endianness would be enough, but I guess a compile time check that would be more specific is __POWER8_VECTOR__ == 1 (available on gcc >= 5 and clang >= 3.7). Should I add that with a compile-flag like --x86-aesni ?
In machine.m4, the aliases like
define(<r15>, <15>)
doesn't seem very helpful. If the assembly convention is that plain numbers are used to identify registers, we can stick to that for non- symbolic references, and then define more meaningful symbolic names on top of that.
The problem is that it's not very readable as you can't tell registers and constants apart. E.g: $ grep lvx -m 1 sha256-compress.s lvx 9, 0, 3 $ objdump -d sha256-compress.o | grep lvx -m 1 50: ce 18 20 7d lvx v9,0,r3
The disassembler does differs the 3 from r3 for readability (but not the 0, which is a constant) even if the plain assembly is not indicating that. And I think that's a good thing.
And don't think like "lvx always has a constant as the second argument, you just need to know it" because it's only if it's zero: $ grep lvx sha256-compress.s | grep -m 1 1, lvx 29, 1, 0 $ objdump -d sha256-compress.o | grep lvx | grep -m 1 r1 11b8: ce 00 a1 7f lvx v29,r1,r0
So it can be confusing... and that happens for a bunch of instructions, therefore I'd like to leave it as it's even if the macros seem that dumb. Is that ok?
Also, I think it's good practice to use upper case for all m4 defines. E.g.,
define(<STATE>, 3)
Ok, I can manage to do that.
SAVE_NVOLATILE and RESTORE_NVOLATILE look a bit overkill for a single assembly function, but I guess they make sense if you plan more ppc assembly.
Sha512 will need different set of non-volatile registers so it'll be handy to have that macro.
For LOAD_H_VEC, what alignment would you need to not use load unaligned instructions? We could consider forcing larger alignment for struct sha256_ctx. Does it matter for performance?
16-bytes alignment. I didn't notice a significant change in performance but it made the development a little bit trickier...
The current treat-any-alignment-type version is: lvx $1, 0, state C load unaligned addi t0, state,16 lvsr t1, 0, t0 lvx $5, 0, t0 C load unaligned vperm $1, $5, $1, t1 C a = {a,b,c,d} addi t0, t0, 16 lvx aux, 0, t0 C load unaligned vperm $5, aux, $5, t1 C e = {e,f,g,h}
and if it were 16-bytes aligned, it would simply be: lvx $1, 0, state C a = {a,b,c,d} addi t0, state,16 lvx $5, 0, t0 C e = {e,f,g,h}
and we did it on all places (input, state and k) where vectors were being read/written from/to memory.
UPDATE_SHA_STATE looks surprisingly complicated. I guess it's alignment issues and that representation in registers is some permutation of the words as they appear in memory?
Mainly it's due to alignment. Also our vector implementation also spread one element at a vector on compression algorithm: +-------+----+----+----+----+ | | Vector Words: | | Name | #1 | #2 | #3 | #4 | +-------+----+----+----+----+ | Va | a | b | c | d | +-------+----+----+----+----+
But not when memory is read/written: +-------+----+----+----+----+ | | Vector Words: | | Name | #1 | #2 | #3 | #4 | +-------+----+----+----+----+ | Va | a | - | - | - | | Vb | b | - | - | - | | Vc | c | - | - | - | | Vd | d | - | - | - | +-------+----+----+----+----+
So there is some merging/unpacking happening after LOAD_H_VEC and on UPDATE_SHA_STATE (I should probably document this. Thanks for pointing it out)
Comments on the first uses of DEQUE are a bit confusing,
C Load a-h registers from the memory pointed by state DEQUE(a, b, c, d) DEQUE(e, f, g, h)
It's not any load from memory, right, but rather some permutation of the data?
Yes, I'll change DEQUE to UNPACK. I think it'll make more sense.
You unroll the compression function completely, 880 instructions just for the expansion of the ROUND macros. Are op-codes 32 bits, so that this is 3.5 KB code size (+ non-ROUND instructions)?
Every ppc64le instruction used is 4-bytes long.
This isn't terribly large, but unless you win significant performance from complete unrolling, I'd recommend unrolling only 8 or 16 rounds; that is likely enough to make loop overhead very small, and you use less of the instruction cache. (For comparison, the x86_64 versions also ends up at 3.5 KB, with 16 time unrolling).
I rolled from 16th to 64th round and I noticed no significant performance drop despite having more branches and branch misses. The instruction count of this new algorithm I verified by "objdump -d sha256-compress.o | wc -l" and it decreased from 1161 to 477.
I have nothing against this change. I'll send it along the other changes
And please add proper copyright headers. Are you the only author of this code?
My development group is involved on this task. I'll verify what to add as the copyright holder.
Thanks for the quick reply.
Review points adjusted. See individual reviews on this separated branch:
https://git.lysator.liu.se/gut/nettle/commits/sha256-review
Points to be defined: 1) Fill copyright owner: https://git.lysator.liu.se/gut/nettle/commit/17e36f4e91d491ecc5a10ff1b567137...
2) keep or tweak --enable-ppc-vector: https://git.lysator.liu.se/gut/nettle/commit/d2371e2c4a469d6e9a3ffd9951662a4...
If I'm missing something, please advise
-----Original Message----- From: Gustavo Serra Scalet Sent: quarta-feira, 15 de março de 2017 09:27 To: 'Niels Möller' nisse@lysator.liu.se Cc: nettle-bugs@lists.lysator.liu.se Subject: RE: Up to 7x faster sha256 for ppc64le: Please review
-----Original Message----- From: Niels Möller [mailto:nisse@lysator.liu.se] Sent: terça-feira, 14 de março de 2017 17:54 To: Gustavo Serra Scalet gustavo.scalet@eldorado.org.br Cc: nettle-bugs@lists.lysator.liu.se Subject: Re: Up to 7x faster sha256 for ppc64le: Please review
I'm not familiar with ppc assembly, but some comments.
Are you using some special sha instructions (e.g., vshasigmaw), or only general simd instructions? Are they always available, or do we need some compile time and/or run time check?
The machine needs to be at least a POWER8 machine. As all previous PowerPC systems were big endian only, the endianness would be enough, but I guess a compile time check that would be more specific is __POWER8_VECTOR__ == 1 (available on gcc >= 5 and clang >= 3.7). Should I add that with a compile-flag like --x86-aesni ?
In machine.m4, the aliases like
define(<r15>, <15>)
doesn't seem very helpful. If the assembly convention is that plain numbers are used to identify registers, we can stick to that for non- symbolic references, and then define more meaningful symbolic names on top of that.
The problem is that it's not very readable as you can't tell registers and constants apart. E.g: $ grep lvx -m 1 sha256-compress.s lvx 9, 0, 3 $ objdump -d sha256-compress.o | grep lvx -m 1 50: ce 18 20 7d lvx v9,0,r3
The disassembler does differs the 3 from r3 for readability (but not the 0, which is a constant) even if the plain assembly is not indicating that. And I think that's a good thing.
And don't think like "lvx always has a constant as the second argument, you just need to know it" because it's only if it's zero: $ grep lvx sha256-compress.s | grep -m 1 1, lvx 29, 1, 0 $ objdump -d sha256-compress.o | grep lvx | grep -m 1 r1 11b8: ce 00 a1 7f lvx v29,r1,r0
So it can be confusing... and that happens for a bunch of instructions, therefore I'd like to leave it as it's even if the macros seem that dumb. Is that ok?
Also, I think it's good practice to use upper case for all m4 defines. E.g.,
define(<STATE>, 3)
Ok, I can manage to do that.
SAVE_NVOLATILE and RESTORE_NVOLATILE look a bit overkill for a single assembly function, but I guess they make sense if you plan more ppc assembly.
Sha512 will need different set of non-volatile registers so it'll be handy to have that macro.
For LOAD_H_VEC, what alignment would you need to not use load unaligned instructions? We could consider forcing larger alignment for struct sha256_ctx. Does it matter for performance?
16-bytes alignment. I didn't notice a significant change in performance but it made the development a little bit trickier...
The current treat-any-alignment-type version is: lvx $1, 0, state C load unaligned addi t0, state,16 lvsr t1, 0, t0 lvx $5, 0, t0 C load unaligned vperm $1, $5, $1, t1 C a = {a,b,c,d} addi t0, t0, 16 lvx aux, 0, t0 C load unaligned vperm $5, aux, $5, t1 C e = {e,f,g,h}
and if it were 16-bytes aligned, it would simply be: lvx $1, 0, state C a = {a,b,c,d} addi t0, state,16 lvx $5, 0, t0 C e = {e,f,g,h}
and we did it on all places (input, state and k) where vectors were being read/written from/to memory.
UPDATE_SHA_STATE looks surprisingly complicated. I guess it's alignment issues and that representation in registers is some
permutation of the
words as they appear in memory?
Mainly it's due to alignment. Also our vector implementation also spread one element at a vector on compression algorithm: +-------+----+----+----+----+ | | Vector Words: | | Name | #1 | #2 | #3 | #4 | +-------+----+----+----+----+ | Va | a | b | c | d | +-------+----+----+----+----+
But not when memory is read/written: +-------+----+----+----+----+ | | Vector Words: | | Name | #1 | #2 | #3 | #4 | +-------+----+----+----+----+ | Va | a | - | - | - | | Vb | b | - | - | - | | Vc | c | - | - | - | | Vd | d | - | - | - | +-------+----+----+----+----+
So there is some merging/unpacking happening after LOAD_H_VEC and on UPDATE_SHA_STATE (I should probably document this. Thanks for pointing it out)
Comments on the first uses of DEQUE are a bit confusing,
C Load a-h registers from the memory pointed by state DEQUE(a, b, c, d) DEQUE(e, f, g, h)
It's not any load from memory, right, but rather some permutation of the data?
Yes, I'll change DEQUE to UNPACK. I think it'll make more sense.
You unroll the compression function completely, 880 instructions just for the expansion of the ROUND macros. Are op-codes 32 bits, so that this is 3.5 KB code size (+ non-ROUND instructions)?
Every ppc64le instruction used is 4-bytes long.
This isn't terribly large, but unless you win significant performance from complete unrolling, I'd recommend unrolling only 8 or 16 rounds; that is likely enough to make loop overhead very small, and you use less of the instruction cache. (For comparison, the x86_64 versions also ends up at 3.5 KB, with 16 time unrolling).
I rolled from 16th to 64th round and I noticed no significant performance drop despite having more branches and branch misses. The instruction count of this new algorithm I verified by "objdump -d sha256-compress.o | wc -l" and it decreased from 1161 to 477.
I have nothing against this change. I'll send it along the other changes
And please add proper copyright headers. Are you the only author of this code?
My development group is involved on this task. I'll verify what to add as the copyright holder.
Thanks for the quick reply.
Gustavo Serra Scalet gustavo.scalet@eldorado.org.br writes:
Review points adjusted. See individual reviews on this separated branch:
https://git.lysator.liu.se/gut/nettle/commits/sha256-review
Points to be defined:
- Fill copyright owner:
https://git.lysator.liu.se/gut/nettle/commit/17e36f4e91d491ecc5a10ff1b567137...
This is a blocker. We don't do copyright assignments for nettle (mostly for historical reasons), and we definitely don't do any "contributor's license agreement"-silliness. But I still need to know origin of the code, preferable both who legally owns the copyright, and who actually wrote it.
- keep or tweak --enable-ppc-vector:
https://git.lysator.liu.se/gut/nettle/commit/d2371e2c4a469d6e9a3ffd9951662a4...
I think it makes sense to do something similar to --enable-arm-neon, with the default being adapting to the environment. Not sure if consulting /proc/cpuinfo is useful in this case, though, it may be sufficient to just check what the compiler or assembler accepts.
Or if all known ppc64el processors support it, maybe use of the vector instructions can be unconditional for that architecture.
Regards, /Niels
nettle-bugs@lists.lysator.liu.se