RE: Up to 7x faster sha256 for ppc64le: Please review

15 Mar 2017


      Review points adjusted. See individual reviews on this separated branch:
https://git.lysator.liu.se/gut/nettle/commits/sha256-review
Points to be defined:
1) Fill copyright owner: https://git.lysator.liu.se/gut/nettle/commit/17e36f4e91d491ecc5a10ff1b567137...
2) keep or tweak --enable-ppc-vector: https://git.lysator.liu.se/gut/nettle/commit/d2371e2c4a469d6e9a3ffd9951662a4...
If I'm missing something, please advise
...
-----Original Message-----
From: Gustavo Serra Scalet
Sent: quarta-feira, 15 de março de 2017 09:27
To: 'Niels Möller' nisse@lysator.liu.se
Cc: nettle-bugs@lists.lysator.liu.se
Subject: RE: Up to 7x faster sha256 for ppc64le: Please review
...
-----Original Message-----
From: Niels Möller [mailto:nisse@lysator.liu.se]
Sent: terça-feira, 14 de março de 2017 17:54
To: Gustavo Serra Scalet gustavo.scalet@eldorado.org.br
Cc: nettle-bugs@lists.lysator.liu.se
Subject: Re: Up to 7x faster sha256 for ppc64le: Please review
I'm not familiar with ppc assembly, but some comments.
Are you using some special sha instructions (e.g., vshasigmaw), or
only general simd instructions? Are they always available, or do we
need some compile time and/or run time check?
The machine needs to be at least a POWER8 machine. As all previous
PowerPC systems were big endian only, the endianness would be enough,
but I guess a compile time check that would be more specific is
__POWER8_VECTOR__ == 1 (available on gcc >= 5 and clang >= 3.7). Should
I add that with a compile-flag like --x86-aesni ?
...
In machine.m4, the aliases like
define(<r15>, <15>)
doesn't seem very helpful. If the assembly convention is that plain
numbers are used to identify registers, we can stick to that for non-
symbolic references, and then define more meaningful symbolic names on
top of that.
The problem is that it's not very readable as you can't tell registers
and constants apart. E.g:
$ grep lvx -m 1 sha256-compress.s
  lvx    9,  0,    3
$ objdump -d sha256-compress.o | grep lvx -m 1
      50:       ce 18 20 7d     lvx     v9,0,r3
The disassembler does differs the 3 from r3 for readability (but not the
0, which is a constant) even if the plain assembly is not indicating
that. And I think that's a good thing.
And don't think like "lvx always has a constant as the second argument,
you just need to know it" because it's only if it's zero:
$ grep lvx sha256-compress.s | grep -m 1 1,
    lvx 29, 1, 0
$ objdump -d sha256-compress.o | grep lvx | grep -m 1 r1
    11b8:       ce 00 a1 7f     lvx     v29,r1,r0
So it can be confusing... and that happens for a bunch of instructions,
therefore I'd like to leave it as it's even if the macros seem that
dumb. Is that ok?
...
Also, I think it's good practice to use upper case for all
m4 defines. E.g.,
define(<STATE>, 3)
Ok, I can manage to do that.
...
SAVE_NVOLATILE and RESTORE_NVOLATILE look a bit overkill for a single
assembly function, but I guess they make sense if you plan more ppc
assembly.
Sha512 will need different set of non-volatile registers so it'll be
handy to have that macro.
...
For LOAD_H_VEC, what alignment would you need to not use load
unaligned instructions? We could consider forcing larger alignment for
struct sha256_ctx. Does it matter for performance?
16-bytes alignment. I didn't notice a significant change in performance
but it made the development a little bit trickier...
The current treat-any-alignment-type version is:
  lvx    $1,  0,    state         C load unaligned
  addi   t0,  state,16
  lvsr   t1,  0,    t0
  lvx    $5,  0,    t0            C load unaligned
  vperm  $1,  $5,   $1,  t1       C a = {a,b,c,d}
  addi   t0,  t0,   16
  lvx    aux, 0,    t0            C load unaligned
  vperm  $5,  aux,  $5,  t1       C e = {e,f,g,h}
and if it were 16-bytes aligned, it would simply be:
  lvx    $1,  0,    state         C a = {a,b,c,d}
  addi   t0,  state,16
  lvx    $5,  0,    t0            C e = {e,f,g,h}
and we did it on all places (input, state and k) where vectors were
being read/written from/to memory.
...
UPDATE_SHA_STATE looks surprisingly complicated. I guess it's
alignment issues and that representation in registers is some
permutation of the
...
words as they appear in memory?
Mainly it's due to alignment. Also our vector implementation also spread
one element at a vector on compression algorithm:
+-------+----+----+----+----+
|       |   Vector Words:   |
| Name  | #1 | #2 | #3 | #4 |
+-------+----+----+----+----+
| Va    | a  | b  | c  | d  |
+-------+----+----+----+----+
But not when memory is read/written:
+-------+----+----+----+----+
|       |   Vector Words:   |
| Name  | #1 | #2 | #3 | #4 |
+-------+----+----+----+----+
| Va    | a  | -  | -  | -  |
| Vb    | b  | -  | -  | -  |
| Vc    | c  | -  | -  | -  |
| Vd    | d  | -  | -  | -  |
+-------+----+----+----+----+
So there is some merging/unpacking happening after LOAD_H_VEC and on
UPDATE_SHA_STATE (I should probably document this. Thanks for pointing
it out)
...
Comments on the first uses of DEQUE are a bit confusing,
C Load a-h registers from the memory pointed by state
  DEQUE(a, b, c, d)
  DEQUE(e, f, g, h)
It's not any load from memory, right, but rather some permutation of
the data?
Yes, I'll change DEQUE to UNPACK. I think it'll make more sense.
...
You unroll the compression function completely, 880 instructions just
for the expansion of the ROUND macros. Are op-codes 32 bits, so that
this is 3.5 KB code size (+ non-ROUND instructions)?
Every ppc64le instruction used is 4-bytes long.
...
This isn't terribly
large, but unless you win significant performance from complete
unrolling, I'd recommend unrolling only 8 or 16 rounds; that is likely
enough to make loop overhead very small, and you use less of the
instruction cache. (For comparison, the x86_64 versions also ends up
at
3.5 KB, with 16 time unrolling).
I rolled from 16th to 64th round and I noticed no significant
performance drop despite having more branches and branch misses. The
instruction count of this new algorithm I verified by "objdump -d
sha256-compress.o | wc -l" and it decreased from 1161 to 477.
I have nothing against this change. I'll send it along the other changes
...
And please add proper copyright headers. Are you the only author of
this code?
My development group is involved on this task. I'll verify what to add
as the copyright holder.
Thanks for the quick reply.

2025

2024

2023

2022

2021

2020

2019

2018

2017

2016

2015

2014

2013

2012

2011

2010

2009

2008

2007

2006

2005

2004

2003

2002

RE: Up to 7x faster sha256 for ppc64le: Please review