Re: Up to 7x faster sha256 for ppc64le: Please review

14 Mar 2017


Gustavo Serra Scalet <gustavo.scalet@eldorado.org.br> writes:

> I coded a high performance sha256 algorithm for ppc64le:
> https://git.lysator.liu.se/gut/nettle/commit/a8facb03a69787a93c91b426f32a0be...

Cool!

> Tests were performed with different files and comparing it against the
> original C implementation by using the Ubuntu's 16.10 libnettle.so.6
> by using the following code:
> https://gist.github.com/gut/9622d4535a9e3f9ea3b0ded2762d4b28

You could also use the nettle-hash program.

I'm not familiar with ppc assembly, but here are some comments.

Are you using some special sha instructions (e.g., vshasigmaw), or only
general simd instructions? Are they always available, or do we need some
compile time and/or run time check?

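
For illustration, a run time check could look roughly like the C sketch
below. It assumes Linux with glibc's getauxval(), and it assumes that
vshasigmaw belongs to the vector crypto facility that the kernel reports
through the PPC_FEATURE2_VEC_CRYPTO bit of AT_HWCAP2. The function name
have_ppc_vec_crypto is hypothetical and not part of nettle.

```c
#include <stdio.h>

#if defined(__linux__)
#include <sys/auxv.h>   /* getauxval(), AT_HWCAP2 */
#endif

/* Fallback for non-powerpc builds; value as in the Linux powerpc
   <asm/cputable.h> header (stated here as an assumption). */
#ifndef PPC_FEATURE2_VEC_CRYPTO
#define PPC_FEATURE2_VEC_CRYPTO 0x02000000
#endif

/* Hypothetical helper: returns 1 if the kernel reports the vector
   crypto facility on a 64-bit PowerPC Linux system, 0 otherwise
   (including on every other platform). */
static int
have_ppc_vec_crypto(void)
{
#if defined(__linux__) && defined(__powerpc64__) && defined(AT_HWCAP2)
  return (getauxval(AT_HWCAP2) & PPC_FEATURE2_VEC_CRYPTO) != 0;
#else
  return 0;
#endif
}
```

A caller would pick the assembly routine when have_ppc_vec_crypto()
returns 1 and fall back to the C implementation otherwise. A compile
time check alone would only tell what the build machine supports.
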
In machine.m4, the aliases like

  define(<r15>, <15>)

don't seem very helpful. If the assembly convention is that plain
numbers are used to identify registers, we can stick to that for
non-symbolic references, and then define more meaningful symbolic names
on top of that. Also, I think it's good practice to use upper case for
all m4 defines. E.g.,

  define(<STATE>, 3)

SAVE_NVOLATILE and RESTORE_NVOLATILE look a bit overkill for a single
assembly function, but I guess they make sense if you plan more ppc
assembly.

For LOAD_H_VEC, what alignment would you need in order to avoid the
unaligned load instructions? We could consider forcing larger alignment
for struct sha256_ctx. Does it matter for performance?

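
As a sketch of what forcing larger alignment could look like in C11:
the struct below is a hypothetical layout for illustration only, not
nettle's actual struct sha256_ctx, and the 16-byte figure is an
assumption chosen to match the width of a 128-bit vector register.

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical context layout, for illustration only. */
struct example_sha256_ctx
{
  alignas(16) uint32_t state[8];  /* a..h, 32 bytes, 16-byte aligned */
  uint64_t count;                 /* number of blocks processed */
  unsigned index;                 /* bytes currently held in block */
  uint8_t block[64];              /* partial input block */
};

/* Returns nonzero if p is aligned to a 16-byte boundary. */
static int
is_aligned_16(const void *p)
{
  return ((uintptr_t) p & 15) == 0;
}
```

With alignas on the first member, the whole struct gets 16-byte
alignment, so automatic and static instances satisfy is_aligned_16.
Code that places a context at an arbitrary offset inside a byte buffer
would lose that guarantee.
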
UPDATE_SHA_STATE looks surprisingly complicated. I guess it's alignment
issues, and that the representation in registers is some permutation of
the words as they appear in memory?

Comments on the first uses of DEQUE are a bit confusing,

  C Load a-h registers from the memory pointed by state
    DEQUE(a, b, c, d)
    DEQUE(e, f, g, h)

That's not a load from memory, right, but rather some permutation of
the data?

You unroll the compression function completely: 880 instructions just
for the expansion of the ROUND macros. Are op-codes 32 bits, so that
this is 3.5 KB of code (plus the non-ROUND instructions)? This isn't
terribly large, but unless you win significant performance from complete
unrolling, I'd recommend unrolling only 8 or 16 rounds; that is likely
enough to make loop overhead very small, and you use less of the
instruction cache. (For comparison, the x86_64 version also ends up at
3.5 KB, with 16-fold unrolling.)

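
The size estimate is 880 * 4 = 3520 bytes if every op-code is 4 bytes.
To show what partial unrolling means, here is a minimal portable C
sketch of the SHA-256 compression function with 8 rounds per loop
iteration. It is not the code under review and not nettle's
implementation: the name sha256_compress_sketch is hypothetical, and for
brevity it expands the whole message schedule up front instead of
interleaving expansion with the rounds.

```c
#include <stdint.h>

#define ROTR(x, n) (((x) >> (n)) | ((x) << (32 - (n))))
#define CH(x, y, z)  (((x) & (y)) ^ (~(x) & (z)))
#define MAJ(x, y, z) (((x) & (y)) ^ ((x) & (z)) ^ ((y) & (z)))
#define S0(x) (ROTR(x, 2) ^ ROTR(x, 13) ^ ROTR(x, 22))
#define S1(x) (ROTR(x, 6) ^ ROTR(x, 11) ^ ROTR(x, 25))
#define s0(x) (ROTR(x, 7) ^ ROTR(x, 18) ^ ((x) >> 3))
#define s1(x) (ROTR(x, 17) ^ ROTR(x, 19) ^ ((x) >> 10))

/* One round. The caller rotates the roles of a..h between rounds, so
   no data needs to be moved between variables. */
#define ROUND(a, b, c, d, e, f, g, h, k, w) do {        \
    uint32_t t = (h) + S1(e) + CH(e, f, g) + (k) + (w); \
    (d) += t;                                           \
    (h) = t + S0(a) + MAJ(a, b, c);                     \
  } while (0)

static const uint32_t K[64] = {
  0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5,
  0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
  0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3,
  0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
  0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc,
  0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
  0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7,
  0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
  0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13,
  0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
  0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3,
  0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
  0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5,
  0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
  0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208,
  0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2,
};

/* Processes one 64-byte block and updates state in place. */
static void
sha256_compress_sketch(uint32_t state[8], const uint8_t block[64])
{
  uint32_t w[64];
  uint32_t a, b, c, d, e, f, g, h;
  unsigned i;

  /* Read the block as 16 big-endian words, then expand to 64 words. */
  for (i = 0; i < 16; i++)
    w[i] = ((uint32_t) block[4*i] << 24) | ((uint32_t) block[4*i+1] << 16)
      | ((uint32_t) block[4*i+2] << 8) | (uint32_t) block[4*i+3];
  for (i = 16; i < 64; i++)
    w[i] = s1(w[i-2]) + w[i-7] + s0(w[i-15]) + w[i-16];

  a = state[0]; b = state[1]; c = state[2]; d = state[3];
  e = state[4]; f = state[5]; g = state[6]; h = state[7];

  /* 8 rounds per iteration: after 8 rotations every variable is back
     in its original role, so the loop body is the same each time. */
  for (i = 0; i < 64; i += 8)
    {
      ROUND(a, b, c, d, e, f, g, h, K[i+0], w[i+0]);
      ROUND(h, a, b, c, d, e, f, g, K[i+1], w[i+1]);
      ROUND(g, h, a, b, c, d, e, f, K[i+2], w[i+2]);
      ROUND(f, g, h, a, b, c, d, e, K[i+3], w[i+3]);
      ROUND(e, f, g, h, a, b, c, d, K[i+4], w[i+4]);
      ROUND(d, e, f, g, h, a, b, c, K[i+5], w[i+5]);
      ROUND(c, d, e, f, g, h, a, b, K[i+6], w[i+6]);
      ROUND(b, c, d, e, f, g, h, a, K[i+7], w[i+7]);
    }

  state[0] += a; state[1] += b; state[2] += c; state[3] += d;
  state[4] += e; state[5] += f; state[6] += g; state[7] += h;
}
```

The loop adds one counter update and one branch per 8 rounds, while the
round code is 8 rounds long instead of 64.
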
And please add proper copyright headers. Are you the only author of this
code?

Regards,
/Niels

-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
