Eric Richter erichte@linux.ibm.com writes:
Dropping p8 support allows the use of the lxvb16x instruction, whose result does not need to be permuted; however, that too is only a negligible performance improvement, at the cost of dropping a whole CPU set. So I see a few options:
A) leave as-is, and consider storing the permute mask in a VSX register
B) drop p8 support and use lxvb16x
C) have a compile-time switch to use the permute on p8, and the single instruction on p9 and up.
I'd say leave as is (unless we find some way to get a spare vector register).
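To make the tradeoff concrete, here is a minimal sketch of the load-side difference (untested; lxvb16x is ISA 3.0, i.e., p9 and up, and I'm borrowing the IV, TC0, INPUT and VT0 names from the patch):

	C p8, little-endian: doubleword load followed by a byte permute,
	C which needs the swap mask kept live in VT0
	lxvd2x VSR(IV(0)), TC0, INPUT
	vperm IV(0), IV(0), IV(0), VT0

	C p9 and up: single byte-reversed load, no mask register needed
	lxvb16x VSR(IV(0)), TC0, INPUT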
v3:
- use protected zone instead of allocating stack space
- add GPR constants for multiples of 4 for the loads
  - around +3.4 MB/s for sha256 update
- move extend logic to its own macro called by EXTENDROUND
- use 8 VSX registers to store previous state instead of the stack
  - around +11.0 MB/s for sha256 update
I think I'd be happy to merge this version, and do any incremental improvements on top of that. Some comments below:
+C ROUND(A B C D E F G H R EXT)
+define(`ROUND', `
+	vadduwm VT1, VK, IV($9)		C VT1: k+W
+	vadduwm VT4, $8, VT1		C VT4: H+k+W
+	lxvw4x VSR(VK), TK, K		C Load Key
+	addi TK, TK, 4			C Increment Pointer to next key
+	vadduwm VT2, $4, $8		C VT2: H+D
+	vadduwm VT2, VT2, VT1		C VT2: H+D+k+W
Could the above two instructions be changed to
	vadduwm VT2, VT4, $4		C Should be the same, (H+k+W) + D
(which would need one less register)? I realize there's a slight change in the dependency chain. Do you know how many cycles one of these rounds takes, and what the bottleneck is? (I would guess either latency of the dependency chain between rounds, throughput of one of the execution units, or the instruction issue rate.)
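I.e., the round would end with something like this (untested sketch, same register names as in the patch):

	vadduwm VT1, VK, IV($9)		C VT1: k+W
	vadduwm VT4, $8, VT1		C VT4: H+k+W
	lxvw4x VSR(VK), TK, K		C Load Key
	addi TK, TK, 4			C Increment Pointer to next key
	vadduwm VT2, VT4, $4		C VT2: (H+k+W)+D, reusing VT4,
					C replacing the two vadduwm above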
+define(`LOAD', `
+	IF_BE(`lxvw4x VSR(IV($1)), m4_unquote(TC`'eval(($1 % 4) * 4)), INPUT')
+	IF_LE(`
+	lxvd2x VSR(IV($1)), m4_unquote(TC`'eval(($1 % 4) * 4)), INPUT
+	vperm IV($1), IV($1), IV($1), VT0
+	')
+')

+define(`DOLOADS', `
+	IF_LE(`DATA_LOAD_VEC(VT0, .load_swap, T1)')
+	LOAD(0)
+	LOAD(1)
+	LOAD(2)
+	LOAD(3)
If you pass the right TCx register as argument to the load macro, you don't need the m4 eval thing, which could make it a bit more readable, imo.
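Something like this (untested; I'm assuming TC0/TC4/TC8/TC12 are the register names the eval() currently expands to):

	define(`LOAD', `
		C $2 is the index register, passed explicitly
		IF_BE(`lxvw4x VSR(IV($1)), $2, INPUT')
		IF_LE(`
		lxvd2x VSR(IV($1)), $2, INPUT
		vperm IV($1), IV($1), IV($1), VT0
		')
	')

	define(`DOLOADS', `
		IF_LE(`DATA_LOAD_VEC(VT0, .load_swap, T1)')
		LOAD(0, TC0)
		LOAD(1, TC4)
		LOAD(2, TC8)
		LOAD(3, TC12)
	')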
+	C Store non-volatile registers
+	li T0, -8
+	li T1, -24
+	stvx v20, T0, SP
+	stvx v21, T1, SP
+	subi T0, T0, 32
+	subi T1, T1, 32
This could probably be arranged with fewer instructions by having one register that is decremented as we move down in the guard area, and registers with constant values for indexing.
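For example (untested sketch; since stvx ignores the low four bits of the effective address, the explicit 16-byte offsets below match what the current -8/-24 offsets round down to):

	li T1, -16
	subi T0, SP, 16
	stvx v20, 0, T0		C at SP-16
	stvx v21, T1, T0	C at SP-32
	subi T0, T0, 32
	stvx v22, 0, T0		C at SP-48
	stvx v23, T1, T0	C at SP-64
	C ...continue decrementing T0 for v24-v31

That's three instructions per pair of stores instead of four.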
+	C Reload initial state from VSX registers
+	xxlor VSR(VT0), VSXA, VSXA
+	xxlor VSR(VT1), VSXB, VSXB
+	xxlor VSR(VT2), VSXC, VSXC
+	xxlor VSR(VT3), VSXD, VSXD
+	xxlor VSR(VT4), VSXE, VSXE
+	xxlor VSR(SIGA), VSXF, VSXF
+	xxlor VSR(SIGE), VSXG, VSXG
+	xxlor VSR(VK), VSXH, VSXH
+	vadduwm VSA, VSA, VT0
+	vadduwm VSB, VSB, VT1
+	vadduwm VSC, VSC, VT2
+	vadduwm VSD, VSD, VT3
+	vadduwm VSE, VSE, VT4
+	vadduwm VSF, VSF, SIGA
+	vadduwm VSG, VSG, SIGE
+	vadduwm VSH, VSH, VK
It's a pity that there seems to be no useful xxadd* instruction. Do you need all eight temporary registers, or would you get the same speed doing just four at a time, i.e., 4 xxlor instructions, 4 vadduwm, 4 xxlor, 4 vadduwm? Is there no alias "xxmov" or the like that could be used instead of xxlor?
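I.e., something along these lines (untested; whether it's as fast depends on how well the two halves schedule):

	xxlor VSR(VT0), VSXA, VSXA
	xxlor VSR(VT1), VSXB, VSXB
	xxlor VSR(VT2), VSXC, VSXC
	xxlor VSR(VT3), VSXD, VSXD
	vadduwm VSA, VSA, VT0
	vadduwm VSB, VSB, VT1
	vadduwm VSC, VSC, VT2
	vadduwm VSD, VSD, VT3
	C reuse the same four temporaries for the second half
	xxlor VSR(VT0), VSXE, VSXE
	xxlor VSR(VT1), VSXF, VSXF
	xxlor VSR(VT2), VSXG, VSXG
	xxlor VSR(VT3), VSXH, VSXH
	vadduwm VSE, VSE, VT0
	vadduwm VSF, VSF, VT1
	vadduwm VSG, VSG, VT2
	vadduwm VSH, VSH, VT3

which would leave SIGA, SIGE and VK untouched here.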
Thanks for the update! /Niels