Eric Richter <erichte@linux.ibm.com> writes:
This patch introduces an optimized powerpc64 assembly implementation of sha256-compress-n. It takes advantage of the vshasigma instruction, and unrolls the loops to make better use of instruction-level parallelism.
Thanks. I'm now having a closer read of the assembly code. Comments below.
+C ROUND(A B C D E F G H R EXT)
+define(`ROUND', `
- vadduwm VT1, VK, IV($9) C VT1: k+W
- vadduwm VT4, $8, VT1 C VT4: H+k+W
- lxvw4x VSR(VK), TK, K C Load Key
- addi TK, TK, 4 C Increment Pointer to next key
- vadduwm VT2, $4, $8 C VT2: H+D
- vadduwm VT2, VT2, VT1 C VT2: H+D+k+W
- vshasigmaw SIGE, $5, 1, 0b1111 C Sigma(E) Se
- vshasigmaw SIGA, $1, 1, 0 C Sigma(A) Sa
- vxor VT3, $2, $3 C VT3: b^c
- vsel VT0, $7, $6, $5 C VT0: Ch.
- vsel VT3, $3, $1, VT3 C VT3: Maj(a,b,c)
- vadduwm VT4, VT4, VT0 C VT4: Hkw + Ch.
- vadduwm VT3, VT3, VT4 C VT3: HkW + Ch. + Maj.
- vadduwm VT0, VT0, VT2 C VT0: Ch. + DHKW
- vadduwm $8, SIGE, SIGA C Anext: Se + Sa
- vadduwm $4, VT0, SIGE C Dnext: Ch. + DHKW + Se
- vadduwm $8, $8, VT3 C Anext: Se+Sa+HkW+Ch.+Maj.
- C Schedule (data) for 16th round in future
- C Extend W[i]
- ifelse(`$10', `1', `
vshasigmaw SIGE, IV($9 + 14), 0, 0b1111
vshasigmaw SIGA, IV($9 + 1), 0, 0b0000
vadduwm IV($9), IV($9), SIGE
vadduwm IV($9), IV($9), SIGA
vadduwm IV($9), IV($9), IV($9 + 9)
- ')
+')
I think it would be a bit simpler to move the extend logic out into its own macro.
+define(`EXTENDROUND', `ROUND($1, $2, $3, $4, $5, $6, $7, $8, $9, 1)')
If you do that, then you would define
define(`EXTENDROUND', `ROUND($1, $2, $3, $4, $5, $6, $7, $8, $9) EXTEND($9)')
(In other related code, input expansion is done at the beginning of a round iteration rather than at the end, but doing it at the end like you do may give better scheduling).
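For reference, such an EXTEND macro could look something like the below (just the ifelse body above, with $9 renamed to $1; untested sketch):

    define(`EXTEND', `
    C Extend W[i]: W[i] += s1(W[i+14]) + s0(W[i+1]) + W[i+9]
    vshasigmaw SIGE, IV($1 + 14), 0, 0b1111
    vshasigmaw SIGA, IV($1 + 1), 0, 0b0000
    vadduwm IV($1), IV($1), SIGE
    vadduwm IV($1), IV($1), SIGA
    vadduwm IV($1), IV($1), IV($1 + 9)
    ')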
+define(`LOAD', `
- IF_BE(`lxvw4x VSR(IV($1)), 0, INPUT')
- IF_LE(`
lxvd2x VSR(IV($1)), 0, INPUT
vperm IV($1), IV($1), IV($1), VT0
- ')
- addi INPUT, INPUT, 16
+')
+define(`DOLOADS', `
- IF_LE(`DATA_LOAD_VEC(VT0, .load_swap, T1)')
Could you have a dedicated register for the permutation constant, and load it only once at function entry? If you have general registers to spare, it could also make sense to use, e.g., three registers for the constant values 16, 32 and 48, and use them for indexing. Then you don't need to update the INPUT pointer as often, and the same constants can be used for other load/store sequences as well.
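Something like this is what I have in mind (register names TC16, TC32, TC48 made up for illustration; on little-endian each vector still needs the vperm with the permutation constant, loaded once at entry):

    C At function entry:
    li TC16, 16
    li TC32, 32
    li TC48, 48
    IF_LE(`DATA_LOAD_VEC(VT0, .load_swap, T1)')

    C Loads can then use indexed addressing, with a single pointer update:
    lxvd2x VSR(IV(0)), 0, INPUT
    lxvd2x VSR(IV(1)), TC16, INPUT
    lxvd2x VSR(IV(2)), TC32, INPUT
    lxvd2x VSR(IV(3)), TC48, INPUT
    addi INPUT, INPUT, 64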
- LOAD(0)
- LOAD(1)
- LOAD(2)
- LOAD(3)
+PROLOGUE(_nettle_sha256_compress_n)
- cmpwi 0, NUMBLOCKS, 0
- ble 0, .done
- mtctr NUMBLOCKS
- C Store non-volatile registers
- subi SP, SP, 64+(12*16)
- std T0, 24(SP)
- std T1, 16(SP)
- std COUNT, 8(SP)
For save/restore of registers, I prefer to use the register names, not the defined symbols. And T0, T1, COUNT are defined to use r7, r8, r10, which *are* volatile, right?
Does the data stored fit in the 288 byte "protected zone"? If so, probably best to not modify the stack pointer.
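By my count, 12 vector registers plus 3 doublewords is 12*16 + 3*8 = 216 bytes, which does fit. Assuming the function makes no calls, the saves could then use negative offsets from the unmodified stack pointer, something like (untested sketch):

    C 216 bytes of non-volatile state fits in the 288-byte protected zone
    li T0, -16
    stvx v20, T0, SP
    li T0, -32
    stvx v21, T0, SP
    C ... and so on through v31, then the doublewords below those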
- li T0, 32
- stvx v20, 0, SP
- subi T0, T0, 16
- stvx v21, T0, SP
Here it would also help a bit to allocate constants 16, 32, 48 in registers.
- subi T0, T0, 16
- stvx v22, T0, SP
- subi T0, T0, 16
- stvx v23, T0, SP
- subi T0, T0, 16
- stvx v24, T0, SP
- subi T0, T0, 16
- stvx v25, T0, SP
- subi T0, T0, 16
- stvx v26, T0, SP
- subi T0, T0, 16
- stvx v27, T0, SP
- subi T0, T0, 16
- stvx v28, T0, SP
- subi T0, T0, 16
- stvx v29, T0, SP
- subi T0, T0, 16
- stvx v30, T0, SP
- subi T0, T0, 16
- stvx v31, T0, SP
- C Load state values
- li T0, 16
- lxvw4x VSR(VSA), 0, STATE C VSA contains A,B,C,D
- lxvw4x VSR(VSE), T0, STATE C VSE contains E,F,G,H
+.loop:
- li TK, 0
- lxvw4x VSR(VK), TK, K
- addi TK, TK, 4
- DOLOADS
- C "permute" state from VSA containing A,B,C,D into VSA,VSB,VSC,VSD
Can you give a bit more detail on this permutation? Do the main round operations use only 32 bits each from the state registers? Is there no reasonable way to use a more compact representation?
- vsldoi VSB, VSA, VSA, 4
- vsldoi VSF, VSE, VSE, 4
- vsldoi VSC, VSA, VSA, 8
- vsldoi VSG, VSE, VSE, 8
- vsldoi VSD, VSA, VSA, 12
- vsldoi VSH, VSE, VSE, 12
- EXTENDROUNDS
- EXTENDROUNDS
- EXTENDROUNDS
- NOEXTENDROUNDS
- C Reload initial state from stack
- li T0, 16
- lxvw4x VSR(VT0), 0, STATE C VSA contains A,B,C,D
- lxvw4x VSR(VT1), T0, STATE C VSE contains E,F,G,H
- C Repack VSA,VSB,VSC,VSD into VSA,VSE for storing
- vmrghw VSA, VSA, VSB
- vmrghw VSC, VSC, VSD
- vmrghw VSE, VSE, VSF
- vmrghw VSG, VSG, VSH
- xxmrghd VSR(VSA), VSR(VSA), VSR(VSC)
- xxmrghd VSR(VSE), VSR(VSE), VSR(VSG)
- vadduwm VSA, VSA, VT0
- vadduwm VSE, VSE, VT1
It seems unfortunate to have to do this conversion for each iteration of the loop. It would be nice if the state could be converted to the most efficient form before entering the loop, and not converted back until after loop exit. But we probably don't have enough registers to keep the old state exploded into many registers, and load/store of the exploded state doesn't seem that attractive either.
- li T0, 16
- stxvw4x VSR(VSA), 0, STATE
- stxvw4x VSR(VSE), T0, STATE
- bdnz .loop
Regards,
/Niels Möller