Maamoun TK <maamoun.tk@googlemail.com> writes:
+L16x_round_loop:
- lxvd2x KX,10,KEYS
- vperm K,K,K,swap_mask
- vncipher S0,S0,ZERO
- vncipher S1,S1,ZERO
- vncipher S2,S2,ZERO
- vncipher S3,S3,ZERO
- vncipher S4,S4,ZERO
- vncipher S5,S5,ZERO
- vncipher S6,S6,ZERO
- vncipher S7,S7,ZERO
- vncipher S8,S8,ZERO
- vncipher S9,S9,ZERO
- vncipher S10,S10,ZERO
- vncipher S11,S11,ZERO
- vncipher S12,S12,ZERO
- vncipher S13,S13,ZERO
- vncipher S14,S14,ZERO
- vncipher S15,S15,ZERO
- vxor S0,S0,K
- vxor S1,S1,K
- vxor S2,S2,K
- vxor S3,S3,K
- vxor S4,S4,K
- vxor S5,S5,K
- vxor S6,S6,K
- vxor S7,S7,K
- vxor S8,S8,K
- vxor S9,S9,K
- vxor S10,S10,K
- vxor S11,S11,K
- vxor S12,S12,K
- vxor S13,S13,K
- vxor S14,S14,K
- vxor S15,S15,K
- addi 10,10,0x10
- bdnz L16x_round_loop
Do you really need to go all the way to 16 blocks in parallel to saturate the execution units? I'm used to defining throughput and latency of an instruction (e.g., vncipher) as follows:
Throughput: The number of independent vncipher instructions that can be executed per cycle. Can be measured by benchmarking a loop of independent instructions.
Latency: The number of cycles from the start of execution of a vncipher instruction until execution of an instruction depending on the vncipher result can start. Can be measured by benchmarking a loop where each instruction depends on the result of the preceding instruction.
Do you know the throughput and latency of the vncipher and vxor instructions? (Official manuals are not always to be trusted.) Those numbers determine how much parallelism is needed: typically, the number of independent instructions that must be in flight is the product of latency and throughput.
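To make the arithmetic concrete, here is a minimal sketch of that rule of thumb. The function name and the example figures are made up for illustration. They are not measured values for vncipher, vxor, or any particular POWER core.

```python
import math


def independent_ops_needed(latency_cycles, throughput_per_cycle):
    """Estimate how many independent instructions must be in flight.

    latency_cycles: cycles until a dependent instruction can start.
    throughput_per_cycle: independent instructions that can start per cycle.
    The product is rounded up, since a fractional instruction cannot be issued.
    """
    return math.ceil(latency_cycles * throughput_per_cycle)


# Hypothetical figures only: latency of 6 cycles, 2 instructions per cycle.
print(independent_ops_needed(6, 2))  # prints 12
```

With those assumed figures, 12 independent blocks would already keep the units busy, and going to 16 would gain nothing further. Real numbers would have to come from benchmarking the two kinds of loops described above.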
Regards,
/Niels