Re: [PATCH 4/4] Add AES [Enc|Dec] optimized implementations for PowerPC64

9 Jul 2020


Maamoun TK <maamoun.tk@googlemail.com> writes:
...
+L16x_round_loop:

lxvd2x KX,10,KEYS
vperm   K,K,K,swap_mask
vncipher S0,S0,ZERO
vncipher S1,S1,ZERO
vncipher S2,S2,ZERO
vncipher S3,S3,ZERO
vncipher S4,S4,ZERO
vncipher S5,S5,ZERO
vncipher S6,S6,ZERO
vncipher S7,S7,ZERO
vncipher S8,S8,ZERO
vncipher S9,S9,ZERO
vncipher S10,S10,ZERO
vncipher S11,S11,ZERO
vncipher S12,S12,ZERO
vncipher S13,S13,ZERO
vncipher S14,S14,ZERO
vncipher S15,S15,ZERO
vxor S0,S0,K
vxor S1,S1,K
vxor S2,S2,K
vxor S3,S3,K
vxor S4,S4,K
vxor S5,S5,K
vxor S6,S6,K
vxor S7,S7,K
vxor S8,S8,K
vxor S9,S9,K
vxor S10,S10,K
vxor S11,S11,K
vxor S12,S12,K
vxor S13,S13,K
vxor S14,S14,K
vxor S15,S15,K
addi 10,10,0x10
bdnz L16x_round_loop

Do you really need to go all the way to 16 blocks in parallel to
saturate the execution units? I'm used to defining throughput and
latency of an instruction (e.g., vncipher) as follows:

  Throughput: The number of independent vncipher instructions that can be
  executed per cycle. Can be measured by benchmarking a loop of
  independent instructions.

  Latency: The number of cycles from the start of execution of a vncipher
  instruction until execution of an instruction depending on the vncipher
  result can start. Can be measured by benchmarking a loop where each
  instruction depends on the result of the preceding instruction.

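As a rough illustration of the two measurement loops described above,
here is a minimal C sketch. All names in it are hypothetical, and the
integer multiply-add in op() is only a stand-in: a real measurement
would issue vncipher itself from assembly, and an optimizing compiler
may rearrange these loops.

```c
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for the instruction under test. */
static inline uint64_t op(uint64_t x)
{
    return x * 0x9E3779B97F4A7C15ULL + 1;
}

/* Latency-bound loop: every op consumes the previous result. */
uint64_t chain_dependent(uint64_t n)
{
    uint64_t a = 1;
    for (uint64_t i = 0; i < n; i++)
        a = op(a);
    return a;
}

/* Throughput-bound loop: four chains with no dependencies between
   them. n should be a multiple of 4 so both loops run n ops. */
uint64_t chain_independent(uint64_t n)
{
    uint64_t a = 1, b = 2, c = 3, d = 4;
    for (uint64_t i = 0; i < n; i += 4) {
        a = op(a);
        b = op(b);
        c = op(c);
        d = op(d);
    }
    return a ^ b ^ c ^ d;
}

/* CPU seconds per op for f(n), n > 0. The result of f is stored in
   *sink so the compiler cannot discard the loop. */
double seconds_per_op(uint64_t (*f)(uint64_t), uint64_t n, uint64_t *sink)
{
    clock_t t0 = clock();
    *sink = f(n);
    clock_t t1 = clock();
    return (double)(t1 - t0) / CLOCKS_PER_SEC / (double)n;
}
```

Calling seconds_per_op with chain_dependent and with chain_independent
for the same large n, then multiplying by the clock frequency, gives
estimates of cycles per op in the latency-bound and throughput-bound
cases respectively.
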
Do you know the throughput and latency of the vncipher and vxor
instructions? (Official manuals are not always to be trusted.) Those
numbers determine how much parallelism is needed: typically the product
of latency and throughput.

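To make the product rule concrete, here is a small C sketch. The
function name and the numbers in the usage note are hypothetical, not
measured POWER figures. Throughput is passed as a fraction
tp_num / tp_den (instructions per cycle) so that non-integer rates can
be expressed, and the result is rounded up.

```c
/* Number of independent instructions that must be in flight to keep
   the execution unit busy: ceil(latency * throughput), where
   throughput = tp_num / tp_den instructions per cycle (tp_den > 0). */
unsigned ops_in_flight(unsigned latency_cycles,
                       unsigned tp_num, unsigned tp_den)
{
    return (latency_cycles * tp_num + tp_den - 1) / tp_den;
}
```

For example, with an assumed latency of 6 cycles and an assumed
throughput of 2 per cycle, ops_in_flight(6, 2, 1) gives 12, which
would already be fewer than the 16 blocks in the loop above.
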
Regards,
/Niels
-- 
Niels Möller. PGP-encrypted email is preferred. Keyid 368C6677.
Internet email is subject to wholesale government surveillance.
